ufs-community / ufs-srweather-app

UFS Short-Range Weather Application
Other
53 stars 114 forks source link

[develop] Update weather model hash and remove "_vrfy" from bash commands #1074

Closed MichaelLueken closed 2 months ago

MichaelLueken commented 2 months ago

DESCRIPTION OF CHANGES:

The weather model hash has been updated to 4f32a4b (April 15).

Additionally, _vrfy has been removed from the cd, cp, ln, mkdir, mv, and rm bash commands in jobs, scripts, ush, and ush/bash_utils. The modified commands don't function as intended (issue #861) and aren't accepted by NCO (issue #1021).

Type of change

TESTS CONDUCTED:

The fundamental tests were run on all tier-1 platforms. Comprehensive tests were run on Hera Intel, Orion, and Hercules. The AQM WE2E test was run on Hera Intel and the sample warm start AQM configuration was run on Hercules.

DEPENDENCIES:

None

DOCUMENTATION:

None

ISSUE:

Fixes #861 Fixes #1021

CHECKLIST

chan-hoo commented 2 months ago

@MichaelLueken, thanks for this change:

Calculating core-hour usage and printing final summary
----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used
----------------------------------------------------------------------------------------------------
aqm_grid_AQM_NA13km_suite_GFS_v16_20240419100352                   COMPLETE            4829.55
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE            4829.55

Approving.

gsketefian commented 2 months ago

@MichaelLueken It would probably be a good idea to keep what these were meant to do, which is, in bash scripts, check whether the cp, mv, etc functions were successful. These functions originally returned an error message and stopped execution, so that every time one needs to use say cp, one doesn't have to put in a line like this:

cp file1 file2 || {print "Copy failed"; exit 1}

or something longer, and instead use

cp_vrfy file1 file2

I saw in issue #861 that:

For example, if you want to verify a file was copied with cp_vrfy, but that file does not exist, a message will be printed out, but the exit code will be set to 0, and the program will happily hum along with no failure. 

Obviously, this wasn't the original intent or behavior of these functions, but this erroneous behavior was introduced somewhere along the line due to insufficient testing. So instead of just removing these, what should be done is restore the original functionality of detecting errors and exiting.

chan-hoo commented 2 months ago

@gsketefian, as far as I know (if my memory serves me correctly), the bash commands like cp exit with an error message when they fail (while the current _vrfy function does not exit; this was a big issue in my previous runs). For delivery to NCO, the -p flag should be added to the copy command like (cp -p) though. So, I think cp file1 file2 is good enough.

gsketefian commented 2 months ago

@@.*** I think I put in the “_vrfy” commands a long time ago because I wanted to have clearer and standardized error messages for these commands. The fact that they don’t actually exit is a bug that was introduced later, probably during the partial conversion to python. It would be great to restore that functionality to get more comprehensive error messages as before. You can still add the -p to “cp_vrfy”.

On Friday, April 19, 2024, Chan-Hoo.Jeon-NOAA @.***> wrote:

@gsketefian https://github.com/gsketefian, as far as I know (if my memory serves me correctly), the bash commands like cp exit with an error message when they fail (while the current _vrfy function does not exit; this was a big issue in my previous runs). For delivery to NCO, the -p flag should be added to the copy command like (cp -p) though. So, I think cp file1 file2 is good enough.

— Reply to this email directly, view it on GitHub https://github.com/ufs-community/ufs-srweather-app/pull/1074#issuecomment-2067042900, or unsubscribe https://github.com/notifications/unsubscribe-auth/AHM3ZYTNU3RHYBTMGJOCWHTY6FLA3AVCNFSM6AAAAABGPFNOFSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDANRXGA2DEOJQGA . You are receiving this because you were mentioned.Message ID: @.***>

-- Gerard Ketefian Research Scientist NOAA/OAR/ESRL/GSD/EMB, R/GSD1 325 Broadway Boulder, CO 80305 phone: 970-703-3048

MichaelLueken commented 2 months ago

@gsketefian -

Thanks for expressing your concern on the removal of _vrfyfrom the bash commands. With this PR, I was simply following through with @mkavulich's and @chan-hoo's recommendation of fully removing the _vrfy. As @chan-hoo noted, The standard bash commands should exit automatically on a failure.

Would it be worthwhile to surround cd, cp, ln, mkdir, mv, and rm commands with set -x and set +x, so that they should error out if an issue is encountered? I'm open to more opinions and suggestions, but I don't think that we want to go back to using _vrfy once again.

gsketefian commented 2 months ago

After thinking about this a bit more, it seems to me surrounding (wrapping) those commands with set -e would be the most important thing to do. I’m away from my computer and can’t check, but probably the reason an error with the cp command causes scripts to exit (as Chan-Hoo mentioned) is that those scripts have set -e declared at the top (I think default bash behavior is not to exit a script if a command fails). So if you use the wrapped versions of these commands that include set -e, then you’ll be sure they will exit on error regardless of whether the calling script uses set -e.

Just my two cents. Please proceed as you see fit Thanks.

On Friday, April 19, 2024, Michael Lueken @.***> wrote:

@gsketefian https://github.com/gsketefian -

Thanks for expressing your concern on the removal of _vrfy from the bash commands. With this PR, I was simply following through with @mkavulich https://github.com/mkavulich's and @chan-hoo https://github.com/chan-hoo's recommendation of fully removing the _vrfy. As @chan-hoo https://github.com/chan-hoo noted, The standard bash commands should exit automatically on a failure.

Would it be worthwhile to surround cd, cp, ln, mkdir, mv, and rm commands with set -x and set +x, so that they should error out if an issue is encountered? I'm open to more opinions and suggestions, but I don't think that we want to go back to using _vrfy once again.

— Reply to this email directly, view it on GitHub https://github.com/ufs-community/ufs-srweather-app/pull/1074#issuecomment-2067134397, or unsubscribe https://github.com/notifications/unsubscribe-auth/AHM3ZYVBRLPZRL6IORANMO3Y6FS55AVCNFSM6AAAAABGPFNOFSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDANRXGEZTIMZZG4 . You are receiving this because you were mentioned.Message ID: @.***>

-- Gerard Ketefian Research Scientist NOAA/OAR/ESRL/GSD/EMB, R/GSD1 325 Broadway Boulder, CO 80305 phone: 970-703-3048

mkavulich commented 2 months ago

@gsketefian Since it was my issue originally advocating for removal of these scripts I'll provide a defense: I don't see a reason to carry this burden and overhead of custom wrappers around simple shell builtins like this, and as I say in the original issue, they introduce additional problems! Unless I'm missing something, they provide no additional value over just using the -e flag within exscripts (which is something we should think about doing!), and just add further things into the workflow that need maintenance, can cause issues, and (as they are doing now) can mask other issues.

For the record, here is an example of a failure using the shell builtin mv:


mv: cannot stat 'fire.nml': No such file or directory
FATAL ERROR:
ERROR:
  From script:  "JREGIONAL_RUN_FCST"
  Full path to script:  "/glade/derecho/scratch/kavulich/FIRE/2024/april_updates/ufs-srweather-app/jobs/JREGIONAL_RUN_FCST"
Call to ex-script corresponding to J-job "JREGIONAL_RUN_FCST" failed.
Exiting with nonzero status.

And here is with the custom mv_vrfy, after it is "fixed" by adding set -e:


"mv_vrfy" operation returned with a message.  This command was
issued from the script in file:

  ""

Message from "mv_vrfy" function's "mv" operation:
  mv: cannot stat 'fire.nml': No such file or directory

========================================================================
Exiting script:  "JREGIONAL_RUN_FCST"
In directory:    "/glade/derecho/scratch/kavulich/FIRE/2024/april_updates/ufs-srweather-app/jobs"
========================================================================

That is nothing but a whole bunch of unhelpful extra text around the same error message. It seems as if it was meant to echo the script it was called from, but again, it does not do this! And that information would be mostly redundant anyway, since exscripts all print a message when they start.

I did not realize until this PR that set -e is not turned on in all scripts; regardless of the outcome of this discussion we should see if we need to do this as well. I did notice that certain commands (mv as one example) will fail even with the -e flag not set.

Is this a good time to mention that converting everything to python would make this problem moot? 😉

MichaelLueken commented 2 months ago

All scripts load the ush/preamble.sh script. Inside the ush/preamble.sh script, the default behavior is to use set -euo pipefail. No other changes should be required, as set -euo is already on by default for all scripts.

Addressing issue related to failures with the grid_SUBCONUS_Ind_3km_ics_FV3GFS_lbcs_FV3GFS_suite_WoFS_v0 metric test, then will push updates and resubmit Jenkins tests.

MichaelLueken commented 2 months ago

After correcting the grid_SUBCONUS_Ind_3km_ics_FV3GFS_lbcs_FV3GFS_suite_WoFS_v0 test, I have resubmitted the Jenkins tests.

Gaea, Jet, and Orion are currently not available via Jenkins, so I will go ahead and manually run the WE2E coverage tests on those machines.

MichaelLueken commented 2 months ago

The WE2E tests were manually run on Orion and all tests successfully passed:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
custom_ESGgrid_SF_1p1km_20240429084427                             COMPLETE             437.36
deactivate_tasks_20240429084428                                    COMPLETE               0.93
get_from_AWS_ics_GEFS_lbcs_GEFS_fmt_grib2_2022040400_ensemble_2me  COMPLETE            1878.41
grid_CONUS_3km_GFDLgrid_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta_  COMPLETE            1006.63
grid_RRFS_AK_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot_20240  COMPLETE             374.77
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_RRFS_v1beta_202404290  COMPLETE              21.40
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR_20240429084  COMPLETE             971.98
grid_RRFS_CONUScompact_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_  COMPLETE              64.03
grid_RRFS_CONUScompact_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_2  COMPLETE             728.04
grid_SUBCONUS_Ind_3km_ics_FV3GFS_lbcs_FV3GFS_suite_WoFS_v0_202404  COMPLETE              46.58
2020_CAD_20240429084435                                            COMPLETE              71.58
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE            5601.71

Additionally, the metric test was also ran and successfully completed:

COL_NAME:                           FCST_MODEL                REF_MODEL N_INIT N_TERM N_VLD SS_INDEX
 GO_INDEX FV3_WoFS_v0_SUBCONUS_3km_test_mem000 FV3_GFS_v16_SUBCONUS_3km      1     17    17  0.99807
MichaelLueken commented 2 months ago

The Jet WE2E coverage tests were manually run and all passed successfully:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
community_20240429155928                                           COMPLETE              17.51
custom_ESGgrid_20240429155929                                      COMPLETE              25.79
custom_ESGgrid_Great_Lakes_snow_8km_20240429155930                 COMPLETE              20.44
custom_GFDLgrid_20240429155932                                     COMPLETE              10.79
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_nemsio_2021032018_202404  COMPLETE               8.00
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_netcdf_2022060112_48h_20  COMPLETE              84.78
get_from_HPSS_ics_RAP_lbcs_RAP_20240429155935                      COMPLETE              15.93
grid_RRFS_AK_3km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR_20240429155936  COMPLETE             617.36
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot_20  COMPLETE              64.04
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2_20240  COMPLETE               7.36
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta_2024  COMPLETE             927.52
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE            1799.52

Awaiting results from Gaea and Hera GNU now.

MichaelLueken commented 2 months ago

All Jenkins tests have successfully passed and the Gaea WE2E coverage tests were manually run and all successfully passed:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
community_20240429144427                                           COMPLETE              46.66
custom_ESGgrid_NewZealand_3km_20240429144429                       COMPLETE             113.58
grid_RRFS_CONUScompact_13km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta_2  COMPLETE              45.98
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RAP_20240429144  COMPLETE              50.24
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR_2024042914  COMPLETE              49.07
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15_thompson  COMPLETE             589.88
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR_2024042  COMPLETE              38.81
grid_RRFS_CONUScompact_3km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta_20  COMPLETE             810.44
grid_SUBCONUS_Ind_3km_ics_RAP_lbcs_RAP_suite_RRFS_v1beta_plot_202  COMPLETE              21.65
2020_CAPE_20240429144455                                           COMPLETE              50.25
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE            1816.56

Merging this work now.