Closed MichaelLueken closed 2 months ago
@MichaelLueken, thanks for this change:
Calculating core-hour usage and printing final summary
----------------------------------------------------------------------------------------------------
Experiment name | Status | Core hours used
----------------------------------------------------------------------------------------------------
aqm_grid_AQM_NA13km_suite_GFS_v16_20240419100352 COMPLETE 4829.55
----------------------------------------------------------------------------------------------------
Total COMPLETE 4829.55
Approving.
@MichaelLueken It would probably be a good idea to keep what these were meant to do, which is, in bash scripts, check whether the cp, mv, etc functions were successful. These functions originally returned an error message and stopped execution, so that every time one needs to use say cp, one doesn't have to put in a line like this:
cp file1 file2 || {print "Copy failed"; exit 1}
or something longer, and instead use
cp_vrfy file1 file2
I saw in issue #861 that:
For example, if you want to verify a file was copied with cp_vrfy, but that file does not exist, a message will be printed out, but the exit code will be set to 0, and the program will happily hum along with no failure.
Obviously, this wasn't the original intent or behavior of these functions, but this erroneous behavior was introduced somewhere along the line due to insufficient testing. So instead of just removing these, what should be done is restore the original functionality of detecting errors and exiting.
@gsketefian, as far as I know (if my memory serves me correctly), the bash commands like cp
exit with an error message when they fail (while the current _vrfy
function does not exit; this was a big issue in my previous runs). For delivery to NCO, the -p
flag should be added to the copy command like (cp -p) though. So, I think cp file1 file2
is good enough.
@@.*** I think I put in the “_vrfy” commands a long time ago because I wanted to have clearer and standardized error messages for these commands. The fact that they don’t actually exit is a bug that was introduced later, probably during the partial conversion to python. It would be great to restore that functionality to get more comprehensive error messages as before. You can still add the -p to “cp_vrfy”.
On Friday, April 19, 2024, Chan-Hoo.Jeon-NOAA @.***> wrote:
@gsketefian https://github.com/gsketefian, as far as I know (if my memory serves me correctly), the bash commands like cp exit with an error message when they fail (while the current _vrfy function does not exit; this was a big issue in my previous runs). For delivery to NCO, the -p flag should be added to the copy command like (cp -p) though. So, I think cp file1 file2 is good enough.
— Reply to this email directly, view it on GitHub https://github.com/ufs-community/ufs-srweather-app/pull/1074#issuecomment-2067042900, or unsubscribe https://github.com/notifications/unsubscribe-auth/AHM3ZYTNU3RHYBTMGJOCWHTY6FLA3AVCNFSM6AAAAABGPFNOFSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDANRXGA2DEOJQGA . You are receiving this because you were mentioned.Message ID: @.***>
-- Gerard Ketefian Research Scientist NOAA/OAR/ESRL/GSD/EMB, R/GSD1 325 Broadway Boulder, CO 80305 phone: 970-703-3048
@gsketefian -
Thanks for expressing your concern on the removal of _vrfy
from the bash commands. With this PR, I was simply following through with @mkavulich's and @chan-hoo's recommendation of fully removing the _vrfy
. As @chan-hoo noted, The standard bash commands should exit automatically on a failure.
Would it be worthwhile to surround cd
, cp
, ln
, mkdir
, mv
, and rm
commands with set -x
and set +x
, so that they should error out if an issue is encountered? I'm open to more opinions and suggestions, but I don't think that we want to go back to using _vrfy
once again.
After thinking about this a bit more, it seems to me surrounding (wrapping) those commands with set -e would be the most important thing to do. I’m away from my computer and can’t check, but probably the reason an error with the cp command causes scripts to exit (as Chan-Hoo mentioned) is that those scripts have set -e declared at the top (I think default bash behavior is not to exit a script if a command fails). So if you use the wrapped versions of these commands that include set -e, then you’ll be sure they will exit on error regardless of whether the calling script uses set -e.
Just my two cents. Please proceed as you see fit Thanks.
On Friday, April 19, 2024, Michael Lueken @.***> wrote:
@gsketefian https://github.com/gsketefian -
Thanks for expressing your concern on the removal of _vrfy from the bash commands. With this PR, I was simply following through with @mkavulich https://github.com/mkavulich's and @chan-hoo https://github.com/chan-hoo's recommendation of fully removing the _vrfy. As @chan-hoo https://github.com/chan-hoo noted, The standard bash commands should exit automatically on a failure.
Would it be worthwhile to surround cd, cp, ln, mkdir, mv, and rm commands with set -x and set +x, so that they should error out if an issue is encountered? I'm open to more opinions and suggestions, but I don't think that we want to go back to using _vrfy once again.
— Reply to this email directly, view it on GitHub https://github.com/ufs-community/ufs-srweather-app/pull/1074#issuecomment-2067134397, or unsubscribe https://github.com/notifications/unsubscribe-auth/AHM3ZYVBRLPZRL6IORANMO3Y6FS55AVCNFSM6AAAAABGPFNOFSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDANRXGEZTIMZZG4 . You are receiving this because you were mentioned.Message ID: @.***>
-- Gerard Ketefian Research Scientist NOAA/OAR/ESRL/GSD/EMB, R/GSD1 325 Broadway Boulder, CO 80305 phone: 970-703-3048
@gsketefian Since it was my issue originally advocating for removal of these scripts I'll provide a defense: I don't see a reason to carry this burden and overhead of custom wrappers around simple shell builtins like this, and as I say in the original issue, they introduce additional problems! Unless I'm missing something, they provide no additional value over just using the -e
flag within exscripts (which is something we should think about doing!), and just add further things into the workflow that need maintenance, can cause issues, and (as they are doing now) can mask other issues.
For the record, here is an example of a failure using the shell builtin mv
:
mv: cannot stat 'fire.nml': No such file or directory
FATAL ERROR:
ERROR:
From script: "JREGIONAL_RUN_FCST"
Full path to script: "/glade/derecho/scratch/kavulich/FIRE/2024/april_updates/ufs-srweather-app/jobs/JREGIONAL_RUN_FCST"
Call to ex-script corresponding to J-job "JREGIONAL_RUN_FCST" failed.
Exiting with nonzero status.
And here is with the custom mv_vrfy
, after it is "fixed" by adding set -e
:
"mv_vrfy" operation returned with a message. This command was
issued from the script in file:
""
Message from "mv_vrfy" function's "mv" operation:
mv: cannot stat 'fire.nml': No such file or directory
========================================================================
Exiting script: "JREGIONAL_RUN_FCST"
In directory: "/glade/derecho/scratch/kavulich/FIRE/2024/april_updates/ufs-srweather-app/jobs"
========================================================================
That is nothing but a whole bunch of unhelpful extra text around the same error message. It seems as if it was meant to echo the script it was called from, but again, it does not do this! And that information would be mostly redundant anyway, since exscripts all print a message when they start.
I did not realize until this PR that set -e
is not turned on in all scripts; regardless of the outcome of this discussion we should see if we need to do this as well. I did notice that certain commands (mv as one example) will fail even with the -e
flag not set.
Is this a good time to mention that converting everything to python would make this problem moot? 😉
All scripts load the ush/preamble.sh
script. Inside the ush/preamble.sh script, the default behavior is to use set -euo pipefail
. No other changes should be required, as set -euo
is already on by default for all scripts.
Addressing issue related to failures with the grid_SUBCONUS_Ind_3km_ics_FV3GFS_lbcs_FV3GFS_suite_WoFS_v0
metric test, then will push updates and resubmit Jenkins tests.
After correcting the grid_SUBCONUS_Ind_3km_ics_FV3GFS_lbcs_FV3GFS_suite_WoFS_v0
test, I have resubmitted the Jenkins tests.
Gaea, Jet, and Orion are currently not available via Jenkins, so I will go ahead and manually run the WE2E coverage tests on those machines.
The WE2E tests were manually run on Orion and all tests successfully passed:
----------------------------------------------------------------------------------------------------
Experiment name | Status | Core hours used
----------------------------------------------------------------------------------------------------
custom_ESGgrid_SF_1p1km_20240429084427 COMPLETE 437.36
deactivate_tasks_20240429084428 COMPLETE 0.93
get_from_AWS_ics_GEFS_lbcs_GEFS_fmt_grib2_2022040400_ensemble_2me COMPLETE 1878.41
grid_CONUS_3km_GFDLgrid_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta_ COMPLETE 1006.63
grid_RRFS_AK_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot_20240 COMPLETE 374.77
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_RRFS_v1beta_202404290 COMPLETE 21.40
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR_20240429084 COMPLETE 971.98
grid_RRFS_CONUScompact_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_ COMPLETE 64.03
grid_RRFS_CONUScompact_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_2 COMPLETE 728.04
grid_SUBCONUS_Ind_3km_ics_FV3GFS_lbcs_FV3GFS_suite_WoFS_v0_202404 COMPLETE 46.58
2020_CAD_20240429084435 COMPLETE 71.58
----------------------------------------------------------------------------------------------------
Total COMPLETE 5601.71
Additionally, the metric test was also ran and successfully completed:
COL_NAME: FCST_MODEL REF_MODEL N_INIT N_TERM N_VLD SS_INDEX
GO_INDEX FV3_WoFS_v0_SUBCONUS_3km_test_mem000 FV3_GFS_v16_SUBCONUS_3km 1 17 17 0.99807
The Jet WE2E coverage tests were manually run and all passed successfully:
----------------------------------------------------------------------------------------------------
Experiment name | Status | Core hours used
----------------------------------------------------------------------------------------------------
community_20240429155928 COMPLETE 17.51
custom_ESGgrid_20240429155929 COMPLETE 25.79
custom_ESGgrid_Great_Lakes_snow_8km_20240429155930 COMPLETE 20.44
custom_GFDLgrid_20240429155932 COMPLETE 10.79
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_nemsio_2021032018_202404 COMPLETE 8.00
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_netcdf_2022060112_48h_20 COMPLETE 84.78
get_from_HPSS_ics_RAP_lbcs_RAP_20240429155935 COMPLETE 15.93
grid_RRFS_AK_3km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR_20240429155936 COMPLETE 617.36
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot_20 COMPLETE 64.04
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2_20240 COMPLETE 7.36
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta_2024 COMPLETE 927.52
----------------------------------------------------------------------------------------------------
Total COMPLETE 1799.52
Awaiting results from Gaea and Hera GNU now.
All Jenkins tests have successfully passed and the Gaea WE2E coverage tests were manually run and all successfully passed:
----------------------------------------------------------------------------------------------------
Experiment name | Status | Core hours used
----------------------------------------------------------------------------------------------------
community_20240429144427 COMPLETE 46.66
custom_ESGgrid_NewZealand_3km_20240429144429 COMPLETE 113.58
grid_RRFS_CONUScompact_13km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta_2 COMPLETE 45.98
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RAP_20240429144 COMPLETE 50.24
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR_2024042914 COMPLETE 49.07
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15_thompson COMPLETE 589.88
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR_2024042 COMPLETE 38.81
grid_RRFS_CONUScompact_3km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta_20 COMPLETE 810.44
grid_SUBCONUS_Ind_3km_ics_RAP_lbcs_RAP_suite_RRFS_v1beta_plot_202 COMPLETE 21.65
2020_CAPE_20240429144455 COMPLETE 50.25
----------------------------------------------------------------------------------------------------
Total COMPLETE 1816.56
Merging this work now.
DESCRIPTION OF CHANGES:
The weather model hash has been updated to 4f32a4b (April 15).
Additionally,
_vrfy
has been removed from thecd
,cp
,ln
,mkdir
,mv
, andrm
bash commands injobs
,scripts
,ush
, andush/bash_utils
. The modified commands don't function as intended (issue #861) and aren't accepted by NCO (issue #1021).Type of change
TESTS CONDUCTED:
The fundamental tests were run on all tier-1 platforms. Comprehensive tests were run on Hera Intel, Orion, and Hercules. The AQM WE2E test was run on Hera Intel and the sample warm start AQM configuration was run on Hercules.
DEPENDENCIES:
None
DOCUMENTATION:
None
ISSUE:
Fixes #861 Fixes #1021
CHECKLIST