ufs-community / ufs-srweather-app

UFS Short-Range Weather Application

[develop] Bug fix to support the %H format in METplus via printf. #1102

Closed gsketefian closed 2 months ago

gsketefian commented 2 months ago

DESCRIPTION OF CHANGES:

This bug was encountered when verifying forecast output whose file names contain a 2-digit forecast hour. Specifying the METplus format %H to obtain a 2-digit forecast hour in the workflow/verification configuration variable FCST_FN_TEMPLATE (and others) causes an error in the shell script eval_METplus_timestr_tmpl.sh because bash's printf utility does not support the %H format. This PR fixes the error using an approach similar to the one already used for the %HHH format, which yields 3-digit hours.
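For readers unfamiliar with the failure mode, the following is a minimal sketch of the general idea, not the actual code in eval_METplus_timestr_tmpl.sh. The function name `format_lead_hours` and its argument layout are hypothetical. It assumes only what is stated above: %H should yield a 2-digit hour, %HHH a 3-digit hour, and bash's printf rejects %H, so the METplus format has to be translated into a printf format first.

```shell
#!/usr/bin/env bash

# Hypothetical helper (illustration only): translate a METplus-style hour
# format into a printf format and print the zero-padded forecast hour.
#   $1 - METplus hour format, either "%H" (2 digits) or "%HHH" (3 digits)
#   $2 - forecast hour as a non-negative integer, e.g. 6 or 08
format_lead_hours() {
  local metplus_fmt="$1"
  local hours="$2"
  local printf_fmt

  # bash's printf has no %H conversion, so map each supported METplus
  # format onto an equivalent zero-padded integer format.
  case "${metplus_fmt}" in
    "%H")   printf_fmt="%02d" ;;
    "%HHH") printf_fmt="%03d" ;;
    *)
      echo "Unsupported METplus hour format: ${metplus_fmt}" >&2
      return 1
      ;;
  esac

  # The 10# prefix forces base-10 arithmetic so inputs with leading zeros
  # such as "08" are not rejected as invalid octal numbers.
  printf "${printf_fmt}" "$((10#${hours}))"
}
```

With this sketch, `format_lead_hours "%H" 6` prints `06` and `format_lead_hours "%HHH" 6` prints `006`; any other format string produces an error message and a non-zero return code.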

TESTS CONDUCTED:

The full set of WE2E tests involving verification (vx) was run on Hera. These are:

MET_ensemble_verification
MET_ensemble_verification_only_vx
MET_ensemble_verification_only_vx_time_lag
MET_ensemble_verification_winter_wx
MET_verification
MET_verification_only_vx
MET_verification_winter_wx

All passed.

DEPENDENCIES:

None needed.

CONTRIBUTORS (optional):

@willmayfield and @michelleharrold encountered this bug, and @mkavulich pinpointed the script it was originating from.

gsketefian commented 2 months ago

@MichaelLueken I am running the vx WE2E tests on this now.

gsketefian commented 2 months ago

@MichaelLueken All the WE2E vx tests passed, and I noted that in the PR message. Thanks.

MichaelLueken commented 2 months ago

The Jenkins WE2E coverage tests successfully passed on all machines, with the exception of Jet, where the testing phase was aborted for running longer than 8 hours. Before the tests were aborted, there were two failures - custom_ESGgrid and custom_ESGgrid_Great_Lakes_snow_8km.

Both failures appear to be due to Slurm/Node issues on the machine.

Tasks being allocated nodes and hanging until the walltime has passed:

slurmstepd: error: *** STEP 6229197.0 ON x625 CANCELLED AT 2024-07-11T04:00:24 DUE TO TIME LIMIT ***
slurmstepd: error: *** JOB 6229197 ON x625 CANCELLED AT 2024-07-11T04:00:24 DUE TO TIME LIMIT ***

Tasks not launching properly:

srun: error: timeout waiting for task launch, started 96 of 108 tasks
srun: StepId=6229110.0 aborted before step completely launched.
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 6229110.0 ON x3 CANCELLED AT 2024-07-11T05:22:07 ***

The WE2E coverage tests were also launched manually on Jet yesterday. Manual runs have no time-outs for test stages, so the tests ran through to completion. Three tests ultimately failed with the above errors. Once the failed tests have been resubmitted and pass, this PR will be merged.

gsketefian commented 2 months ago

@MichaelLueken Thanks for the update.

MichaelLueken commented 2 months ago

The manual runs of the WE2E coverage tests successfully passed on Jet:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used
----------------------------------------------------------------------------------------------------
community_20240710185903                                           COMPLETE              17.92
custom_ESGgrid_20240710185904                                      COMPLETE             155.97
custom_ESGgrid_Great_Lakes_snow_8km_20240710185905                 COMPLETE              22.16
custom_GFDLgrid_20240710185907                                     COMPLETE               9.95
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_nemsio_2021032018_202407  COMPLETE               9.29
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_netcdf_2022060112_48h_20  COMPLETE              87.54
get_from_HPSS_ics_RAP_lbcs_RAP_20240710185909                      COMPLETE              16.11
grid_RRFS_AK_3km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR_20240710185910  COMPLETE             615.42
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot_20  COMPLETE              64.71
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2_20240  COMPLETE               6.93
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta_2024  COMPLETE             930.56
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE            1936.56

Moving forward with merging this work now.