Closed gsketefian closed 2 months ago
@MichaelLueken I am running the vx WE2E tests on this now.
@MichaelLueken All the WE2E vx tests passed, and I noted that in the PR message. Thanks.
The Jenkins WE2E coverage tests successfully passed on all machines, with the exception of Jet, where the testing phase was aborted for running longer than 8 hours. Before the tests were aborted, there were two failures: custom_ESGgrid and custom_ESGgrid_Great_Lakes_snow_8km.
Both failures appear to be due to Slurm/Node issues on the machine.
Tasks being allocated nodes and hanging until the walltime has passed:
slurmstepd: error: *** STEP 6229197.0 ON x625 CANCELLED AT 2024-07-11T04:00:24 DUE TO TIME LIMIT ***
slurmstepd: error: *** JOB 6229197 ON x625 CANCELLED AT 2024-07-11T04:00:24 DUE TO TIME LIMIT ***
Tasks not launching properly:
srun: error: timeout waiting for task launch, started 96 of 108 tasks
srun: StepId=6229110.0 aborted before step completely launched.
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 6229110.0 ON x3 CANCELLED AT 2024-07-11T05:22:07 ***
The WE2E coverage tests were also launched manually on Jet yesterday. Manual runs have no time-outs on test stages, so the tests ran through to completion. Three tests ultimately failed with the above errors. Once the resubmitted tests pass, this PR will be merged.
@MichaelLueken Thanks for the update.
The manual runs of the WE2E coverage tests successfully passed on Jet:
----------------------------------------------------------------------------------------------------
Experiment name | Status | Core hours used
----------------------------------------------------------------------------------------------------
community_20240710185903 COMPLETE 17.92
custom_ESGgrid_20240710185904 COMPLETE 155.97
custom_ESGgrid_Great_Lakes_snow_8km_20240710185905 COMPLETE 22.16
custom_GFDLgrid_20240710185907 COMPLETE 9.95
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_nemsio_2021032018_202407 COMPLETE 9.29
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_netcdf_2022060112_48h_20 COMPLETE 87.54
get_from_HPSS_ics_RAP_lbcs_RAP_20240710185909 COMPLETE 16.11
grid_RRFS_AK_3km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR_20240710185910 COMPLETE 615.42
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot_20 COMPLETE 64.71
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2_20240 COMPLETE 6.93
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta_2024 COMPLETE 930.56
----------------------------------------------------------------------------------------------------
Total COMPLETE 1936.56
Moving forward with merging this work now.
DESCRIPTION OF CHANGES:
This bug was encountered when verifying forecast output that has a 2-digit forecast hour in its name. It turns out that specifying the METplus format %H to obtain a 2-digit forecast hour in the workflow/verification configuration variable FCST_FN_TEMPLATE (and others) causes an error in the shell script eval_METplus_timestr_tmpl.sh because bash's printf utility does not support the %H format. This PR fixes that error using an approach similar to the one used for the %HHH format for obtaining 3-digit hours.
Type of change
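To illustrate the fix described under DESCRIPTION OF CHANGES, here is a minimal sketch of the idea. It is not the actual code in eval_METplus_timestr_tmpl.sh. The function name format_lead_hours and its argument order are hypothetical. Because bash's printf rejects %H as an invalid format character, the METplus hour format is translated to an equivalent zero-padded integer format before printf is called.

```shell
#!/usr/bin/env bash

# Hypothetical helper (illustration only, not the SRW App implementation):
# print a forecast hour zero-padded according to a METplus hour format.
#   $1 - METplus hour format, "%H" (2 digits) or "%HHH" (3 digits)
#   $2 - forecast hour as a non-negative integer, e.g. 6 or 08
format_lead_hours() {
  local metplus_fmt="$1"
  local hours="$2"
  local printf_fmt

  # Map the METplus format to one that bash's printf understands.
  case "${metplus_fmt}" in
    "%H")   printf_fmt="%02d" ;;
    "%HHH") printf_fmt="%03d" ;;
    *)
      echo "Unsupported hour format: ${metplus_fmt}" >&2
      return 1
      ;;
  esac

  # Force base 10 so that inputs such as "08" are not parsed as octal.
  printf "${printf_fmt}\n" "$((10#${hours}))"
}
```

Under these assumptions, `format_lead_hours "%H" 6` prints 06 and `format_lead_hours "%HHH" 6` prints 006.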
TESTS CONDUCTED:
The full set of WE2E tests involving vx were run on Hera. These are:
All passed.
DEPENDENCIES:
None needed.
CHECKLIST
LABELS (optional):
A Code Manager needs to add the following labels to this PR:
CONTRIBUTORS (optional):
@willmayfield and @michelleharrold encountered this bug, and @mkavulich pinpointed the script it was originating from.