@EdwardSnyder-NOAA all ufs_case_studies tests are failing on Hercules and Orion in the get_extrn_lbcs task. When we use partition=service in the job card, Hercules and Orion are limited to using only 1 node (and they refuse to submit the task otherwise).
Since this job is only getting data (correct me if I'm wrong), we have no advantage in using more than one node. Do you recall any reason for putting nnodes: 2 in the tests/WE2E/test_configs/ufs_case_studies/*.yaml files?
When I manually changed FV3LAM_wflow.xml from <nodes>2:ppn=24</nodes> to <nodes>1:ppn=24</nodes>, it worked OK.
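The manual edit described above could be scripted roughly as follows. This is only a sketch: the helper name cap_nodes is hypothetical, and it assumes the node request appears in the workflow XML exactly in the <nodes>N:ppn=P</nodes> form quoted in this comment. It caps every such request in the text it is given, so in practice it should be applied only to the get_extrn_lbcs task block.

```python
import re

# Matches node requests of the form quoted above, e.g. <nodes>2:ppn=24</nodes>.
NODES_RE = re.compile(r"<nodes>(\d+):ppn=(\d+)</nodes>")


def cap_nodes(xml_text: str, max_nodes: int = 1) -> str:
    """Rewrite each <nodes>N:ppn=P</nodes> so that N does not exceed max_nodes.

    Hypothetical helper for illustration; it is not part of the SRW App.
    """
    def _cap(match: re.Match) -> str:
        nodes = min(int(match.group(1)), max_nodes)
        return f"<nodes>{nodes}:ppn={match.group(2)}</nodes>"

    return NODES_RE.sub(_cap, xml_text)


if __name__ == "__main__":
    print(cap_nodes("<nodes>2:ppn=24</nodes>"))
```

A request that is already within the limit is left unchanged, so the helper is safe to run more than once on the same text.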
@RatkoVasic-NOAA - I added an extra node for the get_extrn_lbcs task, so that the job will finish quicker. These tests pull large tar nemsio files from the AWS S3 bucket. The whole process of fetching and un-tarring files takes up to 4 hours on a single node when running the full case experiment (forecast hour set to 90). Adding an additional node cuts the runtime in half. I wasn't aware of these compute limitations on the other T1 platforms, so I can remove the extra node.
Great, I'm running just those tests on Orion and Hercules now.
Selected tests passed on Orion:
Experiment name | Status | Core hours used
----------------------------------------------------------------------------------------------------
2019_halloween_storm_20240513095113 COMPLETE 70.05
2019_hurricane_barry_20240513095115 COMPLETE 69.63
2019_hurricane_lorenzo_20240513095116 COMPLETE 70.95
2019_memorial_day_heat_wave_20240513095117 COMPLETE 67.03
2020_CAD_20240513095117 COMPLETE 68.03
2020_CAPE_20240513095118 COMPLETE 69.18
2020_denver_radiation_inversion_20240513095119 COMPLETE 69.26
2020_easter_storm_20240513095120 COMPLETE 70.33
2020_jan_cold_blast_20240513095120 COMPLETE 72.68
----------------------------------------------------------------------------------------------------
Total COMPLETE 627.14
and Hercules:
Experiment name | Status | Core hours used
----------------------------------------------------------------------------------------------------
2019_halloween_storm_20240513095157 COMPLETE 403.56
2019_hurricane_barry_20240513095158 COMPLETE 397.70
2019_hurricane_lorenzo_20240513095159 COMPLETE 43.51
2019_memorial_day_heat_wave_20240513095200 COMPLETE 41.52
2020_CAD_20240513095200 COMPLETE 43.03
2020_CAPE_20240513095201 COMPLETE 415.92
2020_denver_radiation_inversion_20240513095202 COMPLETE 45.34
2020_easter_storm_20240513095202 COMPLETE 43.06
2020_jan_cold_blast_20240513095203 COMPLETE 44.40
----------------------------------------------------------------------------------------------------
Total COMPLETE 1478.04
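For anyone who wants to cross-check the totals in summaries like the ones above, here is a minimal sketch that parses the rows and re-adds the core hours. The function name parse_summary and the regular expression are assumptions made for illustration; the sample text is the Orion table from this comment.

```python
import re

# One data row: experiment name, status, core hours (e.g. "2020_CAD_... COMPLETE 68.03").
ROW_RE = re.compile(r"^(\S+)\s+(\S+)\s+(\d+\.\d+)$")


def parse_summary(text: str):
    """Return ({experiment: (status, core_hours)}, reported_total) from a summary table."""
    rows = {}
    reported_total = None
    for line in text.splitlines():
        match = ROW_RE.match(line.strip())
        if not match:
            continue  # skips the header and the dashed separator lines
        name, status, hours = match.group(1), match.group(2), float(match.group(3))
        if name == "Total":
            reported_total = hours
        else:
            rows[name] = (status, hours)
    return rows, reported_total


ORION_SUMMARY = """\
Experiment name | Status | Core hours used
2019_halloween_storm_20240513095113 COMPLETE 70.05
2019_hurricane_barry_20240513095115 COMPLETE 69.63
2019_hurricane_lorenzo_20240513095116 COMPLETE 70.95
2019_memorial_day_heat_wave_20240513095117 COMPLETE 67.03
2020_CAD_20240513095117 COMPLETE 68.03
2020_CAPE_20240513095118 COMPLETE 69.18
2020_denver_radiation_inversion_20240513095119 COMPLETE 69.26
2020_easter_storm_20240513095120 COMPLETE 70.33
2020_jan_cold_blast_20240513095120 COMPLETE 72.68
Total COMPLETE 627.14
"""

if __name__ == "__main__":
    rows, reported_total = parse_summary(ORION_SUMMARY)
    computed_total = sum(hours for _, hours in rows.values())
    all_complete = all(status == "COMPLETE" for status, _ in rows.values())
    print(f"{len(rows)} experiments, all complete: {all_complete}")
    print(f"computed total {computed_total:.2f}, reported total {reported_total:.2f}")
```

The same function can be pointed at the Hercules table to confirm its reported total in the same way.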
Approving.
The get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_netcdf_2022060112_48h test on Jet failed. The test failed in the make_lbcs task with the following error:
slurmstepd: error: Detected 1 oom_kill event in StepId=3675877.0. Some of the step tasks have been OOM Killed.
srun: error: s40: task 0: Out Of Memory
srun: Terminating StepId=3675877.0
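A failure of this kind can be spotted automatically by scanning the task log for the out-of-memory messages shown above. The sketch below is an illustration only: the function name find_oom_lines is hypothetical, and the patterns are taken solely from the three log lines quoted in this comment, so other Slurm versions or sites may word these messages differently.

```python
import re

# Patterns taken from the log excerpt above; other systems may phrase these differently.
OOM_RE = re.compile(r"oom_kill event|Out Of Memory", re.IGNORECASE)


def find_oom_lines(log_text: str) -> list[str]:
    """Return the log lines that report an out-of-memory kill."""
    return [line for line in log_text.splitlines() if OOM_RE.search(line)]


SAMPLE_LOG = """\
slurmstepd: error: Detected 1 oom_kill event in StepId=3675877.0. Some of the step tasks have been OOM Killed.
srun: error: s40: task 0: Out Of Memory
srun: Terminating StepId=3675877.0
"""

if __name__ == "__main__":
    for line in find_oom_lines(SAMPLE_LOG):
        print(line)
```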
Once Jet returns from maintenance, I will attempt to rerun the test. Once it passes, I will be able to move forward with merging this work.
HPSS is under maintenance as well, until 10 PM EDT.
The rerun of the get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_netcdf_2022060112_48h WE2E test on Jet this morning successfully passed:
----------------------------------------------------------------------------------------------------
Experiment name | Status | Core hours used
----------------------------------------------------------------------------------------------------
community_20240513175643 COMPLETE 18.43
custom_ESGgrid_20240513175644 COMPLETE 26.57
custom_ESGgrid_Great_Lakes_snow_8km_20240513175645 COMPLETE 20.27
custom_GFDLgrid_20240513175647 COMPLETE 11.34
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_nemsio_2021032018_202405 COMPLETE 8.85
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_netcdf_2022060112_48h_20 COMPLETE 84.74
get_from_HPSS_ics_RAP_lbcs_RAP_20240513175650 COMPLETE 16.45
grid_RRFS_AK_3km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR_20240513175651 COMPLETE 606.36
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot_20 COMPLETE 66.45
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2_20240 COMPLETE 8.77
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta_2024 COMPLETE 926.06
----------------------------------------------------------------------------------------------------
Total COMPLETE 1794.29
With this, all of the coverage tests have successfully completed.
Moving forward with merging this PR now.
DESCRIPTION OF CHANGES:
This PR adds the remaining UFS Case Studies to the SRW App as WE2E tests. These new tests were added to the comprehensive and coverage files as well. Please note that Hurricane Michael's initial and boundary conditions are too old and do not contain enough data to be used by the current physics suites, resulting in a failure during the make_ics step.
The pre-existing UFS Case Study WE2E tests were modified to account for an increase in compute resources for the get_extrn_lbcs step. Adding these resources cut the get_extrn_lbcs run time in half, to under two hours.
These tests ran on PW AWS and can be found here: /contrib/Edward.Snyder/ufs-case-studies/all/expt_dirs
Type of change
TESTS CONDUCTED:
2020 Easter Sunday Storm: Wind forecast with the FV3_GFS_v16 physics suite matches well with the RAP analysis used in the case study.
2019 Memorial Day Heat Wave: The temperature forecast with the FV3_GFS_v16 physics suite looks to have a reduced warm bias compared to the SRW_GFSv15p2 suite from the case study.
2020 January Cold Blast: Using the FV3_GFS_v16 physics suite, it appears the cold bias was reduced compared to the physics suites used in the case study.
DEPENDENCIES:
DOCUMENTATION:
ISSUE:
CHECKLIST
LABELS (optional):
A Code Manager needs to add the following labels to this PR:
CONTRIBUTORS (optional):