ufs-community / ufs-srweather-app

UFS Short-Range Weather Application

Add the remaining UFS Case Studies #1081

Closed: EdwardSnyder-NOAA closed this pull request 2 months ago

EdwardSnyder-NOAA commented 2 months ago

DESCRIPTION OF CHANGES:

This PR adds the remaining UFS Case Studies to the SRW App as WE2E tests. The new tests were added to the comprehensive and coverage files as well. Please note that Hurricane Michael's initial and boundary conditions are too old and do not contain enough data for the current physics suites, which results in a failure during the make_ics step.

The pre-existing UFS Case Study WE2E tests were modified to increase the compute resources for the get_extrn_lbcs task. Adding these resources cut the get_extrn_lbcs run time in half, to under two hours.
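In the test config YAML, that increase corresponds to requesting a second node for the get_extrn_lbcs task. A minimal sketch is below; only the nnodes value comes from this PR discussion, and the surrounding nesting is an assumption about the config layout.

# Sketch of the per-test resource override in
# tests/WE2E/test_configs/ufs_case_studies/<case>.yaml
# (nesting is illustrative; nnodes: 2 is the setting discussed in this PR)
rocoto:
  tasks:
    task_get_extrn_lbcs:
      nnodes: 2   # second node so fetching and un-tarring the boundary data finishes sooner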

These tests were run on PW AWS; the experiment directories can be found here: /contrib/Edward.Snyder/ufs-case-studies/all/expt_dirs

TESTS CONDUCTED:

2020 Easter Sunday Storm: The wind forecast with the FV3_GFS_v16 physics suite matches well with the RAP analysis used in the case study. (attached: 10mwind_conus_f072)

2019 Memorial Day Heat Wave: The temperature forecast with the FV3_GFS_v16 physics suite appears to have a reduced warm bias compared to the SRW_GFSv15p2 suite from the case study. (attached: 2mt_conus_f090)

2020 January Cold Blast: With the FV3_GFS_v16 physics suite, the cold bias appears reduced compared to the physics suites used in the case study. (attached: 2mt_conus_f072)

RatkoVasic-NOAA commented 2 months ago

@EdwardSnyder-NOAA all ufs_case_studies tests are failing on Hercules and Orion in the get_extrn_lbcs task. When we use partition=service in the job card, Hercules and Orion are limited to using only one node (and they refuse to submit the task otherwise). Since this job only fetches data (correct me if I'm wrong), there is no advantage to using more than one node. Do you recall any reason for putting nnodes: 2 in the tests/WE2E/test_configs/ufs_case_studies/*.yaml files? When I manually changed <nodes>2:ppn=24</nodes> to <nodes>1:ppn=24</nodes> in FV3LAM_wflow.xml, it worked OK.
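Because FV3LAM_wflow.xml is generated from the experiment configuration, the lasting equivalent of that manual XML edit would be dropping the node count back to one in the test config rather than editing the generated file. A hedged sketch, using the same assumed nesting as above:

# Reverting the override so the generated workflow renders <nodes>1:ppn=24</nodes>
rocoto:
  tasks:
    task_get_extrn_lbcs:
      nnodes: 1   # partition=service on Hercules/Orion allows only one node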

EdwardSnyder-NOAA commented 2 months ago

@RatkoVasic-NOAA - I added an extra node for the get_extrn_lbcs task so that the job finishes more quickly. These tests pull large tarred nemsio files from the AWS S3 bucket. The whole process of fetching and un-tarring the files takes up to 4 hours on a single node when running the full case experiment (forecast hour set to 90). Adding the extra node cuts the runtime in half. I wasn't aware of these compute limitations on the other Tier-1 platforms, so I can remove the extra node.

RatkoVasic-NOAA commented 2 months ago

> @RatkoVasic-NOAA - I added an extra node for the get_extrn_lbcs task so that the job finishes more quickly. These tests pull large tarred nemsio files from the AWS S3 bucket. The whole process of fetching and un-tarring the files takes up to 4 hours on a single node when running the full case experiment (forecast hour set to 90). Adding the extra node cuts the runtime in half. I wasn't aware of these compute limitations on the other Tier-1 platforms, so I can remove the extra node.

Great, I'm running just those tests on Orion and Hercules now.

RatkoVasic-NOAA commented 2 months ago

Selected tests passed on Orion:

Experiment name                                                  | Status    | Core hours used
----------------------------------------------------------------------------------------------------
2019_halloween_storm_20240513095113                                COMPLETE              70.05
2019_hurricane_barry_20240513095115                                COMPLETE              69.63
2019_hurricane_lorenzo_20240513095116                              COMPLETE              70.95
2019_memorial_day_heat_wave_20240513095117                         COMPLETE              67.03
2020_CAD_20240513095117                                            COMPLETE              68.03
2020_CAPE_20240513095118                                           COMPLETE              69.18
2020_denver_radiation_inversion_20240513095119                     COMPLETE              69.26
2020_easter_storm_20240513095120                                   COMPLETE              70.33
2020_jan_cold_blast_20240513095120                                 COMPLETE              72.68
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE             627.14

and Hercules:

Experiment name                                                  | Status    | Core hours used
----------------------------------------------------------------------------------------------------
2019_halloween_storm_20240513095157                                COMPLETE             403.56
2019_hurricane_barry_20240513095158                                COMPLETE             397.70
2019_hurricane_lorenzo_20240513095159                              COMPLETE              43.51
2019_memorial_day_heat_wave_20240513095200                         COMPLETE              41.52
2020_CAD_20240513095200                                            COMPLETE              43.03
2020_CAPE_20240513095201                                           COMPLETE             415.92
2020_denver_radiation_inversion_20240513095202                     COMPLETE              45.34
2020_easter_storm_20240513095202                                   COMPLETE              43.06
2020_jan_cold_blast_20240513095203                                 COMPLETE              44.40
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE            1478.04

Approving.

MichaelLueken commented 2 months ago

The get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_netcdf_2022060112_48h test on Jet failed in the make_lbcs task with the following error:

slurmstepd: error: Detected 1 oom_kill event in StepId=3675877.0. Some of the step tasks have been OOM Killed.
srun: error: s40: task 0: Out Of Memory
srun: Terminating StepId=3675877.0

Once Jet returns from maintenance, I will attempt to rerun the test. Once it passes, I will be able to move forward with merging this work.

RatkoVasic-NOAA commented 2 months ago

> Once Jet returns from maintenance, I will attempt to rerun the test. Once it passes, I will be able to move forward with merging this work.

HPSS is under maintenance as well, until 10 PM EDT.

MichaelLueken commented 2 months ago

The rerun of the get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_netcdf_2022060112_48h WE2E test on Jet passed this morning:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
community_20240513175643                                           COMPLETE              18.43
custom_ESGgrid_20240513175644                                      COMPLETE              26.57
custom_ESGgrid_Great_Lakes_snow_8km_20240513175645                 COMPLETE              20.27
custom_GFDLgrid_20240513175647                                     COMPLETE              11.34
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_nemsio_2021032018_202405  COMPLETE               8.85
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_netcdf_2022060112_48h_20  COMPLETE              84.74
get_from_HPSS_ics_RAP_lbcs_RAP_20240513175650                      COMPLETE              16.45
grid_RRFS_AK_3km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR_20240513175651  COMPLETE             606.36
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot_20  COMPLETE              66.45
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2_20240  COMPLETE               8.77
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta_2024  COMPLETE             926.06
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE            1794.29

With this, all of the coverage tests have successfully completed.

Moving forward with merging this PR now.