ufs-community / ufs-srweather-app

UFS Short-Range Weather Application

[develop] Jet switch from CentOS to Rocky #1045

Closed by RatkoVasic-NOAA 5 months ago

RatkoVasic-NOAA commented 6 months ago

DESCRIPTION OF CHANGES:

Jet is switching from CentOS to Rocky OS.

ISSUE:

Solves issue #1044

MichaelLueken commented 5 months ago

The fundamental tests were also successfully run on Jet using CentOS:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used
----------------------------------------------------------------------------------------------------
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta_2  COMPLETE               9.10
nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_  COMPLETE              15.51
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2_20240  COMPLETE               8.28
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8_plot  COMPLETE              16.07
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR_2024022  COMPLETE              27.90
grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_RAP_suite_WoFS_v0_20240229203  COMPLETE              21.75
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16_2024022920365  COMPLETE              21.27
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE             119.88

EdwardSnyder-NOAA commented 5 months ago

Built the SRW App on Rocky 8 using the changes from this PR and ensured the changes worked by running this case: /lfs4/HFIP/hfv3gfs/Edward.Snyder/PR_1045/expt_dirs/grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2

natalie-perlin commented 5 months ago

Fundamental tests ran successfully on Jet (xjet):

All 7 experiments finished
Calculating core-hour usage and printing final summary
----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta_2  COMPLETE               9.90
nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_  COMPLETE              13.67
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2_20240  COMPLETE               7.12
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8_plot  COMPLETE              16.18
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR_2024030  COMPLETE              30.38
grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_RAP_suite_WoFS_v0_20240301215  COMPLETE              22.14
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16_2024030121531  COMPLETE              22.77
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE             122.16

Detailed summary written to /mnt/lfs4/HFIP/hfv3gfs/Natalie.Perlin/SRW/expt_dirs/WE2E_summary_20240301223112.txt
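
The summary above can be sanity-checked by re-adding the per-experiment core hours and comparing them with the reported total. The following is a minimal sketch, not part of the SRW App tooling: the function names `parse_summary` and `check_total` are hypothetical, and the parser assumes each data row has the form name, status, core hours, as in the table above.

```python
import re

# A data row looks like: "<experiment name>  COMPLETE   9.90"
# (three whitespace-separated fields, the last one a decimal number).
ROW = re.compile(r"^(\S+)\s+(\S+)\s+(\d+\.\d+)\s*$")

# Rows copied from the summary above (column padding shortened).
SAMPLE = """\
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta_2  COMPLETE  9.90
nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_  COMPLETE  13.67
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2_20240  COMPLETE  7.12
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8_plot  COMPLETE  16.18
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR_2024030  COMPLETE  30.38
grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_RAP_suite_WoFS_v0_20240301215  COMPLETE  22.14
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16_2024030121531  COMPLETE  22.77
----------------------------------------------------------------------------------
Total                                                              COMPLETE  122.16
"""


def parse_summary(text):
    """Return (rows, total): rows are (name, status, hours) tuples,
    total is (status, hours) from the 'Total' line, or None if absent."""
    rows, total = [], None
    for line in text.splitlines():
        match = ROW.match(line)
        if not match:
            continue  # rulers, the header line, and free text are skipped
        name, status, hours = match.group(1), match.group(2), float(match.group(3))
        if name == "Total":
            total = (status, hours)
        else:
            rows.append((name, status, hours))
    return rows, total


def check_total(text, tol=0.01):
    """True if every experiment is COMPLETE and the hours add up to the total."""
    rows, total = parse_summary(text)
    computed = round(sum(hours for _, _, hours in rows), 2)
    all_complete = all(status == "COMPLETE" for _, status, _ in rows)
    return all_complete and total is not None and abs(computed - total[1]) <= tol
```

Calling `check_total(SAMPLE)` re-adds the seven rows and compares the result with the `Total` line; the same function could be pointed at the text of a summary file.
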

MichaelLueken commented 5 months ago

The Hera Jenkins tests failed due to the system coming down yesterday for maintenance. These tests have been requeued.

There was also a failure on Jet. The get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_netcdf_2022060112_48h test failed in make_lbcs with an OOM error. Using rocotorewind/rocotoboot allowed this test to pass:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used
----------------------------------------------------------------------------------------------------
community_20240304152101                                           COMPLETE              21.59
custom_ESGgrid_20240304152102                                      COMPLETE              18.35
custom_ESGgrid_Great_Lakes_snow_8km_20240304152104                 COMPLETE              13.40
custom_GFDLgrid_20240304152106                                     COMPLETE               9.45
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_nemsio_2021032018_202403  COMPLETE              10.26
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_netcdf_2022060112_48h_20  COMPLETE              49.66
get_from_HPSS_ics_RAP_lbcs_RAP_20240304152110                      COMPLETE              15.30
grid_RRFS_AK_3km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR_20240304152111  COMPLETE             222.35
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot_20  COMPLETE              43.97
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2_20240  COMPLETE               9.64
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta_2024  COMPLETE             533.34
nco_grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR_2024  COMPLETE              10.62
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE             957.93
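
For reference, the rocotorewind/rocotoboot recovery mentioned above can be sketched as a pair of command lines. This is an illustration under stated assumptions, not the exact commands that were run: it assumes Rocoto's usual `-w` (workflow XML), `-d` (database), `-c` (cycle) and `-t` (task) options, and the file names, cycle string and task name below are placeholders. The helper `rocoto_retry_commands` is hypothetical and only builds the argument lists; it does not execute anything.

```python
import shlex


def rocoto_retry_commands(workflow_xml, database, cycle, task):
    """Build the rewind-then-boot command pair for one failed task.

    rocotorewind clears the task's recorded state for the cycle;
    rocotoboot then forces the task to be submitted again.
    """
    common = ["-w", workflow_xml, "-d", database, "-c", cycle, "-t", task]
    return [["rocotorewind", *common], ["rocotoboot", *common]]


if __name__ == "__main__":
    # Placeholder values: adjust to the experiment directory in question.
    commands = rocoto_retry_commands(
        workflow_xml="FV3LAM_wflow.xml",  # assumed workflow file name
        database="FV3LAM_wflow.db",       # assumed database file name
        cycle="202206011200",             # assumed cycle, format YYYYMMDDHHMM
        task="make_lbcs",                 # task named in the comment above
    )
    for cmd in commands:
        print(shlex.join(cmd))
```

The printed lines can be reviewed and then run from the experiment directory; the actual task name in a given workflow may carry a suffix, so it should be checked against the workflow's task list first.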

Once the Hera tests complete, this PR can be merged.

MichaelLueken commented 5 months ago

The Hera Intel tests were run on Rocky8 and all tests passed:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
custom_ESGgrid_Peru_12km_20240308143348                            COMPLETE              18.07
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_grib2_2019061200_2024030  COMPLETE               6.05
get_from_HPSS_ics_GDAS_lbcs_GDAS_fmt_netcdf_2022040400_ensemble_2  COMPLETE             766.89
get_from_HPSS_ics_HRRR_lbcs_RAP_20240308143351                     COMPLETE              14.39
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2_20240  COMPLETE               5.96
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot_20  COMPLETE              12.73
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_RAP_20240308143354  COMPLETE              10.19
grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v15p2_20240  COMPLETE               6.22
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2_202403  COMPLETE             235.54
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_20240308  COMPLETE             313.52
grid_RRFS_CONUScompact_3km_ics_HRRR_lbcs_RAP_suite_HRRR_202403081  COMPLETE             328.98
pregen_grid_orog_sfc_climo_20240308143359                          COMPLETE               7.09
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE            1725.63

MichaelLueken commented 5 months ago

@RatkoVasic-NOAA -

Unfortunately, while running the WE2E tests with Rocky8 on Hera GNU, the issue that you noted during the UFS apps and components coordination meeting showed up: all tests are failing because srun cannot find libpmi.so.0 and libpmi2.so.0.
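
A quick way to confirm this kind of failure on a node is to ask the dynamic loader directly whether it can open the libraries. The snippet below is a minimal diagnostic sketch, not part of the SRW App; the helper name `find_missing_libs` is hypothetical, and it only reports whether the current environment (for example, the current `LD_LIBRARY_PATH`) can resolve the given names.

```python
import ctypes


def find_missing_libs(names, loader=ctypes.CDLL):
    """Return the shared-library names that the dynamic loader cannot open."""
    missing = []
    for name in names:
        try:
            loader(name)  # raises OSError if the library cannot be found/loaded
        except OSError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # The two libraries named in the comment above.
    for lib in find_missing_libs(["libpmi.so.0", "libpmi2.so.0"]):
        print(f"cannot load {lib}")
```

Running it inside the same module environment as the failing job shows whether the libraries are simply not on the loader's search path there.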

We will need to hope that the tests are able to run over the weekend on CentOS and no longer sit in the queue.

MichaelLueken commented 5 months ago

@RatkoVasic-NOAA -

Given that the Hera GNU tests have been sitting in the queue for days and that Hera GNU cannot be run on Rocky8, the successful runs on Hera Intel and the rest of the platforms will be enough to get this work merged.

Since Rocky8 will be the default package of the nodes following today's update, I will go ahead and set the spack-stack path to point at the rocky8 location and change the ush/machine/jet.yaml file to use xJet for the forecast tasks. Once Jet is returned to service, Kris Booker and I will check that the Jet runner is using one of the Rocky8 front ends, then I will run the Jet tests one last time. Once those are complete, this PR will be merged.

MichaelLueken commented 5 months ago

The rerun of the Jenkins tests on Jet had one failure, grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2. The run_fcst task was failing with:

FATAL from PE 1: compute_qs: saturation vapor pressure table overflow, nbad= 1

None of the changes made in this PR would cause this issue. Using rocotorewind/rocotoboot allowed the failed task to pass:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
community_20240312211355                                           COMPLETE              19.64
custom_ESGgrid_20240312211357                                      COMPLETE              18.79
custom_ESGgrid_Great_Lakes_snow_8km_20240312211358                 COMPLETE              14.27
custom_GFDLgrid_20240312211400                                     COMPLETE              10.02
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_nemsio_2021032018_202403  COMPLETE              11.20
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_netcdf_2022060112_48h_20  COMPLETE              57.17
get_from_HPSS_ics_RAP_lbcs_RAP_20240312211404                      COMPLETE              17.22
grid_RRFS_AK_3km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR_20240312211405  COMPLETE             223.35
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot_20  COMPLETE              40.85
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2_20240  COMPLETE               7.38
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta_2024  COMPLETE             496.47
nco_grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR_2024  COMPLETE              10.68
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE             927.04

Moving forward with merging this PR now.