ufs-community / ufs-srweather-app

UFS Short-Range Weather Application

[develop] Update for Gaea-c5 #1047

Closed natalie-perlin closed 5 months ago

natalie-perlin commented 6 months ago

DESCRIPTION OF CHANGES:

To resolve a library conflict involving libstdc++.so.6, a specific copy of the library is preloaded at runtime, as set in ./modulefiles/wflow_gaea.lua and ./modulefiles/tasks/gaea/python_srw.lua:

setenv("LD_PRELOAD", "/opt/cray/pe/gcc/12.2.0/snos/lib64/libstdc++.so.6")
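For anyone launching a tool outside the modulefile environment, the same preload can be approximated when building a subprocess environment. This is only a sketch: the helper name `env_with_preload` is hypothetical and is not part of the SRW App. Unlike the `setenv` call above, which overwrites the variable, it keeps any entries already present in LD_PRELOAD.

```python
# Path copied from the modulefile line above.
PRELOAD_LIB = "/opt/cray/pe/gcc/12.2.0/snos/lib64/libstdc++.so.6"


def env_with_preload(base_env, lib=PRELOAD_LIB):
    """Return a copy of base_env with lib listed first in LD_PRELOAD.

    Hypothetical helper for illustration only.
    """
    env = dict(base_env)
    # The dynamic linker accepts a colon- or space-separated list.
    current = env.get("LD_PRELOAD", "").replace(" ", ":").split(":")
    # Drop empty entries and any earlier copy of the same library.
    others = [p for p in current if p and p != lib]
    env["LD_PRELOAD"] = ":".join([lib] + others)
    return env
```

The result can be passed as the `env` argument of `subprocess.run`, for example `env=env_with_preload(os.environ)`.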

Type of change

TESTS CONDUCTED:

Ran the fundamental tests on Gaea C5; all pass.

DEPENDENCIES:

DOCUMENTATION:

ISSUE:

https://github.com/ufs-community/ufs-srweather-app/issues/991

CHECKLIST

LABELS (optional):

A Code Manager needs to add the following labels to this PR:

CONTRIBUTORS (optional):

ADDITIONAL NOTES:

A summary after running the fundamental test suite:

All 7 experiments finished
Calculating core-hour usage and printing final summary
----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta_2  COMPLETE              17.93
nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_  COMPLETE              25.51
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2_20240  COMPLETE              12.12
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8_plot  COMPLETE              25.37
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR_2024022  COMPLETE              30.31
grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_RAP_suite_WoFS_v0_20240226220  COMPLETE              31.79
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16_2024022622035  COMPLETE              45.33
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE             188.36

Detailed summary written to /gpfs/f5/epic/scratch/Natalie.Perlin/SRW/expt_dirs/WE2E_summary_20240226231918.txt
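
As a sanity check on summaries like the one above, the per-experiment core hours can be summed and compared with the reported total. The parser below is a minimal sketch that assumes the three-column layout shown here (name, status, core hours); `parse_summary` is a hypothetical helper, not part of the WE2E scripts.

```python
SUMMARY = """\
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta_2  COMPLETE              17.93
nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_  COMPLETE              25.51
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2_20240  COMPLETE              12.12
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8_plot  COMPLETE              25.37
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR_2024022  COMPLETE              30.31
grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_RAP_suite_WoFS_v0_20240226220  COMPLETE              31.79
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16_2024022622035  COMPLETE              45.33
Total                                                              COMPLETE             188.36
"""


def parse_summary(text):
    """Return ([(name, status, hours), ...], reported_total)."""
    rows, total = [], None
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skips separators, the header row, and blank lines
        name, status, hours = parts
        try:
            value = float(hours)
        except ValueError:
            continue  # third column is not a number
        if name == "Total":
            total = value
        else:
            rows.append((name, status, value))
    return rows, total


rows, total = parse_summary(SUMMARY)
computed = sum(hours for _, _, hours in rows)
# Allow for floating-point rounding when comparing two-decimal values.
print(len(rows), abs(computed - total) < 0.005)
```

Rows are kept in a list rather than a dictionary because experiment names are truncated in the table and could collide.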

The comprehensive tests also pass. The log file WE2E_tests_20240227100902.yaml is attached as WE2E_tests_20240227100902.yaml.txt.

natalie-perlin commented 6 months ago

All the comprehensive tests pass on Gaea -

All 63 experiments finished
Calculating core-hour usage and printing final summary

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
2020_CAD_20240227100652                                            COMPLETE              32.92
community_20240227100655                                           COMPLETE              42.97
custom_ESGgrid_20240227100657                                      COMPLETE              13.16
custom_ESGgrid_Central_Asia_3km_20240227100658                     COMPLETE              32.93
custom_ESGgrid_IndianOcean_6km_20240227100700                      COMPLETE              22.37
custom_ESGgrid_NewZealand_3km_20240227100702                       COMPLETE              47.09
custom_ESGgrid_Peru_12km_20240227100703                            COMPLETE              21.82
custom_ESGgrid_SF_1p1km_20240227100705                             COMPLETE             166.96
custom_GFDLgrid__GFDLgrid_USE_NUM_CELLS_IN_FILENAMES_eq_FALSE_202  COMPLETE               9.32
custom_GFDLgrid_20240227100708                                     COMPLETE               8.34
deactivate_tasks_20240227100710                                    COMPLETE               0.85
get_from_AWS_ics_GEFS_lbcs_GEFS_fmt_grib2_2022040400_ensemble_2me  COMPLETE             689.14
get_from_NOMADS_ics_FV3GFS_lbcs_FV3GFS_20240227100713              COMPLETE              21.15
grid_CONUS_25km_GFDLgrid_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_202  COMPLETE              14.99
grid_CONUS_3km_GFDLgrid_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta_  COMPLETE             241.83
grid_RRFS_AK_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot_20240  COMPLETE             129.31
grid_RRFS_AK_3km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR_20240227100721  COMPLETE             167.76
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RAP_20240227100  COMPLETE              30.20
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot_20  COMPLETE              33.87
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR_2024022710  COMPLETE              29.77
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta_202  COMPLETE              30.24
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2_20240  COMPLETE              11.41
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot_20  COMPLETE              25.24
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8_plot  COMPLETE              25.02
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR_2024022710  COMPLETE              38.42
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_RAP_20240227100  COMPLETE              70.09
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta_202  COMPLETE              40.73
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_RAP_20240227100741  COMPLETE              18.57
grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v15p2_20240  COMPLETE              11.38
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16_2024022710074  COMPLETE              45.38
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_RRFS_v1beta_202402271  COMPLETE              36.33
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2_202402  COMPLETE             234.79
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15_thompson  COMPLETE             312.53
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_20240227  COMPLETE             323.93
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR_20240227100  COMPLETE             356.07
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta_2024  COMPLETE             363.14
grid_RRFS_CONUScompact_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_  COMPLETE              30.57
grid_RRFS_CONUScompact_13km_ics_HRRR_lbcs_RAP_suite_HRRR_20240227  COMPLETE              27.70
grid_RRFS_CONUScompact_13km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta_2  COMPLETE              26.48
grid_RRFS_CONUScompact_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_  COMPLETE              18.91
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR_2024022  COMPLETE              30.78
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta_2  COMPLETE              17.99
grid_RRFS_CONUScompact_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_2  COMPLETE             259.37
grid_RRFS_CONUScompact_3km_ics_HRRR_lbcs_RAP_suite_HRRR_202402271  COMPLETE             277.35
grid_RRFS_CONUScompact_3km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta_20  COMPLETE             280.78
grid_RRFS_NA_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RAP_20240227100814  COMPLETE              75.78
grid_SUBCONUS_Ind_3km_ics_FV3GFS_lbcs_FV3GFS_suite_WoFS_v0_202402  COMPLETE              32.47
grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_HRRR_suite_HRRR_2024022710081  COMPLETE              39.80
grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_RAP_suite_WoFS_v0_20240227100  COMPLETE              32.00
grid_SUBCONUS_Ind_3km_ics_NAM_lbcs_NAM_suite_GFS_v16_202402271008  COMPLETE              51.82
grid_SUBCONUS_Ind_3km_ics_RAP_lbcs_RAP_suite_RRFS_v1beta_plot_202  COMPLETE              16.51
MET_ensemble_verification_only_vx_20240227100826                   COMPLETE               1.02
MET_ensemble_verification_winter_wx_20240227100830                 COMPLETE             201.20
MET_verification_only_vx_20240227100833                            COMPLETE               0.21
nco_20240227100837                                                 COMPLETE              21.30
nco_ensemble_20240227100840                                        COMPLETE             100.75
nco_grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_202  COMPLETE              30.83
nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_  COMPLETE              25.31
nco_grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15_thom  COMPLETE             314.11

natalie-perlin commented 6 months ago

@kbooker79 @jkbk2004 - please see @MichaelLueken's comment above

As the SRW tests and Jenkins tests are currently written, the SRW App requires its platform name to match the Jenkins name, which is "gaeac5". The .cicd/Jenkinsfile and .cicd/scripts/*.sh files depend on the SRW_PLATFORM entry, which is the Jenkins label.

As we do not have any other "gaea" platform, and there is no need to differentiate between Gaea C4 ("gaea") and Gaea C5 ("gaeac5") for Jenkins tests, could the platform label in Jenkins be changed to just "gaea"?

MichaelLueken commented 6 months ago

@kbooker79 and @jkbk2004 -

The SRW App's .cicd/Jenkinsfile and .cicd/scripts/*.sh tests rely heavily on env.SRW_PLATFORM in Jenkins. To ensure that we don't encroach on Orion/Hercules, all workspaces require an additional dir ("${env.SRW_PLATFORM}") step in the stages. In the srw_build and srw_test scripts, SRW_PLATFORM is passed to tests/build.sh (which uses it to choose the build modulefile) and to tests/WE2E/setup_WE2E_tests.py (which uses it to choose the wflow modulefile and task modulefiles). Moving to gaea will require a rework of the entire Jenkinsfile and Jenkins test scripts, which goes beyond the scope of this PR.
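
To illustrate why the platform string matters, here is a rough sketch of how one platform name could fan out into modulefile paths. The wflow and task paths follow the files named in the PR description; the `build_<platform>_<compiler>.lua` pattern and the helper `modulefiles_for` are assumptions made for illustration and are not taken from the actual scripts.

```python
from pathlib import PurePosixPath


def modulefiles_for(platform, compiler="intel", task="python_srw"):
    """Map a platform name to the modulefiles it would select.

    Hypothetical helper; the build file naming is an assumption.
    """
    root = PurePosixPath("modulefiles")
    return {
        "build": root / f"build_{platform}_{compiler}.lua",
        "wflow": root / f"wflow_{platform}.lua",
        "task": root / "tasks" / platform / f"{task}.lua",
    }


for name in ("gaea", "gaeac5"):
    print(name, modulefiles_for(name)["wflow"])
```

With files named for gaea, a Jenkins label of "gaeac5" would resolve to paths that do not exist, which is why the label and the modulefile names have to agree or be translated somewhere in the scripts.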

kbooker79 commented 6 months ago

@natalie-perlin, I suppose we can change the Jenkins platform label to "gaea", but we'll have to run some tests with the MRW (ufs-weather-model) pipelines to ensure that everything still works.

MichaelLueken commented 6 months ago

@natalie-perlin -

I'll try to make changes to the .cicd/scripts/srw_build.sh, wrapper_srw_ftest.sh, and srw_test.sh scripts so that they work with gaea. I'll let you know how this turns out, and then we can move forward from there.

MichaelLueken commented 6 months ago

@natalie-perlin -

I was able to have the SRW App Jenkins scripts set the Gaea C5 platform as gaea, allowing the current build, ftest, and test scripts to run on Gaea C5 using gaea modulefiles and entries. I have opened PR #10 in your fork with these significantly reduced changes. Once they have been merged, I will approve this PR.

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
community_20240227152426                                           COMPLETE              42.69
custom_ESGgrid_NewZealand_3km_20240227152428                       COMPLETE              48.03
grid_RRFS_CONUScompact_13km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta_2  COMPLETE              26.79
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RAP_20240227152  COMPLETE              29.81
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR_2024022715  COMPLETE              30.68
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15_thompson  COMPLETE             314.00
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR_2024022  COMPLETE              29.91
grid_RRFS_CONUScompact_3km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta_20  COMPLETE             278.04
grid_SUBCONUS_Ind_3km_ics_RAP_lbcs_RAP_suite_RRFS_v1beta_plot_202  COMPLETE              16.51
nco_ensemble_20240227152441                                        COMPLETE              96.21
nco_grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15_thom  COMPLETE             312.42
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE            1225.09

MichaelLueken commented 5 months ago

The Jenkins tests have successfully passed on Derecho, Hera GNU, Hercules, and Orion.

On Jet, the custom_ESGgrid_Great_Lakes_snow_8km and get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_netcdf tests failed. After rerunning them with rocotorewind/rocotoboot, these tests have successfully passed:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used
----------------------------------------------------------------------------------------------------
community_20240228075903                                           COMPLETE              19.44
custom_ESGgrid_20240228075908                                      COMPLETE              19.78
custom_ESGgrid_Great_Lakes_snow_8km_20240228075912                 COMPLETE              15.96
custom_GFDLgrid_20240228075917                                     COMPLETE              10.92
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_nemsio_2021032018_202402  COMPLETE              13.67
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_netcdf_2022060112_48h_20  COMPLETE              61.07
get_from_HPSS_ics_RAP_lbcs_RAP_20240228075928                      COMPLETE              18.88
grid_RRFS_AK_3km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR_20240228075930  COMPLETE             245.92
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot_20  COMPLETE              41.53
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2_20240  COMPLETE               9.24
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta_2024  COMPLETE             549.41
nco_grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR_2024  COMPLETE              10.41
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE            1016.23

On Hera Intel, the grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2 test failed. The test is being rerun using rocotorewind/rocotoboot. With no allocation on the machine currently, it will likely take all day for this test to successfully complete.
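
For reference, the rewind-and-boot step used for these reruns can be scripted roughly as below. This sketch only builds the command lines. The workflow file names are assumed defaults, and the task name and cycle in the example are hypothetical placeholders rather than the ones used in these reruns.

```python
def rerun_commands(task, cycle, xml="FV3LAM_wflow.xml", db="FV3LAM_wflow.db"):
    """Build rocotorewind and rocotoboot commands for one task and cycle.

    xml and db are assumed file names; adjust them to the experiment directory.
    """
    common = ["-w", xml, "-d", db, "-c", cycle, "-t", task]
    # Rewind clears the failed task's state, then boot resubmits it.
    return [["rocotorewind"] + common, ["rocotoboot"] + common]


# Hypothetical task name and cycle (YYYYMMDDHHMM), for illustration only.
for cmd in rerun_commands("run_fcst_mem000", "201907010000"):
    print(" ".join(cmd))
```

Each list can be handed to `subprocess.run(cmd, check=True)` from inside the experiment directory.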

@kbooker79 was able to restart the Jenkins runner on Gaea C5 and the Jenkins tests have successfully cloned the external repositories and have moved onto the Build stage. I will let you know if there are any issues on the machine.

MichaelLueken commented 5 months ago

Using the Rocky8 nodes on Hera, the rerun of the grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2 test completed successfully and very quickly:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
custom_ESGgrid_Peru_12km_20240228130055                            COMPLETE              17.47
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_grib2_2019061200_2024022  COMPLETE               5.96
get_from_HPSS_ics_GDAS_lbcs_GDAS_fmt_netcdf_2022040400_ensemble_2  COMPLETE             787.75
get_from_HPSS_ics_HRRR_lbcs_RAP_20240228130059                     COMPLETE              14.00
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2_20240  COMPLETE               7.74
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot_20  COMPLETE              13.19
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_RAP_20240228130103  COMPLETE               9.91
grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v15p2_20240  COMPLETE               6.59
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2_202402  COMPLETE             232.70
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_20240228  COMPLETE             304.39
grid_RRFS_CONUScompact_3km_ics_HRRR_lbcs_RAP_suite_HRRR_202402281  COMPLETE             325.62
pregen_grid_orog_sfc_climo_20240228130110                          COMPLETE               8.28
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE            1733.60

Once the Gaea tests complete, I will move forward with merging this work (the Build stage has successfully completed and the Functional Workflow Task Tests stage is now running).

MichaelLueken commented 5 months ago

The Gaea C5 tests successfully completed:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
community_20240229103008                                           COMPLETE              43.17
custom_ESGgrid_NewZealand_3km_20240229103010                       COMPLETE              47.87
grid_RRFS_CONUScompact_13km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta_2  COMPLETE              26.67
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RAP_20240229103  COMPLETE              28.79
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR_2024022910  COMPLETE              29.17
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15_thompson  COMPLETE             315.80
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR_2024022  COMPLETE              30.42
grid_RRFS_CONUScompact_3km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta_20  COMPLETE             277.45
grid_SUBCONUS_Ind_3km_ics_RAP_lbcs_RAP_suite_RRFS_v1beta_plot_202  COMPLETE              16.48
nco_ensemble_20240229103023                                        COMPLETE              95.93
nco_grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15_thom  COMPLETE             319.81
2020_CAPE_20240229103029                                           COMPLETE              36.08
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE            1267.64

When the Gaea tests were queued, the Hera tests were queued as well. The Hera tests were aborted.

Moving forward with merging this PR now.