ufs-community / ufs-srweather-app

UFS Short-Range Weather Application

[develop] Fixes for PW Jenkins Nightly Builds #1091

Open EdwardSnyder-NOAA opened 1 month ago

EdwardSnyder-NOAA commented 1 month ago

DESCRIPTION OF CHANGES:

This PR adds logic to handle GCP's default conda env, which conflicts with the SRW App's conda env, and fixes a Parallel Works naming convention bug in the script.

It also addresses a known issue with a Ruby warning on PW instances that prevents run_WE2E_tests.py from exiting gracefully. The solution we use in our bootstrap for /contrib doesn't seem to work for the /lustre directory, which is why the warning is hardcoded into the monitor_jobs.py script.
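As an illustration of what hardcoding a known warning into a monitoring script can look like, here is a minimal sketch that drops stderr lines matching a known benign pattern before deciding whether a command failed. It is not the code in monitor_jobs.py: the function name `real_errors` is hypothetical, and the pattern is a placeholder because the PR description does not quote the actual warning text.

```python
import re

# Placeholder pattern (assumption): the PR does not quote the real Ruby warning.
BENIGN_STDERR_PATTERNS = [re.compile(r"warning: placeholder benign message")]


def real_errors(stderr_text, benign_patterns=BENIGN_STDERR_PATTERNS):
    """Return the stderr lines that do not match any known benign pattern."""
    kept = []
    for line in stderr_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if any(p.search(line) for p in benign_patterns):
            continue  # known warning: do not treat it as a failure
        kept.append(line)
    return kept
```

A caller would treat the command as failed only when `real_errors(...)` returns a non-empty list, so the known warning alone no longer blocks a clean exit.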

The new spack-stack build on Azure is missing a GNU library, so this PR adds the path to the missing library to the proper run scripts and cleans up the wflow noaacloud lua file.

Removed the log and error files from the qsub wrapper script so that qsub can generate these files with the job ID in the file names. Also fixed a typo in the wrapper script.

Type of change

TESTS CONDUCTED:

DEPENDENCIES:

DOCUMENTATION:

None.

ISSUE:

CHECKLIST

LABELS (optional):

A Code Manager needs to add the following labels to this PR:

CONTRIBUTORS (optional):

@kbooker79, @BruceKropp-Raytheon

EdwardSnyder-NOAA commented 4 weeks ago

This PR passed on AWS using the Jenkins nightly job.

RatkoVasic-NOAA commented 3 weeks ago

When I ran comprehensive tests on Hera, I got one failing test:

Experiment name                                                  | Status    | Core hours used
----------------------------------------------------------------------------------------------------
2020_CAD_20240619182721                                            COMPLETE              52.22
2020_CAPE_20240619182722                                           COMPLETE              51.67
2019_hurricane_barry_20240619182723                                COMPLETE              48.98
2019_halloween_storm_20240619182724                                COMPLETE              51.43
2019_hurricane_lorenzo_20240619182724                              COMPLETE              52.12
2019_memorial_day_heat_wave_20240619182725                         COMPLETE              49.11
2020_denver_radiation_inversion_20240619182726                     COMPLETE              50.98
2020_easter_storm_20240619182727                                   COMPLETE              51.28
2020_jan_cold_blast_20240619182727                                 COMPLETE              51.81
community_20240619182728                                           COMPLETE              19.92
custom_ESGgrid_20240619182729                                      COMPLETE              21.68
custom_ESGgrid_Central_Asia_3km_20240619182730                     COMPLETE              43.42
custom_ESGgrid_Great_Lakes_snow_8km_20240619182730                 COMPLETE              16.94
custom_ESGgrid_IndianOcean_6km_20240619182732                      COMPLETE              25.14
custom_ESGgrid_NewZealand_3km_20240619182732                       COMPLETE             101.14
custom_ESGgrid_Peru_12km_20240619182733                            COMPLETE              31.04
custom_ESGgrid_SF_1p1km_20240619182734                             COMPLETE             303.29
custom_GFDLgrid__GFDLgrid_USE_NUM_CELLS_IN_FILENAMES_eq_FALSE_202  COMPLETE               9.07
custom_GFDLgrid_20240619182735                                     COMPLETE               8.12
deactivate_tasks_20240619182736                                    COMPLETE               0.91
get_from_AWS_ics_GEFS_lbcs_GEFS_fmt_grib2_2022040400_ensemble_2me  COMPLETE            1426.37
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_grib2_2019061200_2024061  COMPLETE               6.71
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_nemsio_2019061200_202406  COMPLETE              10.06
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_nemsio_2021032018_202406  COMPLETE              10.25
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_netcdf_2022060112_48h_20  COMPLETE              79.25
get_from_HPSS_ics_GDAS_lbcs_GDAS_fmt_netcdf_2022040400_ensemble_2  COMPLETE            1458.27
get_from_HPSS_ics_GSMGFS_lbcs_GSMGFS_20240619182741                COMPLETE               7.88
get_from_HPSS_ics_HRRR_lbcs_RAP_20240619182742                     COMPLETE              15.97
get_from_HPSS_ics_RAP_lbcs_RAP_20240619182743                      COMPLETE              17.64
get_from_NOMADS_ics_FV3GFS_lbcs_FV3GFS_20240619182744              DEAD                  11.84
grid_CONUS_25km_GFDLgrid_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_202  COMPLETE              12.41
grid_CONUS_3km_GFDLgrid_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta_  COMPLETE             688.04
grid_RRFS_AK_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot_20240  COMPLETE             247.35
grid_RRFS_AK_3km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR_20240619182747  COMPLETE             461.60
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RAP_20240619182  COMPLETE              39.87
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot_20  COMPLETE              50.09
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR_2024061918  COMPLETE              48.01
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta_202  COMPLETE              46.93
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2_20240  COMPLETE               6.06
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot_20  COMPLETE              13.73
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8_plot  COMPLETE              14.04
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR_2024061918  COMPLETE              15.57
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_RAP_20240619182  COMPLETE              41.58
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta_202  COMPLETE              18.82
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_RAP_20240619182758  COMPLETE              10.76
grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v15p2_20240  COMPLETE               6.64
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16_2024061918280  COMPLETE              22.40
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_RRFS_v1beta_202406191  COMPLETE              16.76
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2_202406  COMPLETE             437.73
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15_thompson  COMPLETE             623.11
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_20240619  COMPLETE             584.82
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR_20240619182  COMPLETE             699.09
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta_2024  COMPLETE             704.18
grid_RRFS_CONUScompact_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_  COMPLETE              42.22
grid_RRFS_CONUScompact_13km_ics_HRRR_lbcs_RAP_suite_HRRR_20240619  COMPLETE              42.12
grid_RRFS_CONUScompact_13km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta_2  COMPLETE              42.62
grid_RRFS_CONUScompact_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_  COMPLETE               9.98
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR_2024061  COMPLETE              39.94
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta_2  COMPLETE              10.96
grid_RRFS_CONUScompact_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_2  COMPLETE             524.45
grid_RRFS_CONUScompact_3km_ics_HRRR_lbcs_RAP_suite_HRRR_202406191  COMPLETE             917.47
grid_RRFS_CONUScompact_3km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta_20  COMPLETE             922.26
grid_RRFS_NA_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RAP_20240619182813  COMPLETE             128.46
grid_SUBCONUS_Ind_3km_ics_FV3GFS_lbcs_FV3GFS_suite_WoFS_v0_202406  COMPLETE              47.98
grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_HRRR_suite_HRRR_2024061918281  COMPLETE              27.68
grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_RAP_suite_WoFS_v0_20240619182  COMPLETE              24.85
grid_SUBCONUS_Ind_3km_ics_NAM_lbcs_NAM_suite_GFS_v16_202406191828  COMPLETE              35.34
grid_SUBCONUS_Ind_3km_ics_RAP_lbcs_RAP_suite_RRFS_v1beta_plot_202  COMPLETE              13.41
long_fcst_20240619182821                                           COMPLETE              98.76
MET_ensemble_verification_only_vx_20240619182822                   COMPLETE               1.04
MET_ensemble_verification_only_vx_time_lag_20240619182826          COMPLETE               3.90
MET_ensemble_verification_winter_wx_20240619182831                 COMPLETE             118.22
MET_verification_only_vx_20240619182832                            COMPLETE               0.23
pregen_grid_orog_sfc_climo_20240619182836                          COMPLETE               7.52
specify_EXTRN_MDL_SYSBASEDIR_ICS_LBCS_20240619182840               COMPLETE               7.17
specify_template_filenames_20240619182841                          COMPLETE               7.84
----------------------------------------------------------------------------------------------------
Total                                                              DEAD               11968.52
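With this many rows, the single non-COMPLETE experiment is easy to miss. The following is a rough sketch for pulling such rows out of a summary in this layout. The function name `non_complete_rows` is hypothetical and not part of run_WE2E_tests.py, and it assumes each experiment row consists of exactly three whitespace-separated fields (name, status, core hours).

```python
def non_complete_rows(summary_text):
    """Return (name, status, core_hours) for every experiment not marked COMPLETE."""
    failures = []
    for line in summary_text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # header, separator, or blank line
        name, status, hours = parts
        if name == "Total":
            continue  # summary row, not an experiment
        try:
            core_hours = float(hours)
        except ValueError:
            continue  # last field is not a number, so not a data row
        if status != "COMPLETE":
            failures.append((name, status, core_hours))
    return failures
```

Applied to the table above, this would report only the get_from_NOMADS_ics_FV3GFS_lbcs_FV3GFS experiment, which is listed as DEAD.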

I tried rerunning the test several times, and it always failed in the forecast between hours 05 and 06.

But when I ran that single test on its own:

./run_WE2E_tests.py -t get_from_NOMADS_ics_FV3GFS_lbcs_FV3GFS -m hera -a epic

the test passed. When I looked in the run directory, though, it used a different day:

from comprehensive: get_from_NOMADS_ics_FV3GFS_lbcs_FV3GFS/2024061800
from single test: get_from_NOMADS_ics_FV3GFS_lbcs_FV3GFS/2024061700
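The two run directories differ only in their cycle date. As a small sketch, the YYYYMMDDHH directory names can be parsed and compared to make the offset explicit; the helper name `cycle_time` is hypothetical and used here only for illustration.

```python
from datetime import datetime


def cycle_time(run_dir):
    """Parse the trailing YYYYMMDDHH path component into a datetime."""
    cycle = run_dir.rstrip("/").split("/")[-1]
    return datetime.strptime(cycle, "%Y%m%d%H")


comprehensive = cycle_time("get_from_NOMADS_ics_FV3GFS_lbcs_FV3GFS/2024061800")
single = cycle_time("get_from_NOMADS_ics_FV3GFS_lbcs_FV3GFS/2024061700")

# Difference between the two cycles as a timedelta
offset = comprehensive - single
```

Here `offset` is one day: the comprehensive run used the 2024-06-18 00Z cycle, while the single-test run used 2024-06-17 00Z.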

@MichaelLueken @EdwardSnyder-NOAA do you have an idea what is going on here?

MichaelLueken commented 3 weeks ago

@EdwardSnyder-NOAA -

I have gone ahead and added the DO_NOT_MERGE label temporarily to this PR. Please let me know when you have finished pushing changes to the PR, and I will remove the label, run a test on Hera GNU, then move forward with final Jenkins testing. Thank you very much!

MichaelLueken commented 3 weeks ago

@EdwardSnyder-NOAA -

With the merging of @RatkoVasic-NOAA's PR #1093, the SRW App is now compiling and running without issue on Hera GNU.

Are there additional changes that are still required for this PR, or is it safe to remove the DO_NOT_MERGE label and run the Jenkins tests now?