ufs-community / ufs-srweather-app

UFS Short-Range Weather Application

[develop] Expand forecast fields for metric test #1048

Closed EdwardSnyder-NOAA closed 5 months ago

EdwardSnyder-NOAA commented 5 months ago

DESCRIPTION OF CHANGES:

This PR expands the number of forecast fields used in the Skill Score metric test. The forecast length in the metric WE2E test was extended to 12 hours so that the RMSE metric can be calculated for the additional forecast fields.

Adding these forecast fields makes the skill score metric test more thorough, and therefore a more comprehensive baseline to compare against.
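Since the test's pass/fail hinges on the RMSE metric, here is a minimal sketch of an RMSE computation in shell/awk. The forecast and observation values are made-up numbers for illustration, not output from this PR:

```shell
# RMSE over paired forecast/observation values (made-up numbers).
# paste joins the two value columns; awk accumulates squared differences
# and prints the square root of their mean.
paste <(printf '%s\n' 1.0 2.0 3.0) <(printf '%s\n' 1.1 1.9 3.2) |
  awk '{ d = $1 - $2; s += d * d; n++ } END { printf "%.4f\n", sqrt(s / n) }'
# → 0.1414
```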

Also, a change was made to the .cicd/scripts/srw_metric_example.sh script to reflect the new conda environment.

Type of change

TESTS CONDUCTED:

Those interested in running .cicd/scripts/srw_metric_example.sh will need to do the following. Note that this script builds the app, so it can be run right after running manage_externals.

  1. export WORKSPACE=(path of your ufs-srweather-app folder)
  2. export SRW_PLATFORM=(e.g., orion)
  3. export SRW_COMPILER=(e.g., intel)
  4. export SRW_PROJECT=(e.g., epic-ps)
  5. run script: ./.cicd/scripts/srw_metric_example.sh
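Collected into one copy-pasteable snippet (the values shown are placeholders taken from the steps above; substitute your own paths, platform, compiler, and project):

```shell
# Placeholder values for illustration; replace with your own settings.
export WORKSPACE="${HOME}/ufs-srweather-app"  # path of your ufs-srweather-app folder
export SRW_PLATFORM="orion"                   # target platform
export SRW_COMPILER="intel"                   # compiler
export SRW_PROJECT="epic-ps"                  # batch project/account
# Then run the script from the top of the clone:
# ./.cicd/scripts/srw_metric_example.sh
```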

DEPENDENCIES:

DOCUMENTATION:

ISSUE:

CHECKLIST

LABELS (optional):

A Code Manager needs to add the following labels to this PR:

CONTRIBUTORS (optional):

RatkoVasic-NOAA commented 5 months ago

WE2E fundamental tests passed on Hera and Jet:

grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta_2  COMPLETE              10.29
nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_  COMPLETE              14.06
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2_20240  COMPLETE               8.54
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8_plot  COMPLETE              15.44
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR_2024030  COMPLETE              24.69
grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_RAP_suite_WoFS_v0_20240304192  COMPLETE              21.40
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16_2024030419291  COMPLETE              22.70
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE             117.12

Detailed summary written to /mnt/lfs4/HFIP/hfv3gfs/Ratko.Vasic/1048/expt_dirs/WE2E_summary_20240304221958.txt

grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta_2  COMPLETE               8.87
nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_  COMPLETE              12.09
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2_20240  COMPLETE               7.18
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8_plot  COMPLETE              13.30
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR_2024030  COMPLETE              26.23
grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_RAP_suite_WoFS_v0_20240304192  COMPLETE              13.32
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16_2024030419284  COMPLETE              19.29
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE             100.28

Detailed summary written to /scratch2/NCEPDEV/fv3-cam/Ratko.Vasic/1048/expt_dirs/WE2E_summary_20240304230434.txt

RatkoVasic-NOAA commented 5 months ago

After these commands:

  export WORKSPACE=/scratch2/NCEPDEV/fv3-cam/Ratko.Vasic/1048/ufs-srweather-app
  export SRW_PLATFORM=hera
  export SRW_COMPILER=intel
  export ACCOUNT=epic
  ./.cicd/scripts/srw_metric_example.sh

Shell failed with:

+ set -e -u
+ cd /scratch2/NCEPDEV/fv3-cam/Ratko.Vasic/1048/ufs-srweather-app/hera/tests
./.cicd/scripts/srw_metric_example.sh: line 53: cd: /scratch2/NCEPDEV/fv3-cam/Ratko.Vasic/1048/ufs-srweather-app/hera/tests: No such file or directory

Script ./.cicd/scripts/srw_metric_example.sh, line 20, looks like this:

declare workspace
if [[ -n "${WORKSPACE}/${SRW_PLATFORM}" ]]; then
    workspace="${WORKSPACE}/${SRW_PLATFORM}"
else
    workspace="$(cd -- "${script_dir}/../.." && pwd)"
fi

The workspace variable always takes the ${WORKSPACE}/${SRW_PLATFORM} value, because -n (true if the string is non-empty) is always true for that concatenation. It looks like -n should be replaced with -d (true if the directory exists).

RatkoVasic-NOAA commented 5 months ago

Also, the variable SRW_PROJECT should be set. If it isn't, the account will default to "no_account": <!ENTITY ACCOUNT "no_account">
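In the workflow the default comes from an XML entity, but the effect is the same as a shell default expansion. A hypothetical illustration (variable names follow the thread; the parameter-expansion form is mine):

```shell
# Illustration only: with SRW_PROJECT unset, the account falls back to
# "no_account", mirroring the <!ENTITY ACCOUNT "no_account"> default.
unset SRW_PROJECT
ACCOUNT="${SRW_PROJECT:-no_account}"
echo "${ACCOUNT}"   # → no_account
```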

EdwardSnyder-NOAA commented 5 months ago

> After these commands: … Shell failed with: cd: /scratch2/NCEPDEV/fv3-cam/Ratko.Vasic/1048/ufs-srweather-app/hera/tests: No such file or directory … Variable workspace is getting value from SRW_PLATFORM. It looks like you should replace -n (true if there are characters in variable) with -d (true if directory exists).

This logic was added to address shared workspaces for Gaea/Gaea-c5 and Hercules/Orion. I checked a number of T1 platforms to see if the SRW_PLATFORM directory exists and found that it only does for Gaea and Hercules/Orion. Given that this variable is a required argument, I'll change the logic to "-d" to avoid errors for non-shared workspace platforms.
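A runnable sketch of the corrected check, with a temporary directory standing in for a real clone (the layout and paths here are invented for the demo; only the -d test differs from the original logic):

```shell
#!/usr/bin/env bash
set -eu
# Stand-in layout: WORKSPACE is a temp dir with no <platform> subdirectory,
# as on non-shared-workspace machines such as Hera.
WORKSPACE="$(mktemp -d)"
SRW_PLATFORM="hera"
script_dir="${WORKSPACE}/.cicd/scripts"
mkdir -p "${script_dir}"

declare workspace
if [[ -d "${WORKSPACE}/${SRW_PLATFORM}" ]]; then
    # Shared-workspace platforms (e.g., Gaea, Hercules/Orion) take this branch.
    workspace="${WORKSPACE}/${SRW_PLATFORM}"
else
    # Everyone else falls back to the app root, avoiding the bad cd.
    workspace="$(cd -- "${script_dir}/../.." && pwd)"
fi
echo "workspace=${workspace}"
```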

EdwardSnyder-NOAA commented 5 months ago

> Also, variable SRW_PROJECT should be set. If not, account will be set to "no_account": <!ENTITY ACCOUNT "no_account">

Another good find, @RatkoVasic-NOAA! Somehow the experiment passed for me with no account on PW AWS. To resolve this, export SRW_PROJECT (e.g., epic) instead of exporting ACCOUNT. I updated the directions in the PR.

RatkoVasic-NOAA commented 5 months ago

I tested the PR on five machines; three passed (Hera, Jet, and Hercules) and two failed (Orion and Gaea):

Hercules:
+ [[ 0.99043 < 0.700 ]]
+ echo 'Congrats! You pass check!'
Jet:
+ [[ 0.9855 < 0.700 ]]
+ echo 'Congrats! You pass check!'
Hera:
+ [[ 0.99043 < 0.700 ]]
+ echo 'Congrats! You pass check!'
Gaea:
Shell debugging temporarily silenced: export LMOD_SH_DBG_ON=1 for Lmod's output
Lmod has detected the following error:  These module(s) or extension(s) exist but cannot be loaded as requested: "python/3.10.8"
   Try: "module spider python/3.10.8" to see how to load the module(s).
Orion:
+++ . /apps/intel-2022.1.2/intel-2022.1.2/intelpython/latest/etc/conda/deactivate.d/xgboost_deactivate.sh
/apps/intel-2022.1.2/intel-2022.1.2/intelpython/latest/etc/conda/deactivate.d/xgboost_deactivate.sh: line 16: OCL_ICD_FILENAMES_RESET: unbound variable

RatkoVasic-NOAA commented 5 months ago

And Orion:

+ [[ 0.99043 < 0.700 ]]
+ echo 'Congrats! You pass check!'

EdwardSnyder-NOAA commented 5 months ago

Gaea passed for me with the latest changes here: /gpfs/f5/epic/scratch/Edward.Snyder/pr_1048/ufs-srweather-app

+ [[ 0.98789 < 0.700 ]]
+ echo 'Congrats! You pass check!'

RatkoVasic-NOAA commented 5 months ago

Gaea worked for me as well:

+ [[ 0.98789 < 0.700 ]]
+ echo 'Congrats! You pass check!'
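A side note on the traces above: inside bash `[[ ]]`, `<` compares strings lexically rather than numerically, so `[[ 0.99043 < 0.700 ]]` is a string comparison that happens to be false here (pass meaning the score is not below the threshold). A sketch of a true floating-point version of that check, assuming pass means score >= threshold (the variable names and values are mine):

```shell
score="0.98789"       # hypothetical skill score from a run
threshold="0.700"     # hypothetical pass threshold
# awk exits 0 (success) when the numeric comparison holds
if awk -v s="$score" -v t="$threshold" 'BEGIN { exit !(s >= t) }'; then
  echo 'Congrats! You pass check!'
fi
```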

MichaelLueken commented 5 months ago

The Jet get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_netcdf_2022060112_48h WE2E test failed in make_ics and make_lbcs due to out-of-memory (OOM) issues. After rerunning with rocotorewind/rocotoboot, the test passed.

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
community_20240308170044                                           COMPLETE              17.40
custom_ESGgrid_20240308170046                                      COMPLETE              17.69
custom_ESGgrid_Great_Lakes_snow_8km_20240308170047                 COMPLETE              12.49
custom_GFDLgrid_20240308170049                                     COMPLETE               8.85
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_nemsio_2021032018_202403  COMPLETE              10.45
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_netcdf_2022060112_48h_20  COMPLETE              50.99
get_from_HPSS_ics_RAP_lbcs_RAP_20240308170053                      COMPLETE              15.61
grid_RRFS_AK_3km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR_20240308170055  COMPLETE             215.15
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot_20  COMPLETE              41.45
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2_20240  COMPLETE               8.24
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta_2024  COMPLETE             494.60
nco_grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR_2024  COMPLETE              10.73
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE             903.65

The tests have successfully passed on Derecho, Gaea, and Hercules. The tests are still running on Hera.

MichaelLueken commented 5 months ago

@EdwardSnyder-NOAA -

Given that the Hera GNU tests have been sitting in the queue for days, and that Hera GNU cannot currently be run on Rocky8, a successful Hera Intel run will be enough to get this work merged. Once HPSS has returned to service following maintenance, I will manually run the Jenkins coverage tests with Hera Intel and post the summary in this PR.

There was a failure in the get_from_AWS_ics_GEFS_lbcs_GEFS_fmt_grib2_2022040400_ensemble_2mems WE2E test on Orion that is currently being rerun via the pipeline. Once this test successfully completes and the Rocky8 Hera Intel test is complete, I will move forward with merging this PR.

MichaelLueken commented 5 months ago

The Hera Intel coverage WE2E tests were successfully run using Rocky8:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
custom_ESGgrid_Peru_12km_20240312180224                            COMPLETE              18.40
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_grib2_2019061200_2024031  COMPLETE               6.71
get_from_HPSS_ics_GDAS_lbcs_GDAS_fmt_netcdf_2022040400_ensemble_2  COMPLETE             766.52
get_from_HPSS_ics_HRRR_lbcs_RAP_20240312180228                     COMPLETE              14.39
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2_20240  COMPLETE               6.20
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot_20  COMPLETE              12.91
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_RAP_20240312180232  COMPLETE              10.50
grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v15p2_20240  COMPLETE               7.02
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2_202403  COMPLETE             233.03
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_20240312  COMPLETE             309.05
grid_RRFS_CONUScompact_3km_ics_HRRR_lbcs_RAP_suite_HRRR_202403121  COMPLETE             330.33
pregen_grid_orog_sfc_climo_20240312180239                          COMPLETE               7.73
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE            1722.79

The Orion tests are still sitting in the queue, so I will continue to hold off until they complete and I've done a final check with @EdwardSnyder-NOAA before merging this PR.

MichaelLueken commented 5 months ago

The Jenkins tests failed to kick off the WE2E coverage tests on Orion, so I manually ran them and they all passed:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
custom_ESGgrid_SF_1p1km_20240312102903                             COMPLETE             164.23
deactivate_tasks_20240312102905                                    COMPLETE               1.31
get_from_AWS_ics_GEFS_lbcs_GEFS_fmt_grib2_2022040400_ensemble_2me  COMPLETE             758.24
grid_CONUS_3km_GFDLgrid_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta_  COMPLETE             358.74
grid_RRFS_AK_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot_20240  COMPLETE             139.42
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_RRFS_v1beta_202403121  COMPLETE              15.45
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR_20240312102  COMPLETE             379.58
grid_RRFS_CONUScompact_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_  COMPLETE              31.00
grid_RRFS_CONUScompact_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_2  COMPLETE             277.98
grid_SUBCONUS_Ind_3km_ics_FV3GFS_lbcs_FV3GFS_suite_WoFS_v0_202403  COMPLETE              27.61
nco_20240312102917                                                 COMPLETE               7.94
2020_CAD_20240312102919                                            COMPLETE              32.28
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE            2193.78

I'm running one last test on the latest update to the .cicd/scripts/srw_metric_example.sh script and once it passes, I will reapprove and merge this PR.