ufs-community / ufs-srweather-app

UFS Short-Range Weather Application

[develop] Verification upgrades and bug fixes #973

Closed gsketefian closed 8 months ago

gsketefian commented 10 months ago

DESCRIPTION OF CHANGES:

This PR cleans up and simplifies the verification tasks in the SRW App. Main changes:

Type of change

TESTS CONDUCTED:

The set of fundamental WE2E tests as well as all the verification tests were run on Hera with Intel. All completed successfully. The fundamental tests are:

grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta
nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_GFS_v16
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8_plot
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR
grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_RAP_suite_WoFS_v0
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16

The verification tests are:

MET_ensemble_verification
MET_ensemble_verification_only_vx
MET_ensemble_verification_only_vx_time_lag
MET_ensemble_verification_winter_wx
MET_verification
MET_verification_only_vx
MET_verification_winter_wx

Manual regression tests were also run on the following WE2E tests:

MET_verification_winter_wx [aka custom_ESGgrid_Great_Lakes_snow_8km]
MET_ensemble_verification_only_vx
MET_ensemble_verification_winter_wx

All had minor expected differences in results relative to the develop branch. There was a major difference in output (stat files) from the run_MET_GridStat_vx_ensprob_ASNOW06h task of the MET_ensemble_verification_winter_wx test, but that is due to the bug fix in GridStat_ensprob_ASNOW.conf regarding the mismatch between forecast and obs thresholds (and is thus expected).

DEPENDENCIES:

None

CHECKLIST

LABELS (optional):

A Code Manager needs to add the following labels to this PR:

CONTRIBUTORS (optional):

@michelleharrold @JeffBeck-NOAA @willmayfield

MichaelLueken commented 9 months ago

@gsketefian - With respect to the MET and METplus bugs that you are encountering, is the issue with the SRW App, or with MET and METplus? If the issue is with MET and METplus, would it be useful at all to attempt to merge PR #969 into a test version of your feature/vx_upgrades branch to see if using a later version of MET and METplus corrects the issue you are encountering? Thanks!

gsketefian commented 9 months ago

@MichaelLueken The issue is with MET/METplus, and I heard back from METplus developers as to the reason (if interested, see this discussion). I'm now working on the most appropriate fix. I will ask whether a later version of MET/METplus may solve this (but I doubt it; I would have to ask for this change in MET/METplus, and, if approved, it would have to be included in a future version).

gsketefian commented 8 months ago

@JeffBeck-NOAA @michelleharrold @willmayfield @mkavulich FYI that this vx PR is now open for review. If a couple of you can take a look, that would be great. Thanks!

gsketefian commented 8 months ago

Looks like some great simplifying and cleanup changes...love to see a reduction of almost 3000 lines! 👍

I have a few questions, but since they aren't major and mostly aren't specifically related to these changes I won't hold up this PR

I didn't realize that info was available (easily?). Where can one see the line number change for a PR? There will be a much larger reduction of lines in my next PR :)

MichaelLueken commented 8 months ago

@gsketefian -

At the top of the PR, on the far right-hand side, there are green numbers with a plus and red numbers with a minus. The green plus signifies the number of lines added in a PR, while the red minus represents the number of lines removed.

For this PR, I see the following in the top right side:

+1,394 −4,133

so there were 1,394 added lines, and 4,133 removed lines in this PR.
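
For illustration, the net change implied by those two numbers can be worked out with a few lines of Python. This is only a sketch: the helper name parse_diffstat is hypothetical and is not part of GitHub or the SRW App.

```python
import re


def parse_diffstat(text):
    """Return (added, removed) from a summary such as '+1,394 −4,133'."""
    # Accept either the Unicode minus sign (U+2212) or an ASCII hyphen.
    match = re.search(r"\+([\d,]+)\s+[\u2212-]([\d,]+)", text)
    if match is None:
        raise ValueError(f"no diffstat found in {text!r}")
    added, removed = (int(group.replace(",", "")) for group in match.groups())
    return added, removed


added, removed = parse_diffstat("+1,394 \u22124,133")
net = added - removed
print(added, removed, net)  # 1394 4133 -2739
```

The net change is a reduction of 2,739 lines, consistent with the "almost 3000 lines" mentioned above.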

MichaelLueken commented 8 months ago

The WE2E coverage tests were manually run on Derecho and all successfully passed:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
custom_ESGgrid_IndianOcean_6km                                     COMPLETE              23.77
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot     COMPLETE              38.17
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16                COMPLETE              44.85
grid_RRFS_CONUScompact_13km_ics_HRRR_lbcs_RAP_suite_HRRR           COMPLETE              29.32
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta    COMPLETE              17.71
grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_HRRR_suite_HRRR                COMPLETE              40.76
nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_  COMPLETE              24.76
pregen_grid_orog_sfc_climo                                         COMPLETE              15.86
specify_template_filenames                                         COMPLETE              15.10
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE             250.30

gsketefian commented 8 months ago

Oh right, thanks @MichaelLueken!

gsketefian commented 8 months ago

@JeffBeck-NOAA @RatkoVasic-NOAA @mkavulich Thanks for the reviews!

MichaelLueken commented 8 months ago

@gsketefian - All of the tests passed, with the exception of two tests on Jet:

The Jenkins workspace on Jet can be found at /mnt/lfs1/NAGAPE/epic/role.epic/jenkins/workspace/fs-srweather-app_pipeline_PR-973/jet/expt_dirs.

gsketefian commented 8 months ago

@MichaelLueken Thanks for the update, Mike. The PR doesn't touch the make_[ics|lbcs] tasks, so hopefully those are just one-time Jet-specific issues.

MichaelLueken commented 8 months ago

The two tests that had failed on Jet - get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_netcdf_2022060112_48h and get_from_HPSS_ics_RAP_lbcs_RAP - have successfully completed following the use of rocotorewind and rocotoboot:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
community                                                          COMPLETE              41.46
custom_ESGgrid                                                     COMPLETE              50.50
custom_ESGgrid_Great_Lakes_snow_8km                                COMPLETE              36.93
custom_GFDLgrid                                                    COMPLETE              32.32
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_nemsio_2021032018         COMPLETE              30.57
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_netcdf_2022060112_48h     COMPLETE              50.94
get_from_HPSS_ics_RAP_lbcs_RAP                                     COMPLETE              19.08
grid_RRFS_AK_3km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR                 COMPLETE             243.68
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot     COMPLETE              60.16
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2        COMPLETE              20.82
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta       COMPLETE             531.87
nco_grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR       COMPLETE              18.01
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE            1136.34

MichaelLueken commented 8 months ago

@gsketefian - Given that @christinaholtNOAA's PR #994 was approved and tested first, I merged that PR first. It changed the ex-scripts to transition to UW's command-line (CLI) tool, which created conflicts in these scripts in your branch. Please merge the current authoritative develop into your feature/vx_upgrades branch as soon as possible and address the conflicts in the ex-scripts; I will then complete the merge of this PR. Thank you very much!

MichaelLueken commented 8 months ago

@gsketefian -

While attempting to run one last batch of verification tests, specifically @mkavulich's new MET_ensemble_verification_winter_wx WE2E verification test, I found that VX_FIELDS in tests/WE2E/test_configs/verification/config.MET_ensemble_verification_winter_wx.yaml needs to be updated to VX_FIELDS: [ "APCP", "REFC", "RETOP", "ADPSFC", "ADPUPA", "ASNOW" ], rather than VX_FIELDS: [ "APCP", "REFC", "RETOP", "SFC", "UPA", "ASNOW" ]. Once this minor modification is made and my final tests are complete, I will move forward with merging this PR. Thanks!
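
As a rough sketch of that rename, the snippet below maps the two old field names onto their replacements. The mapping contains only the two renames mentioned in this comment, and the names LEGACY_TO_CURRENT and update_vx_fields are hypothetical helpers, not part of the SRW App.

```python
# Assumed mapping, limited to the two renames mentioned in this comment.
LEGACY_TO_CURRENT = {"SFC": "ADPSFC", "UPA": "ADPUPA"}


def update_vx_fields(fields):
    """Return a copy of fields with legacy names replaced; others pass through."""
    return [LEGACY_TO_CURRENT.get(name, name) for name in fields]


old_fields = ["APCP", "REFC", "RETOP", "SFC", "UPA", "ASNOW"]
new_fields = update_vx_fields(old_fields)
print(new_fields)
# ['APCP', 'REFC', 'RETOP', 'ADPSFC', 'ADPUPA', 'ASNOW']
```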

gsketefian commented 8 months ago

@MichaelLueken I encountered those problems as well with test MET_ensemble_verification_winter_wx. Several ASNOW tasks were failing, and, besides the change to config.MET_ensemble_verification_winter_wx.yaml that you pointed out, it was for the most part a matter of adding the accumulation to the variable name in the ASNOW METplus conf files, e.g. changing

FCST_VAR1_NAME = {{fieldname_in_met_output}}

to

FCST_VAR1_NAME = {{fieldname_in_met_output}}_{{accum_hh}}

I made this change in GenEnsProd_ASNOW.conf, EnsembleStat_ASNOW.conf, GridStat_ensmean_ASNOW.conf, and GridStat_ensprob_ASNOW.conf.
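
To make the effect of that change concrete, here is a minimal Python sketch that fills in the double-brace placeholders with a small regex. It is a stand-in for whatever templating the workflow actually uses, and the values "ASNOW" and "06" are assumed purely for illustration.

```python
import re


def render(template, values):
    """Replace each {{name}} placeholder in template with values[name]."""
    return re.sub(
        r"\{\{\s*(\w+)\s*\}\}",
        lambda match: str(values[match.group(1)]),
        template,
    )


# Hypothetical values for a 6-hour snowfall accumulation.
values = {"fieldname_in_met_output": "ASNOW", "accum_hh": "06"}

before = render("FCST_VAR1_NAME = {{fieldname_in_met_output}}", values)
after = render("FCST_VAR1_NAME = {{fieldname_in_met_output}}_{{accum_hh}}", values)
print(before)  # FCST_VAR1_NAME = ASNOW
print(after)   # FCST_VAR1_NAME = ASNOW_06
```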

However, I also found a stealthy bug in GridStat_ensprob_ASNOW.conf that changes results (and which @willmayfield will probably be interested in). The issue was an inadvertent shift in the threshold values used in the forecast field array names with respect to the threshold values specified for the observations. For example, for VAR2, the buggy code is

FCST_VAR2_NAME = {{fieldname_in_met_output}}_{{accum_hh}}_A{{accum_no_pad}}_ENS_FREQ_gt0.0
...
OBS_VAR2_THRESH = ge0.508

What it should be is:

FCST_VAR2_NAME = {{fieldname_in_met_output}}_{{accum_hh}}_A{{accum_no_pad}}_ENS_FREQ_ge0.508
...
OBS_VAR2_THRESH = ge0.508

As a result, the thresholds for the obs and forecasts did not match. So although the run_MET_GridStat_vx_ensprob_ASNOW06h task succeeds in the develop branch, I think its results are incorrect. I believe I've fixed the issue. @willmayfield if you're interested in taking a look at the results of this test (after I push my latest changes), please let me know and we can wait for you to take a look before merging.
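
A check along the following lines could catch this kind of shift. It is only a sketch: it assumes the forecast threshold is the text after _ENS_FREQ_ in FCST_VARn_NAME, as in the snippets above, and find_threshold_mismatches is a hypothetical helper, not part of MET, METplus, or the SRW App.

```python
import re


def find_threshold_mismatches(conf_text):
    """Return (n, fcst_thresh, obs_thresh) for each VARn whose thresholds differ."""
    fcst = {}  # n -> threshold suffix of FCST_VARn_NAME
    obs = {}   # n -> value of OBS_VARn_THRESH
    for line in conf_text.splitlines():
        match = re.match(r"\s*FCST_VAR(\d+)_NAME\s*=\s*\S*_ENS_FREQ_(\S+)", line)
        if match:
            fcst[int(match.group(1))] = match.group(2)
            continue
        match = re.match(r"\s*OBS_VAR(\d+)_THRESH\s*=\s*(\S+)", line)
        if match:
            obs[int(match.group(1))] = match.group(2)
    # Compare only the variables that have both a forecast name and an obs threshold.
    common = sorted(fcst.keys() & obs.keys())
    return [(n, fcst[n], obs[n]) for n in common if fcst[n] != obs[n]]


buggy = """
FCST_VAR2_NAME = {{fieldname_in_met_output}}_{{accum_hh}}_A{{accum_no_pad}}_ENS_FREQ_gt0.0
OBS_VAR2_THRESH = ge0.508
"""
fixed = buggy.replace("ENS_FREQ_gt0.0", "ENS_FREQ_ge0.508")

print(find_threshold_mismatches(buggy))  # [(2, 'gt0.0', 'ge0.508')]
print(find_threshold_mismatches(fixed))  # []
```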

I'm rerunning the test now to make sure it works from scratch and will then push my fixes. Thanks, Gerard

gsketefian commented 8 months ago

@MichaelLueken @willmayfield I reran the MET_ensemble_verification_winter_wx test with my latest changes, and it was successful. I've also done regression tests on this test as well as MET_ensemble_verification_only_vx and custom_ESGgrid_Great_Lakes_snow_8km. All have only expected differences in the vx output.

Please feel free to retest and merge. Thanks.

MichaelLueken commented 8 months ago

@gsketefian - Here is the current update on the retesting for this PR:

The WE2E coverage tests on Gaea have completed successfully:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
community_20240112103959                                           COMPLETE              23.22
custom_ESGgrid_NewZealand_3km_20240112104004                       COMPLETE              64.46
grid_RRFS_CONUScompact_13km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta_2  COMPLETE              34.92
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RAP_20240112104  COMPLETE              31.97
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR_2024011210  COMPLETE              33.87
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15_thompson  COMPLETE             357.80
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR_2024011  COMPLETE              33.36
grid_RRFS_CONUScompact_3km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta_20  COMPLETE             363.78
grid_SUBCONUS_Ind_3km_ics_RAP_lbcs_RAP_suite_RRFS_v1beta_plot_202  COMPLETE              10.55
nco_ensemble_20240112104015                                        COMPLETE              78.47
nco_grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15_thom  COMPLETE             351.98
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE            1384.38

The WE2E coverage tests on Gaea C5 have completed successfully:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
community_20240112104016                                           COMPLETE              43.13
custom_ESGgrid_NewZealand_3km_20240112104024                       COMPLETE              48.67
grid_RRFS_CONUScompact_13km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta_2  COMPLETE              27.85
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RAP_20240112104  COMPLETE              30.65
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR_2024011210  COMPLETE              31.93
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15_thompson  COMPLETE             313.32
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR_2024011  COMPLETE              30.43
grid_RRFS_CONUScompact_3km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta_20  COMPLETE             272.79
grid_SUBCONUS_Ind_3km_ics_RAP_lbcs_RAP_suite_RRFS_v1beta_plot_202  COMPLETE              16.73
nco_ensemble_20240112104043                                        COMPLETE              96.57
nco_grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15_thom  COMPLETE             304.58
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE            1216.65

The WE2E coverage tests on Hera GNU have completed successfully:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used
----------------------------------------------------------------------------------------------------
custom_ESGgrid_Central_Asia_3km_20240112155348                     COMPLETE              36.65
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_nemsio_2019061200_202401  COMPLETE              12.85
get_from_NOMADS_ics_FV3GFS_lbcs_FV3GFS_20240112155352              COMPLETE              20.08
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR_2024011215  COMPLETE              45.85
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta_202  COMPLETE              30.48
grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_RAP_suite_WoFS_v0_20240112155  COMPLETE              20.99
long_fcst_20240112155402                                           COMPLETE              95.20
MET_verification_only_vx_20240112155405                            COMPLETE               0.25
MET_ensemble_verification_only_vx_time_lag_20240112155410          COMPLETE               8.98
nco_grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_202  COMPLETE              63.53
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE             334.86

The WE2E coverage tests on Hera Intel have completed successfully:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used
----------------------------------------------------------------------------------------------------
custom_ESGgrid_Peru_12km_20240112155349                            COMPLETE              18.60
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_grib2_2019061200_2024011  COMPLETE               6.77
get_from_HPSS_ics_GDAS_lbcs_GDAS_fmt_netcdf_2022040400_ensemble_2  COMPLETE             789.24
get_from_HPSS_ics_HRRR_lbcs_RAP_20240112155354                     COMPLETE              14.18
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2_20240  COMPLETE               6.55
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot_20  COMPLETE              13.08
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_RAP_suite_RAP_20240112155405  COMPLETE              10.46
grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v15p2_20240  COMPLETE               7.13
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2_202401  COMPLETE             240.04
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_20240112  COMPLETE             343.84
grid_RRFS_CONUScompact_3km_ics_HRRR_lbcs_RAP_suite_HRRR_202401121  COMPLETE             332.25
pregen_grid_orog_sfc_climo_20240112155414                          COMPLETE               8.33
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE            1790.47

The WE2E coverage tests on Hercules have completed successfully:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
custom_GFDLgrid__GFDLgrid_USE_NUM_CELLS_IN_FILENAMES_eq_FALSE_202  COMPLETE               7.23
grid_CONUS_25km_GFDLgrid_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_202  COMPLETE              10.36
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta_202  COMPLETE              27.77
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8_plot  COMPLETE              16.63
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR_2024011209  COMPLETE              25.20
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_RAP_20240112091  COMPLETE              52.97
grid_RRFS_CONUScompact_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_  COMPLETE              13.31
grid_RRFS_NA_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RAP_20240112091331  COMPLETE              68.37
grid_SUBCONUS_Ind_3km_ics_NAM_lbcs_NAM_suite_GFS_v16_202401120913  COMPLETE              29.07
MET_verification_only_vx_20240112091333                            COMPLETE               0.23
specify_EXTRN_MDL_SYSBASEDIR_ICS_LBCS_20240112091334               COMPLETE               7.74
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE             258.88

The tests are still running on both Jet and Orion.

MichaelLueken commented 8 months ago

The WE2E coverage tests have successfully passed on Jet:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used
----------------------------------------------------------------------------------------------------
community_20240112203333                                           COMPLETE              19.12
custom_ESGgrid_20240112203338                                      COMPLETE              27.94
custom_ESGgrid_Great_Lakes_snow_8km_20240112203339                 COMPLETE              18.86
custom_GFDLgrid_20240112203344                                     COMPLETE              19.11
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_nemsio_2021032018_202401  COMPLETE              11.38
get_from_HPSS_ics_FV3GFS_lbcs_FV3GFS_fmt_netcdf_2022060112_48h_20  COMPLETE              52.60
get_from_HPSS_ics_RAP_lbcs_RAP_20240112203349                      COMPLETE              17.85
grid_RRFS_AK_3km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR_20240112203350  COMPLETE             247.62
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot_20  COMPLETE              50.02
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2_20240  COMPLETE              16.22
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta_2024  COMPLETE             521.74
nco_grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_HRRR_2024  COMPLETE              11.71
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE            1014.17

Still awaiting completion on Orion.

willmayfield commented 8 months ago

@MichaelLueken @gsketefian I tried it again and everything worked fine! I'm good with the changes.

I was worried that something was wrong with these results, but I now know that the problem was the model/physics giving unrealistic results on this test case, and not something due to this PR.

MichaelLueken commented 8 months ago

The WE2E coverage tests have successfully passed on Orion:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used
----------------------------------------------------------------------------------------------------
custom_ESGgrid_SF_1p1km_20240113115145                             COMPLETE             170.58
deactivate_tasks_20240113115150                                    COMPLETE               1.35
get_from_AWS_ics_GEFS_lbcs_GEFS_fmt_grib2_2022040400_ensemble_2me  COMPLETE             918.85
grid_CONUS_3km_GFDLgrid_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta_  COMPLETE             262.32
grid_RRFS_AK_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot_20240  COMPLETE             141.35
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_RRFS_v1beta_202401131  COMPLETE              16.29
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR_20240113115  COMPLETE             409.75
grid_RRFS_CONUScompact_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_  COMPLETE              30.79
grid_RRFS_CONUScompact_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_2  COMPLETE             280.11
grid_SUBCONUS_Ind_3km_ics_FV3GFS_lbcs_FV3GFS_suite_WoFS_v0_202401  COMPLETE              15.15
nco_20240113115203                                                 COMPLETE               7.87
2020_CAD_20240113115205                                            COMPLETE              35.60
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE            2290.01
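
For anyone who wants to double-check a summary like the one above, the following Python sketch re-adds the per-experiment core hours and compares them with the reported total. The function name check_summary and the parsing pattern are assumptions based only on the table layout shown here; this is not a tool shipped with the SRW App.

```python
import re

# One data row: experiment name, status, core hours (columns split on whitespace).
ROW = re.compile(r"^(\S+)\s+(\w+)\s+(\d+(?:\.\d+)?)\s*$")

ORION_TABLE = """\
Experiment name                                                  | Status    | Core hours used
custom_ESGgrid_SF_1p1km_20240113115145                             COMPLETE             170.58
deactivate_tasks_20240113115150                                    COMPLETE               1.35
get_from_AWS_ics_GEFS_lbcs_GEFS_fmt_grib2_2022040400_ensemble_2me  COMPLETE             918.85
grid_CONUS_3km_GFDLgrid_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta_  COMPLETE             262.32
grid_RRFS_AK_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_plot_20240  COMPLETE             141.35
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_RRFS_v1beta_202401131  COMPLETE              16.29
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR_20240113115  COMPLETE             409.75
grid_RRFS_CONUScompact_13km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_  COMPLETE              30.79
grid_RRFS_CONUScompact_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16_2  COMPLETE             280.11
grid_SUBCONUS_Ind_3km_ics_FV3GFS_lbcs_FV3GFS_suite_WoFS_v0_202401  COMPLETE              15.15
nco_20240113115203                                                 COMPLETE               7.87
2020_CAD_20240113115205                                            COMPLETE              35.60
Total                                                              COMPLETE            2290.01
"""


def check_summary(table_text):
    """Return (sum_of_rows, reported_total, all_complete) for a summary table."""
    row_sum = 0.0
    reported = None
    all_complete = True
    for line in table_text.splitlines():
        match = ROW.match(line)
        if match is None:
            continue  # header or separator line
        name, status, hours = match.group(1), match.group(2), float(match.group(3))
        if name == "Total":
            reported = hours
        else:
            row_sum += hours
            all_complete = all_complete and status == "COMPLETE"
    return round(row_sum, 2), reported, all_complete


row_sum, reported, all_complete = check_summary(ORION_TABLE)
print(row_sum, reported, all_complete)
```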

Given @willmayfield's continued approval after retesting these changes, I will now move forward with merging this PR.

gsketefian commented 8 months ago

@willmayfield @MichaelLueken Thanks for working on this!