ufs-community / ufs-srweather-app

UFS Short-Range Weather Application
Other
53 stars 114 forks source link

[develop] Simplify the way the configuration of the vx is handled #1082

Closed gsketefian closed 2 months ago

gsketefian commented 2 months ago

DESCRIPTION OF CHANGES:

This PR removes the tasks parse_vx_config_[det|ens] that read in and parse the verification (vx) yaml configuration files (one for deterministic vx and another for ensemble vx). Currently, these tasks are used to call the script decouple_fcst_obs_vx_config.py to separate the "coupled" vx config information in the yaml files into forecast and observation components. For example, a "coupled" field name of "CRAIN%%PRWE" in the deterministic configuration file corresponds to a field name of "CRAIN" for forecasts and "PRWE" for the observations (and the "%%" is the delimiter). After separating each item (field name, level, or threshold) in this way, the parse_vx_config_[det|ens] tasks write this separated information into two new intermediate configuration files (one for deterministic and one for ensemble) as two separate dictionaries (one for forecasts and another for observations). These intermediate config files are then read in by the various vx task ex-scripts.

In this PR, the parse_vx_config_[det|ens] tasks and the decouple_fcst_obs_vx_config.py script are removed (so that the intermediate configuration files are no longer created). The separation into forecast and observation values of the "coupled" information in the vx configuration files is now performed in the jinja2 templates for the METplus configuration files, hiding these details from the user.

This PR was prompted by some of the questions by @mkavulich in PR #1005.

Type of change

TESTS CONDUCTED:

The following vx WE2E tests were run on Hera:

MET_ensemble_verification
MET_ensemble_verification_only_vx
MET_ensemble_verification_only_vx_time_lag
MET_ensemble_verification_winter_wx
MET_verification
MET_verification_only_vx
MET_verification_winter_wx

They all completed successfully.

DOCUMENTATION:

No updates necessary.

ISSUE:

None.

CHECKLIST

LABELS (optional):

A Code Manager needs to add the following labels to this PR:

CONTRIBUTORS (optional):

@mkavulich

RatkoVasic-NOAA commented 2 months ago

@gsketefian I ran tests on Hera. It failed for some tests (that passed for you before): /scratch2/NCEPDEV/fv3-cam/Ratko.Vasic/SRW-1082/expt_dirs/WE2E_summary_20240509175500.txt

/scratch2/NCEPDEV/fv3-cam/Ratko.Vasic/SRW-1082/expt_dirs/MET_ensemble_verification_only_vx_time_lag/log/get_obs_ndas_2021050500.log

Can you please take a look, in second log fie, it is trying to get from HPSS file that does not exist.

MichaelLueken commented 2 months ago

@RatkoVasic-NOAA -

I saw similar issues with the MET_ensemble_verification_only_vx_time_lag and MET_verification_winter_wx WE2E tests. I believe that these were unavailable during the HPSS library maintenance today. Once the maintenance concluded, I rocotorewind/rocotoboot the get_verif_obs tasks and they successfully passed. Please try rerunning these tasks/tests now and see if they continue to fail.

gsketefian commented 2 months ago

@MichaelLueken @RatkoVasic-NOAA I just pushed a few more commits that further streamline the METplus config templates. I reran the test MET_ensemble_verification_only_vx, and it worked. This test checks all the latest changes, so the rest of the tests should work fine. Go ahead and retest at your convenience. Thanks.

RatkoVasic-NOAA commented 2 months ago

After two tries, it passed on Hera:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used
----------------------------------------------------------------------------------------------------
MET_ensemble_verification_20240510145323                           COMPLETE              21.52
MET_ensemble_verification_only_vx_20240510145325                   COMPLETE               1.04
MET_ensemble_verification_only_vx_time_lag_20240510145328          COMPLETE               3.90
MET_ensemble_verification_winter_wx_20240510145331                 COMPLETE             112.10
MET_verification_20240510145332                                    COMPLETE              10.12
MET_verification_only_vx_20240510145334                            COMPLETE               0.23
MET_verification_winter_wx_20240510145335                          COMPLETE              16.54
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE             165.45

Detailed summary written to /scratch2/NCEPDEV/fv3-cam/Ratko.Vasic/SRW-1082/expt_dirs/WE2E_summary_20240510152858.txt
RatkoVasic-NOAA commented 2 months ago

@gsketefian , can you please check what is going on with this one:

/scratch2/NCEPDEV/fv3-cam/Ratko.Vasic/SRW-1082/expt_dirs/MET_ensemble_verification_only_vx_time_lag/log/run_MET_GenEnsProd_vx_RETOP_2021050500.log
/scratch2/NCEPDEV/fv3-cam/Ratko.Vasic/SRW-1082/expt_dirs/MET_ensemble_verification_only_vx_time_lag/log/run_MET_GenEnsProd_vx_RETOP_2021050500.log.0

It is failing randomly. Yesterday it failed twice in a row causing rocoto to give up and whole test to fail. Here is same and additional test that failed more than once:

Hera:/scratch2/NCEPDEV/fv3-cam/Ratko.Vasic/SRW-1082/expt_dirs-old/MET_ensemble_verification_only_vx_time_lag/log>
run_MET_GenEnsProd_vx_REFC_2021050500.log
run_MET_GenEnsProd_vx_REFC_2021050500.log.0
run_MET_GenEnsProd_vx_RETOP_2021050500.log
run_MET_GenEnsProd_vx_RETOP_2021050500.log.0

I wonder if you have an idea what might be causing this random occurrence?

gsketefian commented 2 months ago

@RatkoVasic-NOAA I think I figured out the problem.

What's happening is that during the run_MET_GenEnsProd_vx_REFC task (and also for RETOP, but I'll just talk about REFC for simplicity), in exregional_run_met_genensprod_or_ensemblestat.sh the call to set_vx_fhr_list is failing to find some or all of the obs files because the task get_obs_mrms has not yet generated these files (because it hasn't launched yet, or has not yet completed). But these obs files aren't necessary for running run_MET_GenEnsProd_vx_REFC because GenEnsProd only takes forecast files as inputs. The reason this check exists is that this ex-script is also used for the EnsembleStat tasks, which do take observation files as inputs.

There are a couple of quick-and-dirty (i.e. hard-coded) fixes for this, but I'd rather implement a more thorough fix. Since this is something that predates this PR and is not directly related, I'd like to create an issue and take a look next week. Are you ok with that? Otherwise I can do a one-liner quick fix.

RatkoVasic-NOAA commented 2 months ago

@gsketefian great! I suggest you do 'dirty' quick fix now (although not related to this PR), and open an issue for nice fix for later. @MichaelLueken what do you think? I'll approve this PR either way.

MichaelLueken commented 2 months ago

@gsketefian great! I suggest you do 'dirty' quick fix now (although not related to this PR), and open an issue for nice fix for later. @MichaelLueken what do you think? I'll approve this PR either way.

@RatkoVasic-NOAA I'd recommend opening an issue to properly address this issue rather than create a temporary fix. A quick-and-dirty fix could potentially break the way the workflow runs for certain tests, leading to more issues. I'd err on the side of caution and focus on a fix that will work for all verification tests.

@gsketefian Once the parm/metplus/STATAnalysisConfig_skill_score file has been added back and @EdwardSnyder-NOAA has approved, I will kick of the automated Jenkins tests.

gsketefian commented 2 months ago

@MichaelLueken @RatkoVasic-NOAA @EdwardSnyder-NOAA I added back the file STATAnalysisConfig_skill_score and created an issue at #1084. I think this PR is ready to (re)test. Thanks.

RatkoVasic-NOAA commented 2 months ago

Except expected failures which are reported in issue #1084, tests passed on Hera:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used
----------------------------------------------------------------------------------------------------
MET_ensemble_verification_20240510195653                           COMPLETE              24.54
MET_ensemble_verification_only_vx_20240510195655                   COMPLETE               1.04
MET_ensemble_verification_only_vx_time_lag_20240510195657          DEAD                   3.49
MET_ensemble_verification_winter_wx_20240510195700                 COMPLETE             122.83
MET_verification_20240510195702                                    COMPLETE              10.39
MET_verification_only_vx_20240510195703                            COMPLETE               0.23
MET_verification_winter_wx_20240510195705                          COMPLETE              17.69
----------------------------------------------------------------------------------------------------
Total                                                              DEAD                 180.21
MET_ensemble_verification_only_vx_time_lag/log/run_MET_GenEnsProd_vx_REFC_2021050500.log.0
MET_ensemble_verification_only_vx_time_lag/log/run_MET_GenEnsProd_vx_RETOP_2021050500.log.0

Approved.

MichaelLueken commented 2 months ago

The Jenkins tests successfully passed on all platforms. Moving forward with merging this work now.