ufs-community / ufs-srweather-app

UFS Short-Range Weather Application

[develop] Streamline SRW App's interface to MET/METplus #1005

Closed gsketefian closed 3 months ago

gsketefian commented 7 months ago

DESCRIPTION OF CHANGES:

This PR streamlines the SRW App's interface to the MET/METplus verification tool and implements some bug fixes. Details:

1) Replace the field-specific METplus configuration jinja2 templates associated with each METplus tool (these templates are hard-coded for each field) with a single template that contains jinja2 code to handle any valid field to be verified. For example, for the EnsembleStat tool, the six templates

    EnsembleStat_APCP.conf
    EnsembleStat_ASNOW.conf
    EnsembleStat_REFC.conf
    EnsembleStat_RETOP.conf
    EnsembleStat_ADPSFC.conf
    EnsembleStat_ADPUPA.conf
are replaced by the single template `EnsembleStat.conf`.

2) Add yaml configuration files for verification that specify the fields to verify (including field levels and thresholds). This consolidates the field/level/threshold information in one place instead of having it spread out and repeated across several hard-coded configuration files. Two such yaml configuration files are added:

  1. vx_config_det.yaml -- for deterministic verification
  2. vx_config_ens.yaml -- for ensemble verification

3) Add a python script (decouple_fcst_obs_vx_config.py) to parse these two vx configuration files and create a dictionary of the field/level/threshold information that can then be passed to the unified workflow templating tool.

4) Modify the ex-scripts for the verification tasks (exregional_run_met_....sh) to allow the use of the new jinja2 METplus config templates. This includes adding code to call the new script decouple_fcst_obs_vx_config.py and then passing its output to the unified workflow templating tool to generate METplus configuration files from the (new) jinja2 templates (a sketch of this call pattern is shown after this list).

5) Add new environment variables to the rocoto workflow configuration files (verify_[pre|det|ens].yaml) that are needed for using the new jinja2 METplus config templates.

6) Bug fixes:

   1. For deterministic verification with METplus's PointStat tool, some of the fields to be verified were missing the set_attr_lead option in the METplus configuration file. This option accounts for time-lagging when setting the entry in the FCST_LEAD column of the output stat file: it causes the FCST_LEAD entry of time-lagged members to have the same lead time as non-lagged members (i.e. it converts the actual lead time to the lead time assuming no time-lagging). The option was added where it was missing (e.g. if variable number 14 was missing it, a line such as FCST_VAR14_OPTIONS = set_attr_lead = "{lead?fmt=%H%M%S}"; was added). The field/level combinations that were missing this option were:

      1. For PointStat verification of ADPSFC (surface) fields: SPFH at Z2, and CRAIN, CSNOW, CFRZR, and CICEP at L0.
      2. For PointStat verification of ADPUPA (upper air) fields: CAPE at L0-90.

   2. For ensemble verification, in unifying the threshold settings into the yaml configuration file vx_config_ens.yaml, the following bug fixes were automatically implemented: 1) GenEnsProd:
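For illustration, here is a minimal sketch of the resulting call pattern in the vx ex-scripts (the uw invocation mirrors the final form quoted near the end of this thread; the call to decouple_fcst_obs_vx_config.py is schematic, with its actual arguments omitted; ${metplus_config_tmpl_fp}, ${metplus_config_fp}, and ${tmpfile} are the template path, output path, and temporary values file used in those scripts):

    # Parse vx_config_det.yaml / vx_config_ens.yaml into the field/level/threshold
    # dictionary (schematic call; real arguments not shown here).
    python3 decouple_fcst_obs_vx_config.py

    # Embed that dictionary, along with the other METplus settings, in the
    # temporary YAML values file ${tmpfile}, then render the METplus config
    # file from the single jinja2 template:
    uw template render \
      -i "${metplus_config_tmpl_fp}" \
      -o "${metplus_config_fp}" \
      --verbose \
      --values-file "${tmpfile}" \
      --search-path "/"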

TESTS CONDUCTED:

The set of fundamental WE2E tests as well as all the verification tests were run on Hera with Intel. All completed successfully. The fundamental tests are:

grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta
nco_grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_timeoffset_suite_GFS_v16
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2
grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8_plot
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR
grid_SUBCONUS_Ind_3km_ics_HRRR_lbcs_RAP_suite_WoFS_v0
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16

The verification tests are:

MET_ensemble_verification
MET_ensemble_verification_only_vx
MET_ensemble_verification_only_vx_time_lag
MET_ensemble_verification_winter_wx
MET_verification
MET_verification_only_vx
MET_verification_winter_wx

Manual regression tests were also run on the following WE2E tests:

MET_verification_winter_wx [aka custom_ESGgrid_Great_Lakes_snow_8km]
MET_ensemble_verification_only_vx
MET_ensemble_verification_only_vx_time_lag
MET_ensemble_verification_winter_wx

All four showed several minor (and expected) differences in results relative to the develop branch that involved only metadata values, not numerical values (e.g. an accumulation level changed from A03 to A3). There were also some changes in numerical values, but all were expected consequences of the threshold bug fixes described above.

DOCUMENTATION:

No documentation changes are needed.


CONTRIBUTORS (optional):

@michelleharrold @JeffBeck-NOAA

gsketefian commented 7 months ago

@MichaelLueken The old uw_tools external (hash e1b3b6f of the uw_tools repo) works with my code, but the new UW CLI tool for templating (introduced in PR #994) does not. So to test this PR, I had to add a subdirectory (parm/metplus/uw_tools) containing the necessary scripts from the old uw_tools external. We can remove this subdirectory if @christinaholtNOAA or someone else can look into why the UW CLI templater doesn't work. The calls to the templater are in the 6 ex-scripts exregional_run_met_....sh.

gsketefian commented 7 months ago

@JeffBeck-NOAA @michelleharrold @mkavulich @willmayfield Do you mind taking a look at this PR when you get a chance? Thanks.

MichaelLueken commented 7 months ago

@gsketefian - It looks like the CLI tool is still working for the exregional_run_met_pb2nc_obs.sh script. I just see the removal of field_thresholds and the replacement of cat with printf in this script. Do your verification configuration modifications not affect met_pb2nc_obs as heavily as the other verification tasks?

gsketefian commented 7 months ago

@MichaelLueken Yes, you're right, exregional_run_met_pb2nc_obs.sh works with the new UW CLI templater. I think it's because in that script the settings variable does not contain the dictionary variable vx_config_dict that the rest of the vx ex-scripts have. vx_config_dict is a multi-level nested dictionary of vx information (field groups, field names, levels, and thresholds). It's the inclusion of vx_config_dict in settings that seems to cause a problem for the UW CLI templater (but not for the scripts in the old uw_tools repository). I didn't have time to explore the reason for this.
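Purely as an illustration of that nesting (the field group, field name, level, and thresholds below are made up and not taken from the actual vx configuration files), the values file handed to the templater ends up holding something of this shape:

# Illustrative only: field group -> field name -> level -> list of thresholds.
cat > "${tmpfile}" <<EOF
vx_config_dict:
  APCP:
    APCP:
      A3: ['gt0.0', 'ge2.54']
EOF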

gsketefian commented 7 months ago

@MichaelLueken FYI I will probably not be at the SRW code management meeting today (1/25/2024) due to other work I need to get done. I'm happy to merge this PR in as-is (assuming it passes tests) if no one is available to take a look at the UW CLI issue (we can take care of that in a future PR). Thanks!

gsketefian commented 6 months ago

@MichaelLueken @christinaholtNOAA I've narrowed down where (and possibly why) the failure during the call to the uwtools CLI templater happens, and I hope that will help you pinpoint the problem and find a fix.

To recap, if I use an older version of uwtools from before the introduction of the conda python package installation, the vx tasks in my WE2E tests work, i.e. they are able to generate a METplus configuration file from the jinja2 templates I have created in this PR. However, when I switch the ex-scripts for the vx tasks to the new uwtools CLI (which is supposed to use the conda-installed packages), those tasks fail because the METplus config files are not rendered from the jinja2 templates. The question now is: why does the uwtools CLI approach fail?

To verify that this is really what's going on, I first ran the WE2E test MET_ensemble_verification_only_vx using my branch as-is (hash 16b4d7b4d8e). That was successful. This hash uses the uwtools templating code that I copied (with minor modifications) from a previous version of uwtools that I've placed at

ufs-srweather-app/parm/metplus/uw_tools

(I will remove this directory from my branch once we've figured out a way to get things working with the uwtools CLI.) Then I went back and relaunched just the run_MET_GridStat_vx_APCP01h_mem001 task (because that is enough to demonstrate the issue) but this time using the uwtools CLI. I did that by editing exregional_run_met_gridstat_or_pointstat_vx.sh to comment out the snippet

#
# Call the python script to generate the METplus configuration file from
# the jinja template.
#
python3 ${METPLUS_CONF}/uw_tools/templater.py \
  -c "${tmpfile}" \
  -i "${metplus_config_tmpl_fp}" \
  -o "${metplus_config_fp}" || \
print_err_msg_exit "\
Call to workflow-tools templater.py to generate a METplus configuration
file from a jinja template failed.  Parameters passed to this script are:
  Full path to template METplus configuration file:
    metplus_config_tmpl_fp = \"${metplus_config_tmpl_fp}\"
  Full path to output METplus configuration file:
    metplus_config_fp = \"${metplus_config_fp}\"
  Full path to configuration file:
    ${tmpfile}
"
rm $tmpfile

and uncommenting this one:

uw template render \
  -i ${metplus_config_tmpl_fp} \
  -o ${metplus_config_fp} \
  -v \
  --dry-run \
  --values-file "${tmpfile}"

#  --dry_run \
err=$?
rm $tmpfile
if [ $err -ne 0 ]; then
  message_txt="Error rendering template for METplus config.
     Contents of input are:
$settings"
  if [ "${RUN_ENVIR}" = "nco" ] && [ "${MACHINE}" = "WCOSS2" ]; then
    err_exit "${message_txt}"
  else
    print_err_msg_exit "${message_txt}"
  fi
fi

As expected, rerunning the task with this code snippet caused the task to fail.

In the successful run with the old uwtools, the relevant section of the task's log file (run_MET_GridStat_vx_APCP01h_mem001_2019061500.log) looks like this:

...
==>> filepath = /scratch2/BMC/det/Gerard.Ketefian/UFS_CAM/PR_nep_nmep/ufs-srweather-app/parm/metplus/tmp.vx_config_det_dict.split_fcst_obs.txt
Running set_template with args: 
----------------------------------------------------------------------
----------------------------------------------------------------------
        outfile: /scratch2/BMC/det/Gerard.Ketefian/UFS_CAM/PR_nep_nmep/expt_dirs/test_uw_tools/MET_ensemble_verification_only_vx/2019061500//mem001/metprd/GridStat/GridStat_APCP01h_mem001.conf
 input_template: /scratch2/BMC/det/Gerard.Ketefian/UFS_CAM/PR_nep_nmep/ufs-srweather-app/parm/metplus/GridStat_or_PointStat.conf
    config_file: /scratch2/BMC/det/Gerard.Ketefian/UFS_CAM/PR_nep_nmep/expt_dirs/test_uw_tools/MET_ensemble_verification_only_vx/met_plus_settings.ceEdkR.yaml
   config_items: []
        dry_run: False
  values_needed: False
        verbose: False
          quiet: False
       log_file: /scratch2/BMC/det/Gerard.Ketefian/UFS_CAM/PR_nep_nmep/ufs-srweather-app/parm/metplus/uw_tools/templater.log
----------------------------------------------------------------------
----------------------------------------------------------------------

Calling METplus to run MET's grid_stat tool for field(s): APCP01h
02/16 00:43:46.817Z metplus.d98a3661 INFO: Running METplus v5.1.0 as user Gerard.Ketefian(10666) with command: /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.5.0/envs/unified-env-noavx512/install/intel/2021.5.0/metplus-5.1.0-imllywy/ush/run_metplus.py -c /scratch2/BMC/det/Gerard.Ketefian/UFS_CAM/PR_nep_nmep/ufs-srweather-app/parm/metplus/common.conf -c /scratch2/BMC/det/Gerard.Ketefian/UFS_CAM/PR_nep_nmep/expt_dirs/test_uw_tools/MET_ensemble_verification_only_vx/2019061500//mem001/metprd/GridStat/GridStat_APCP01h_mem001.conf
...

In the unsuccessful run with the uwtools CLI, that portion looks like this:

...
==>> filepath = /scratch2/BMC/det/Gerard.Ketefian/UFS_CAM/PR_nep_nmep/ufs-srweather-app/parm/metplus/tmp.vx_config_det_dict.split_fcst_obs.txt
[2024-02-15T18:24:04]    DEBUG Command: uw template render -i /scratch2/BMC/det/Gerard.Ketefian/UFS_CAM/PR_nep_nmep/ufs-srweather-app/parm/metplus/GridStat_or_PointStat.conf -o /scratch2/BMC/det/Gerard.Ketefian/UFS_CAM/PR_nep_nmep/expt_dirs/test_uw_tools/MET_ensemble_verification_only_vx/2019061500//mem001/metprd/GridStat/GridStat_APCP01h_mem001.conf -v --values-file /scratch2/BMC/det/Gerard.Ketefian/UFS_CAM/PR_nep_nmep/expt_dirs/test_uw_tools/MET_ensemble_verification_only_vx/met_plus_settings.dxnXqr.yaml
[2024-02-15T18:24:04]    DEBUG Internal arguments:
[2024-02-15T18:24:04]    DEBUG ---------------------------------------------------------------------
[2024-02-15T18:24:04]    DEBUG       input_file: /scratch2/BMC/det/Gerard.Ketefian/UFS_CAM/PR_nep_nmep/ufs-srweather-app/parm/metplus/GridStat_or_PointStat.conf
[2024-02-15T18:24:04]    DEBUG      output_file: /scratch2/BMC/det/Gerard.Ketefian/UFS_CAM/PR_nep_nmep/expt_dirs/test_uw_tools/MET_ensemble_verification_only_vx/2019061500//mem001/metprd/GridStat/GridStat_APCP01h_mem001.conf
[2024-02-15T18:24:04]    DEBUG      values_file: /scratch2/BMC/det/Gerard.Ketefian/UFS_CAM/PR_nep_nmep/expt_dirs/test_uw_tools/MET_ensemble_verification_only_vx/met_plus_settings.dxnXqr.yaml
[2024-02-15T18:24:04]    DEBUG    values_format: yaml
[2024-02-15T18:24:04]    DEBUG        overrides: {}
[2024-02-15T18:24:04]    DEBUG    values_needed: False
[2024-02-15T18:24:04]    DEBUG          dry_run: False
[2024-02-15T18:24:04]    DEBUG ---------------------------------------------------------------------
[2024-02-15T18:24:04]    DEBUG Read initial values from /scratch2/BMC/det/Gerard.Ketefian/UFS_CAM/PR_nep_nmep/expt_dirs/test_uw_tools/MET_ensemble_verification_only_vx/met_plus_settings.dxnXqr.yaml
End exregional_run_met_gridstat_or_pointstat_vx.sh at Fri Feb 16 01:24:04 UTC 2024 with error code 1 (time elapsed: 00:00:09)
FATAL ERROR: 
ERROR:
  From script:  "JREGIONAL_RUN_MET_GRIDSTAT_OR_POINTSTAT_VX"
  Full path to script:  "/scratch2/BMC/det/Gerard.Ketefian/UFS_CAM/PR_nep_nmep/ufs-srweather-app/jobs/JREGIONAL_RUN_MET_GRIDSTAT_OR_POINTSTAT_VX"
Call to ex-script corresponding to J-job "JREGIONAL_RUN_MET_GRIDSTAT_OR_POINTSTAT_VX" failed.
Exiting with nonzero status.
End JREGIONAL_RUN_MET_GRIDSTAT_OR_POINTSTAT_VX at Fri Feb 16 01:24:04 UTC 2024 with error code 1 (time elapsed: 00:00:10)
...

So you can see that there isn't much of an error message or traceback to follow.

After debugging, I narrowed down the location of the issue to the call to the render() method in templater.py (which is in the directory ufs-srweather-app/conda/pkgs/uwtools-1.0.0-pyh28a5fc4_0/site-packages/uwtools/config), specifically this line:

print(template.render(), file=f)

Note that the print isn't the issue; it's the call to template.render(), as can be shown by trying a dry run. I then tried to narrow things down further by putting debugging statements in the jinja code at ufs-srweather-app/conda/pkgs/jinja2-3.0.3-pyhd8ed1ab_0/site-packages, but when none of my debugging statements appeared in the log file, I realized that this jinja package (jinja2 v3.0.3) wasn't the one being used; instead, py-jinja2 v3.1.2 in EPIC's spack-stack at /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.5.0/envs/unified-env-noavx512/install/intel/2021.5.0/py-jinja2-3.1.2-4j2npog/lib/python3.10/site-packages/jinja2 was being used. I'm guessing that the uwtools CLI expects the version of jinja2 that's in the conda installation. If that's the case, then this version difference is potentially the reason for the failure I'm encountering, but of course I can't be sure until I can test with jinja2 v3.0.3.

Note that I did try to switch back to jinja2 v3.0.3 by adding its location to the system path at the start of j2template.py (in ufs-srweather-app/conda/pkgs/uwtools-1.0.0-pyh28a5fc4_0/site-packages/uwtools/config) using calls to sys.path.insert() and sys.path.append(), but that didn't work; py-jinja2 v3.1.2 was still being used. I'm not sure why.

So that's where things currently stand. I appreciate any help you can provide with this issue. Thanks!

MichaelLueken commented 6 months ago

Hi @gsketefian -

On Friday, I was able to successfully clone your feature/metplus_conf_templates branch on Hera, comment out the calls to the templater.py script, and uncomment the uw template render usage. I then submitted the MET_verification_winter_wx WE2E test and it failed:

rocotostat -w FV3LAM_wflow.xml -d FV3LAM_wflow.db -v 10
       CYCLE                    TASK                       JOBID               STATE         EXIT STATUS     TRIES      DURATION
================================================================================================================================
202302170000               make_grid                    56092166           SUCCEEDED                   0         1          27.0
202302170000               make_orog                    56098540           SUCCEEDED                   0         1          49.0
202302170000          make_sfc_climo                    56098877           SUCCEEDED                   0         1          51.0
202302170000           get_extrn_ics                    56092167           SUCCEEDED                   0         1          11.0
202302170000          get_extrn_lbcs                    56092168           SUCCEEDED                   0         1          15.0
202302170000         make_ics_mem000                    56107864           SUCCEEDED                   0         1          41.0
202302170000        make_lbcs_mem000                    56107865           SUCCEEDED                   0         1          57.0
202302170000         run_fcst_mem000                    56128982           SUCCEEDED                   0         1         234.0
202302170000    run_post_mem000_f000                    56129191           SUCCEEDED                   0         1          21.0
202302170000    run_post_mem000_f001                    56129192           SUCCEEDED                   0         1          21.0
202302170000    run_post_mem000_f002                    56129193           SUCCEEDED                   0         1          21.0
202302170000    run_post_mem000_f003                    56129194           SUCCEEDED                   0         1          13.0
202302170000    run_post_mem000_f004                    56129195           SUCCEEDED                   0         1          13.0
202302170000    run_post_mem000_f005                    56129196           SUCCEEDED                   0         1          17.0
202302170000    run_post_mem000_f006                    56129197           SUCCEEDED                   0         1          14.0
202302170000            get_obs_ccpa                    56092169           SUCCEEDED                   0         1          22.0
202302170000          get_obs_nohrsc                    56092170           SUCCEEDED                   0         1          11.0
202302170000            get_obs_mrms                    56092171           SUCCEEDED                   0         1          43.0
202302170000            get_obs_ndas                    56092172           SUCCEEDED                   0         1          16.0
202302170000       run_MET_Pb2nc_obs                    56092174           SUCCEEDED                   0         1          56.0
202302170000    run_MET_PcpCombine_obs_APCP01h                    56092175           SUCCEEDED                   0         1          27.0
202302170000    run_MET_PcpCombine_obs_APCP03h                    56092176           SUCCEEDED                   0         1          26.0
202302170000    run_MET_PcpCombine_obs_APCP06h                    56092177           SUCCEEDED                   0         1          25.0
202302170000    check_post_output_mem000                    56131083           SUCCEEDED                   0         1          25.0
202302170000    run_MET_PcpCombine_fcst_APCP01h_mem000                    56132041           SUCCEEDED                   0         1          28.0
202302170000    run_MET_PcpCombine_fcst_APCP03h_mem000                    56132042           SUCCEEDED                   0         1          24.0
202302170000    run_MET_PcpCombine_fcst_APCP06h_mem000                    56132043           SUCCEEDED                   0         1          25.0
202302170000    run_MET_PcpCombine_fcst_ASNOW06h_mem000                    56132044           SUCCEEDED                   0         1          25.0
202302170000    run_MET_GridStat_vx_APCP01h_mem000                    56132226                DEAD                   1         2          19.0
202302170000    run_MET_GridStat_vx_APCP03h_mem000                    56132221                DEAD                   1         2          16.0
202302170000    run_MET_GridStat_vx_APCP06h_mem000                    56132239                DEAD                   1         2          12.0
202302170000    run_MET_GridStat_vx_ASNOW06h_mem000                    56132240                DEAD                   1         2          12.0
202302170000    run_MET_GridStat_vx_REFC_mem000                    56132045                DEAD                   1         1          20.0
202302170000    run_MET_GridStat_vx_RETOP_mem000                    56132046                DEAD                   1         1          21.0
202302170000    run_MET_PointStat_vx_ADPSFC_mem000                    56132047                DEAD                   1         1          16.0
202302170000    run_MET_PointStat_vx_ADPUPA_mem000                    56132048                DEAD                   1         1          13.0

The GridStat and PointStat jobs failed with the following error message:

[Errno 2] No such file or directory: '/scratch2/NAGAPE/epic/Michael.Lueken/metplus_test/parm/metplus/GridStat_or_PointStat.conf'

In your modifications to exregional_run_met_gridstat_or_pointstat_vx.sh, I see that you made the following change to metplus_config_tmpl_fn:

metplus_config_tmpl_fn="GridStat_or_PointStat"

Is this the expected behavior?

The path to my clone of your branch on Hera is /scratch2/NAGAPE/epic/Michael.Lueken/metplus_test and the path to my expt_dirs is /scratch2/NAGAPE/epic/Michael.Lueken/expt_dirs/MET_verification_winter_wx

gsketefian commented 5 months ago

@MichaelLueken Thanks for testing. Apparently I wasn't tracking that file (parm/metplus/GridStat_or_PointStat.conf). I just added it and pushed the commit. Can you retry? Thanks.

MichaelLueken commented 5 months ago

@gsketefian -

Thanks for adding the parm/metplus/GridStat_or_PointStat.conf file!

I was able to rerun the MET_verification_winter_wx test and it is failing in a manner similar to what you are seeing:

==>> filepath = /scratch2/NAGAPE/epic/Michael.Lueken/metplus_test/parm/metplus/tmp.vx_config_det_dict.split_fcst_obs.txt
[2024-02-29T10:24:30]    DEBUG Command: uw template render -i /scratch2/NAGAPE/epic/Michael.Lueken/metplus_test/parm/metplus/GridStat_or_PointStat.conf -o /scratch2/NAGAPE/epic/Michael.Lueken/expt_dirs/MET_verification_winter_wx/2023021700//metprd/GridStat/GridStat_APCP01h_mem000.conf -v --values-file /scratch2/NAGAPE/epic/Michael.Lueken/metplus_test/tests/WE2E/met_plus_settings.MP3Z0m.yaml
[2024-02-29T10:24:30]    DEBUG Internal arguments:
[2024-02-29T10:24:30]    DEBUG ---------------------------------------------------------------------
[2024-02-29T10:24:30]    DEBUG       input_file: /scratch2/NAGAPE/epic/Michael.Lueken/metplus_test/parm/metplus/GridStat_or_PointStat.conf
[2024-02-29T10:24:30]    DEBUG      output_file: /scratch2/NAGAPE/epic/Michael.Lueken/expt_dirs/MET_verification_winter_wx/2023021700//metprd/GridStat/GridStat_APCP01h_mem000.conf
[2024-02-29T10:24:30]    DEBUG      values_file: /scratch2/NAGAPE/epic/Michael.Lueken/metplus_test/tests/WE2E/met_plus_settings.MP3Z0m.yaml
[2024-02-29T10:24:30]    DEBUG    values_format: yaml
[2024-02-29T10:24:30]    DEBUG        overrides: {}
[2024-02-29T10:24:30]    DEBUG    values_needed: False
[2024-02-29T10:24:30]    DEBUG          dry_run: False
[2024-02-29T10:24:30]    DEBUG ---------------------------------------------------------------------
[2024-02-29T10:24:30]    DEBUG Read initial values from /scratch2/NAGAPE/epic/Michael.Lueken/metplus_test/tests/WE2E/met_plus_settings.MP3Z0m.yaml
/scratch2/NAGAPE/epic/Michael.Lueken/metplus_test/parm/metplus/metplus_macros.jinja
End exregional_run_met_gridstat_or_pointstat_vx.sh at Thu Feb 29 10:24:30 UTC 2024 with error code 1 (time elapsed: 00:00:11)
FATAL ERROR:
ERROR:
  From script:  "JREGIONAL_RUN_MET_GRIDSTAT_OR_POINTSTAT_VX"
  Full path to script:  "/scratch2/NAGAPE/epic/Michael.Lueken/metplus_test/jobs/JREGIONAL_RUN_MET_GRIDSTAT_OR_POINTSTAT_VX"
Call to ex-script corresponding to J-job "JREGIONAL_RUN_MET_GRIDSTAT_OR_POINTSTAT_VX" failed.
Exiting with nonzero status.
End JREGIONAL_RUN_MET_GRIDSTAT_OR_POINTSTAT_VX at Thu Feb 29 10:24:30 UTC 2024 with error code 1 (time elapsed: 00:00:12)

It's not clear to me where the parm/metplus/metplus_macros.jinja file is being used. In your test, it died immediately after the call to uw template render, but it looks like it might have progressed slightly in my test. Is it used in uwtools, or in METplus' run_metplus.py? Thanks!

gsketefian commented 5 months ago

@MichaelLueken Thanks for retesting the PR. The file parm/metplus/metplus_macros.jinja contains a set of jinja2 macros that are used in various places in most of the new METplus conf templates. This file is imported into the templates using this statement:

{%- import metplus_templates_dir ~ '/metplus_macros.jinja' as metplus_macros %}

I suspect it is this statement that prints out the path to the macros file in the log file. The templates that do this import are:

EnsembleStat.conf
GenEnsProd.conf
GridStat_ensmean.conf
GridStat_ensprob.conf
GridStat_or_PointStat.conf
PointStat_ensmean.conf
PointStat_ensprob.conf

The templates/files that don't do this import are Pb2nc_obs.conf, PcpCombine.conf, and common.conf. The templates that do import the macros file are also the ones that need to read in the jinja variable vx_config_dict (which is actually a python dictionary), so further testing is needed to determine whether the problem is the import statement or the use of vx_config_dict.

gsketefian commented 5 months ago

@MichaelLueken I also see that in the next spack-stack version (spack-stack-1.6.0, not yet implemented in the SRW App), the jinja2 version reverts to py-jinja2-3.0.3 from the py-jinja2-3.1.2 used in spack-stack-1.5.0 (which is what my PR uses) and spack-stack-1.5.1. This is the same version as in the conda installation of jinja2 in my branch (jinja2 v3.0.3). I would like to try this out by updating the spack-stack version in build_hera_intel.lua, but it would take a lot of work to also update the versions in srw_common.lua.

gsketefian commented 5 months ago

Here's the issue @mkavulich created in the uwtools repo regarding the use of jinja2 macros.

MichaelLueken commented 5 months ago

@gsketefian -

I have gone in and made the necessary modifications to your feature/metplus_conf_templates branch to build and run using spack-stack v1.6.0. Unfortunately, this work required a lot more than just updating the path in build_hera_intel.lua and the entries in srw_common.lua. Both FMS and ESMF entries had to be changed, which required an update to the weather-model hash. This required renaming NEMS/nems to UFS/ufs throughout the SRW App. Additionally, the ush/set_ozone_param.py script had to be updated since the ozphys scheme had been removed in the SDF. If you would like to see the changes that were made, please see - /scratch2/NAGAPE/epic/Michael.Lueken/metplus_test.

I have queued up the MET_ensemble_verification_only_vx test using my updated working copy of your branch.

I'll let you know how the test turns out once complete.

MichaelLueken commented 5 months ago

@gsketefian -

The MET_ensemble_verification_only_vx test is still failing in run_MET_GridStat_vx_APCP01h_mem001:

==>> filepath = /scratch2/NAGAPE/epic/Michael.Lueken/metplus_test/parm/metplus/tmp.vx_config_det_dict.split_fcst_obs.txt
[2024-03-01T18:53:01]    DEBUG Command: uw template render -i /scratch2/NAGAPE/epic/Michael.Lueken/metplus_test/parm/metplus/GridStat_or_PointStat.conf -o /scratch2/NAGAPE/epic/Michael.Lueken/expt_dirs/MET_ensemble_verification_only_vx/2019061500//mem001/metprd/GridStat/GridStat_APCP01h_mem001.conf -v --values-file /scratch2/NAGAPE/epic/Michael.Lueken/metplus_test/tests/WE2E/met_plus_settings.aOzaNy.yaml
[2024-03-01T18:53:01]    DEBUG Internal arguments:
[2024-03-01T18:53:01]    DEBUG ---------------------------------------------------------------------
[2024-03-01T18:53:01]    DEBUG       input_file: /scratch2/NAGAPE/epic/Michael.Lueken/metplus_test/parm/metplus/GridStat_or_PointStat.conf
[2024-03-01T18:53:01]    DEBUG      output_file: /scratch2/NAGAPE/epic/Michael.Lueken/expt_dirs/MET_ensemble_verification_only_vx/2019061500//mem001/metprd/GridStat/GridStat_APCP01h_mem001.conf
[2024-03-01T18:53:01]    DEBUG      values_file: /scratch2/NAGAPE/epic/Michael.Lueken/metplus_test/tests/WE2E/met_plus_settings.aOzaNy.yaml
[2024-03-01T18:53:01]    DEBUG    values_format: yaml
[2024-03-01T18:53:01]    DEBUG        overrides: {}
[2024-03-01T18:53:01]    DEBUG    values_needed: False
[2024-03-01T18:53:01]    DEBUG          dry_run: False
[2024-03-01T18:53:01]    DEBUG ---------------------------------------------------------------------
[2024-03-01T18:53:01]    DEBUG Read initial values from /scratch2/NAGAPE/epic/Michael.Lueken/metplus_test/tests/WE2E/met_plus_settings.aOzaNy.yaml
/scratch2/NAGAPE/epic/Michael.Lueken/metplus_test/parm/metplus/metplus_macros.jinja
End exregional_run_met_gridstat_or_pointstat_vx.sh at Fri Mar  1 18:53:01 UTC 2024 with error code 1 (time elapsed: 00:00:15)

Looks like we'll need a fix for uwtools handling of Jinja macros before moving forward.

gsketefian commented 5 months ago

@MichaelLueken Thanks for doing all that work Mike! At least now we know it's not (just?) the jinja2 version. I will let the uwtools developers know in Issue #422 in that repo.

MichaelLueken commented 5 months ago

@gsketefian -

I have run two more tests using uwtools v2.1.0, one with spack-stack v1.5.0 and one with spack-stack v1.6.0. Both tests failed.

Using spack-stack v1.5.0:

[2024-03-14T15:51:51]    DEBUG Read initial values from /scratch2/NAGAPE/epic/Michael.Lueken/metplus_test/tests/WE2E/met_plus_settings.SvtOb9.yaml
Traceback (most recent call last):
  File "/scratch2/NAGAPE/epic/Michael.Lueken/metplus_test/conda/envs/srw_app/bin/uw", line 10, in <module>
    sys.exit(main())
             ^^^^^^
  File "/scratch2/NAGAPE/epic/Michael.Lueken/metplus_test/conda/envs/srw_app/lib/python3.11/site-packages/uwtools/cli.py", line 62, in main
    sys.exit(0 if modes[args[STR.mode]](args) else 1)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/scratch2/NAGAPE/epic/Michael.Lueken/metplus_test/conda/envs/srw_app/lib/python3.11/site-packages/uwtools/cli.py", line 460, in _dispatch_template
    return {
           ^
  File "/scratch2/NAGAPE/epic/Michael.Lueken/metplus_test/conda/envs/srw_app/lib/python3.11/site-packages/uwtools/cli.py", line 475, in _dispatch_template_render
    uwtools.api.template.render(
  File "/scratch2/NAGAPE/epic/Michael.Lueken/metplus_test/conda/envs/srw_app/lib/python3.11/site-packages/uwtools/api/template.py", line 49, in render
    result = _render(
             ^^^^^^^^
  File "/scratch2/NAGAPE/epic/Michael.Lueken/metplus_test/conda/envs/srw_app/lib/python3.11/site-packages/uwtools/config/jinja2.py", line 186, in render
    rendered = template.render()
               ^^^^^^^^^^^^^^^^^
  File "/scratch2/NAGAPE/epic/Michael.Lueken/metplus_test/conda/envs/srw_app/lib/python3.11/site-packages/uwtools/config/jinja2.py", line 83, in render
    return self._template.render(self._values)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.5.0/envs/unified-env-rocky8/install/intel/2021.5.0/py-jinja2-3.1.2-3qntrcl/lib/python3.10/site-packages/jinja2/environment.py", line 1301, in render
    self.environment.handle_exception()
  File "/scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.5.0/envs/unified-env-rocky8/install/intel/2021.5.0/py-jinja2-3.1.2-3qntrcl/lib/python3.10/site-packages/jinja2/environment.py", line 936, in handle_exception
    raise rewrite_traceback_stack(source=source)
  File "<template>", line 123, in top-level template code
  File "/scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.5.0/envs/unified-env-rocky8/install/intel/2021.5.0/py-jinja2-3.1.2-3qntrcl/lib/python3.10/site-packages/jinja2/loaders.py", line 218, in get_source
    raise TemplateNotFound(template)
jinja2.exceptions.TemplateNotFound: /scratch2/NAGAPE/epic/Michael.Lueken/metplus_test/parm/metplus/metplus_macros.jinja

and using spack-stack v1.6.0:

[2024-03-15T15:53:35]    DEBUG Read initial values from /scratch2/NAGAPE/epic/Michael.Lueken/metplus_test/tests/WE2E/met_plus_settings.uBD5dZ.yaml
Traceback (most recent call last):
  File "/scratch2/NAGAPE/epic/Michael.Lueken/metplus_test/conda/envs/srw_app/bin/uw", line 10, in <module>
    sys.exit(main())
             ^^^^^^
  File "/scratch2/NAGAPE/epic/Michael.Lueken/metplus_test/conda/envs/srw_app/lib/python3.11/site-packages/uwtools/cli.py", line 62, in main
    sys.exit(0 if modes[args[STR.mode]](args) else 1)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/scratch2/NAGAPE/epic/Michael.Lueken/metplus_test/conda/envs/srw_app/lib/python3.11/site-packages/uwtools/cli.py", line 460, in _dispatch_template
    return {
           ^
  File "/scratch2/NAGAPE/epic/Michael.Lueken/metplus_test/conda/envs/srw_app/lib/python3.11/site-packages/uwtools/cli.py", line 475, in _dispatch_template_render
    uwtools.api.template.render(
  File "/scratch2/NAGAPE/epic/Michael.Lueken/metplus_test/conda/envs/srw_app/lib/python3.11/site-packages/uwtools/api/template.py", line 49, in render
    result = _render(
             ^^^^^^^^
  File "/scratch2/NAGAPE/epic/Michael.Lueken/metplus_test/conda/envs/srw_app/lib/python3.11/site-packages/uwtools/config/jinja2.py", line 186, in render
    rendered = template.render()
               ^^^^^^^^^^^^^^^^^
  File "/scratch2/NAGAPE/epic/Michael.Lueken/metplus_test/conda/envs/srw_app/lib/python3.11/site-packages/uwtools/config/jinja2.py", line 83, in render
    return self._template.render(self._values)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/envs/unified-env-rocky8/install/intel/2021.5.0/py-jinja2-3.0.3-2g4rjkz/lib/python3.10/site-packages/jinja2/environment.py", line 1291, in render
    self.environment.handle_exception()
  File "/scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/envs/unified-env-rocky8/install/intel/2021.5.0/py-jinja2-3.0.3-2g4rjkz/lib/python3.10/site-packages/jinja2/environment.py", line 925, in handle_exception
    raise rewrite_traceback_stack(source=source)
  File "<template>", line 123, in top-level template code
  File "/scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/envs/unified-env-rocky8/install/intel/2021.5.0/py-jinja2-3.0.3-2g4rjkz/lib/python3.10/site-packages/jinja2/loaders.py", line 214, in get_source
    raise TemplateNotFound(template)
jinja2.exceptions.TemplateNotFound: /scratch2/NAGAPE/epic/Michael.Lueken/metplus_test/parm/metplus/metplus_macros.jinja

The metplus_macros.jinja file is present:

-rw-r--r-- 1 Michael.Lueken epic 12582 Feb 22 20:51 /scratch2/NAGAPE/epic/Michael.Lueken/metplus_test/parm/metplus/metplus_macros.jinja

It is interesting to see paths from both conda (/scratch2/NAGAPE/epic/Michael.Lueken/metplus_test/conda/envs/srw_app/lib/python3.11/site-packages/uwtools/config/jinja2.py) and the spack-stack jinja2 package in the traceback, rather than just the conda environment. It looks like this might be pointing to issues between spack-stack and conda.

@christinaholtNOAA and @maddenp-noaa -

Are additional options required in order to exercise the jinja macro file? I just want to make sure before stating that this is an issue with spack-stack and conda. Thank you very much for your time.

maddenp-noaa commented 5 months ago

@MichaelLueken As of uwtools v2.1.0, templates referenced via Jinja2 expressions such as import should be found if either

  1. The referenced template is in the same directory as the referencing template, or
  2. The --search-path option to uw is used to specify a colon-delimited (similar to e.g. PATH) set of paths to search for referenced templates.

It looks like condition 1 isn't satisfied in either case. What does the uw invocation look like? It might need to be updated to specify --search-path.

It may be worth noting that template references in Jinja2 don't have any a priori meaning: They're just names from Jinja2's perspective, and the template-rendering code (uw in this case) has to define what they mean. In uwtools v2.1.0, we've defined them to mean "Files at filesystem paths" and search by default in the directory containing the main template file. A custom search path defined by --search-path overrides this default.

So, for example, if I have $HOME/template.jinja2 with content

{% import "macros.jinja2" as macros -%}
{{ macros.double(1) }}

and /tmp/macros.jinja2 with content

{% macro double(n) %}{{ n * 2 }}{% endmacro %}

I can

% uw template render --input-file ~/template.jinja2 --search-path /tmp
2

Without --search-path I get an exception and stack trace similar to what you posted.

You can use absolute paths in template references with --search-path /, but that's probably bad for portability. Relative references should be ok, e.g. a foo/macros.jinja2 reference should work if /tmp/foo/macros.jinja2 exists and --search-path /tmp is specified.

With respect to the mix of paths between conda-supplied Python and spack-supplied Python, it's concerning but not necessarily a problem if the packages (jinja2 in this case) are compatible. I'd be interested to know if PYTHONPATH is set at the time uw is invoked. If you might be able to echo $PYTHONPATH just before the uw invocation that led to those stacktraces, I'd like to see that.

MichaelLueken commented 5 months ago

@maddenp-noaa -

Thank you very much for the insight and suggestion of trying to add the --search-path option. I will try again using --search-path to see if that corrects the issue.

The current uw invocation that is failing looks like:

[2024-03-15T15:53:35]    DEBUG Command: uw template render -i /scratch2/NAGAPE/epic/Michael.Lueken/metplus_test/parm/metplus/GenEnsProd.conf -o /scratch2/NAGAPE/epic/Michael.Lueken/expt_dirs/MET_ensemble_verification_only_vx/2019061500/metprd/GenEnsProd/GenEnsProd_ADPSFC.conf -v --values-file /scratch2/NAGAPE/epic/Michael.Lueken/metplus_test/tests/WE2E/met_plus_settings.uBD5dZ.yaml

I will also add echo $PYTHONPATH to the line before the uw invocation and will report back on that as well.

MichaelLueken commented 5 months ago

@maddenp-noaa -

The echo $PYTHONPATH printed the following:

/scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/envs/unified-env-rocky8/install/intel/2021.5.0/py-pyyaml-6.0-nlgnppd/lib/python3.10/site-packages:/scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/envs/unified-env-rocky8/install/intel/2021.5.0/py-jinja2-3.0.3-2g4rjkz/lib/python3.10/site-packages:/scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/envs/unified-env-rocky8/install/intel/2021.5.0/py-markupsafe-2.1.3-q2gnvtr/lib/python3.10/site-packages:/scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/envs/unified-env-rocky8/install/intel/2021.5.0/py-f90nml-1.4.3-dz2bima/lib/python3.10/site-packages:/scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/envs/unified-env-rocky8/install/intel/2021.5.0/py-cython-0.29.36-r5a6vgy/lib/python3.10/site-packages:/scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/envs/unified-env-rocky8/install/intel/2021.5.0/py-xarray-2023.7.0-tbnsqaw/lib/python3.10/site-packages:/scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/envs/unified-env-rocky8/install/intel/2021.5.0/py-packaging-23.1-ke4uivr/lib/python3.10/site-packages:/scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/envs/unified-env-rocky8/install/intel/2021.5.0/py-pandas-1.5.3-7hathmq/lib/python3.10/site-packages:/scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/envs/unified-env-rocky8/install/intel/2021.5.0/py-xlwt-1.3.0-f7bd3ip/lib/python3.10/site-packages:/scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/envs/unified-env-rocky8/install/intel/2021.5.0/py-xlsxwriter-3.1.7-pmb7bam/lib/python3.10/site-packages:/scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/envs/unified-env-rocky8/install/intel/2021.5.0/py-xlrd-2.0.1-56miqff/lib/python3.10/site-packages:/scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/envs/unified-env-rocky8/install/intel/2021.5.0/py-pyxlsb-1.0.10-jlrsqbq/lib/python3.10/site-packages:/scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/envs/unified-env-rocky8/install/intel/2021.5.0/py-pytz-2023.3-5nykyqa/lib/python3.10/site-packages:/scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/envs/unified-env-rocky8/install/intel/2021.5.0/py-python-dateutil-2.8.2-lfyvko2/lib/python3.10/site-packages:/scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/envs/unified-env-rocky8/install/intel/2021.5.0/py-six-1.16.0-6knf6az/lib/python3.10/site-packages:/scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/envs/unified-env-rocky8/install/intel/2021.5.0/py-openpyxl-3.1.2-rv2hu6u/lib/python3.10/site-packages:/scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/envs/unified-env-rocky8/install/intel/2021.5.0/py-et-xmlfile-1.0.1-7goajc5/lib/python3.10/site-packages:/scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/envs/unified-env-rocky8/install/intel/2021.5.0/py-numexpr-2.8.4-ega4hge/lib/python3.10/site-packages:/scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/envs/unified-env-rocky8/install/intel/2021.5.0/py-bottleneck-1.3.7-snzpnsr/lib/python3.10/site-packages:/scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/envs/unified-env-rocky8/install/intel/2021.5.0/py-netcdf4-1.5.8-qs672u4/lib/python3.10/site-packages:/scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/envs/unified-env-rocky8/install/intel/2021.5.0/py-cftime-1.0.3.4-rnt66hg/lib/python3.10/site-packages:/scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/envs/unified-env-rocky8/install/intel/2021.5.0/bufr-12.0.1-nz7jhfd/lib64/python3.10/site-packages:/scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/envs/unified-env-rocky8/install/intel/2021.5.0/bufr-12.0.1-nz7jhfd/lib/python3.10/site-packages:/scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/envs/unified-env-rocky8/install/intel/2021.5.0/py-numpy-1.22.3-snzl53m/lib/python3.10/site-packages:/scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.6.0/envs/unified-env-rocky8/install/intel/2021.5.0/py-setuptools-63.4.3-wvdr656/lib/python3.10/site-packages

So, something is happening before the uw invocation that is changing the PYTHONPATH from the conda environment to spack-stack. I'll try setting PYTHONPATH to the conda environment before the uw invocation and see if that helps.

I updated the uw invocation with the following:

[2024-03-15T19:22:08] DEBUG Command: uw template render -i /scratch2/NAGAPE/epic/Michael.Lueken/metplus_test/parm/metplus/GenEnsProd.conf -o /scratch2/NAGAPE/epic/Michael.Lueken/expt_dirs/MET_ensemble_verification_only_vx/2019061500/metprd/GenEnsProd/GenEnsProd_APCP01h.conf --search-path /scratch2/NAGAPE/epic/Michael.Lueken/metplus_test/parm/metplus -v --values-file /scratch2/NAGAPE/epic/Michael.Lueken/metplus_test/tests/WE2E/met_plus_settings.Ylp4T3.yaml

The test is still failing to find the jinja macro file, metplus_macros.jinja.

maddenp-noaa commented 5 months ago

In re: PYTHONPATH, conda does not set PYTHONPATH for its own use, and the environments it creates do not depend on it for correct operation. But it also doesn't interfere with it if set, so the Python interpreter will still use it, which I think is what we're seeing here. You could check this by wrapping the uw invocation in a subshell with parens, and unsetting PYTHONPATH in the subshell. That would look like changing an invocation like

uw <args>

to

(unset PYTHONPATH && uw <args>)

The parens create a temporary subshell, inside which PYTHONPATH will be unset; when the uw command completes and the subshell exits, PYTHONPATH will still be set as before in the parent shell, no harm done. If you try this and look at the failure stack trace, I would hope that you'd only see conda paths, and no spack paths. I'd love to know if that works.

In re: the templates, in /scratch2/NAGAPE/epic/Michael.Lueken/metplus_test/parm/metplus/GenEnsProd.conf I see

{%- import metplus_templates_dir ~ '/metplus_macros.jinja' as metplus_macros %}

I don't see where metplus_templates_dir is defined -- do you know what its value is? If it were defined as, say, /path/to, then the concatenated value of the imported template would be /path/to/metplus_macros.jinja, and uw would then be looking for file /scratch2/NAGAPE/epic/Michael.Lueken/metplus_test/parm/metplus/path/to/metplus_macros.jinja in combination with the --search-path value. Whatever metplus_templates_dir is, does metplus_macros.jinja exist at that full path? If the combination of the metplus_templates_dir value and /metplus_macros.jinja is, by itself, an absolute path to the macros file in question, try setting --search-path /, which should allow a valid absolute path to be resolved.

Maybe metplus_templates_dir was defined in file /scratch2/NAGAPE/epic/Michael.Lueken/metplus_test/tests/WE2E/met_plus_settings.Ylp4T3.yaml but it looks like that file doesn't exist now, and the other met_plus_settings.*.yaml files in that directory are not readable by me, so I'm not able to deduce much more at the moment.

MichaelLueken commented 5 months ago

@maddenp-noaa -

The metplus_templates_dir is defined as METPLUS_CONF in scripts/exregional_run_met_genensprod_or_ensemblestat.sh. Further, METPLUS_CONF is defined as /path/to/ufs-srweather-app/parm/metplus in ush/config_defaults.yaml. Since the combination of metplus_templates_dir and /metplus_macros.jinja is the absolute path to the macros file itself, I will try setting --search-path /.

I will also try wrapping the uw invocation in parens with unset PYTHONPATH and will let you know what the stack trace looks like (and whether this corrects the issue).


When I use both the parens with unset PYTHONPATH for the uw invocation and set --search-path /, the number of failures decreases significantly. I'm now finding that the run_MET_EnsembleStat_vx_* tasks (run_MET_EnsembleStat_vx_APCP01h, run_MET_EnsembleStat_vx_APCP03h, run_MET_EnsembleStat_vx_APCP06h, run_MET_EnsembleStat_vx_REFC, run_MET_EnsembleStat_vx_RETOP, run_MET_EnsembleStat_vx_ADPSFC, and run_MET_EnsembleStat_vx_ADPUPA) are the only failures; previously, all verification tasks would fail.

These tasks are failing due to deprecated config items being found in the METplus config file.

Thank you very much for your assistance to get the new jinja macros functionality to work in the SRW App!
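For reference, a hedged sketch of the combined workaround described above, written against the variable names used in the earlier ex-script snippets (note that, as reported later in this thread, the PYTHONPATH part turned out not to be needed in the final version of the PR):

(unset PYTHONPATH && uw template render \
  -i "${metplus_config_tmpl_fp}" \
  -o "${metplus_config_fp}" \
  --values-file "${tmpfile}" \
  --search-path /)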

MichaelLueken commented 5 months ago

@gsketefian -

A strategy to get the updated uwtools to work with this PR has been found. However, the run_MET_EnsembleStat_vx_* tasks are failing with the following error messages:

03/18 15:39:11.313Z metplus.9496b9de (config_validate.py:104) ERROR: DEPRECATED CONFIG ITEMS WERE FOUND. PLEASE FOLLOW THE INSTRUCTIONS TO UPDATE THE CONFIG FILES
03/18 15:39:11.314Z metplus.9496b9de (config_validate.py:107) ERROR: EnsembleStat functionality has been moved to GenEnsProd. The required changes to the config files depend on the type of evaluation that is being performed. Please navigate to the upgrade instructions: https://metplus.readthedocs.io/en/develop/Users_Guide/release-notes.html#metplus-wrappers-upgrade-instructions
03/18 15:39:11.314Z metplus.9496b9de (config_validate.py:107) ERROR: ENSEMBLE_STAT_ENSEMBLE_FLAG_LATLON should be removed
03/18 15:39:11.314Z metplus.9496b9de (config_validate.py:107) ERROR: ENSEMBLE_STAT_ENSEMBLE_FLAG_MEAN should be removed
03/18 15:39:11.314Z metplus.9496b9de (config_validate.py:107) ERROR: ENSEMBLE_STAT_ENSEMBLE_FLAG_STDEV should be removed
03/18 15:39:11.314Z metplus.9496b9de (config_validate.py:107) ERROR: ENSEMBLE_STAT_ENSEMBLE_FLAG_MINUS should be removed
03/18 15:39:11.314Z metplus.9496b9de (config_validate.py:107) ERROR: ENSEMBLE_STAT_ENSEMBLE_FLAG_PLUS should be removed
03/18 15:39:11.314Z metplus.9496b9de (config_validate.py:107) ERROR: ENSEMBLE_STAT_ENSEMBLE_FLAG_MIN should be removed
03/18 15:39:11.314Z metplus.9496b9de (config_validate.py:107) ERROR: ENSEMBLE_STAT_ENSEMBLE_FLAG_MAX should be removed
03/18 15:39:11.314Z metplus.9496b9de (config_validate.py:107) ERROR: ENSEMBLE_STAT_ENSEMBLE_FLAG_RANGE should be removed
03/18 15:39:11.314Z metplus.9496b9de (config_validate.py:107) ERROR: ENSEMBLE_STAT_ENSEMBLE_FLAG_VLD_COUNT should be removed
03/18 15:39:11.314Z metplus.9496b9de (config_validate.py:107) ERROR: ENSEMBLE_STAT_ENSEMBLE_FLAG_FREQUENCY should be removed
03/18 15:39:11.314Z metplus.9496b9de (config_validate.py:107) ERROR: ENSEMBLE_STAT_ENSEMBLE_FLAG_NEP should be removed
03/18 15:39:11.314Z metplus.9496b9de (config_validate.py:107) ERROR: ENSEMBLE_STAT_ENSEMBLE_FLAG_NMEP should be removed
03/18 15:39:11.314Z metplus.9496b9de (config_validate.py:107) ERROR: Please navigate to the upgrade instructions: https://metplus.readthedocs.io/en/develop/Users_Guide/release-notes.html#metplus-wrappers-upgrade-instructions
03/18 15:39:11.316Z metplus.9496b9de (run_util.py:66) ERROR: Correct configuration variables and rerun. Exiting.

The MET and METplus versions have been updated from 10.1.1 to 11.1.0 and 4.1.1 to 5.1.0, respectively. These modifications came in as part of the transition from spack-stack v1.4.1 to v1.5.0. Is it possible that these failures are due to these updated MET and METplus versions?

You can see the updated experiment directory on Hera - /scratch2/NAGAPE/epic/Michael.Lueken/expt_dirs/MET_ensemble_verification_only_vx

maddenp-noaa commented 5 months ago

@MichaelLueken as an alternative to the

(unset PYTHONPATH && uw <args>)

I suggested previously, you might like to try the slightly more concise

PYTHONPATH= uw <args>

syntax. This should have the same effect: PYTHONPATH will be temporarily set to empty only for the duration of the uw <args> command, but its value will be unchanged in the containing shell.

gsketefian commented 5 months ago

@MichaelLueken @maddenp-noaa Thanks for looking into this further. @MichaelLueken I had removed those deprecated settings from EnsembleStat.conf a few weeks ago, and things worked with the old uwtools. I'm now in the process of merging in the latest develop branch and will then test with the newest uwtools and @maddenp-noaa's suggested method of calling the templater, i.e. with the parentheses and unset PYTHONPATH along with the --search-path / option for uwtools.

gsketefian commented 5 months ago

@MichaelLueken I just merged in develop and tried rerunning on Hera but am getting an error (see below). So I decided to start from scratch and test just the develop branch by performing the usual steps: 1) Clone step:

   > git clone -b develop https://github.com/ufs-community/ufs-srweather-app.git

Cloning worked.

2) Checkout step:

   > cd ufs-srweather-app
   > ./manage_externals/checkout_externals

Checking out externals worked.

3) Build step:

   > ./devbuild.sh --platform=hera

This failed with the following error:

...
Currently Loaded Modules:
  1) sutils/default                   12) curl/8.1.2         23) netcdf-c/4.9.2          34) g2tmpl/1.10.2           45) w3nco/2.4.1
  2) intel/2022.1.2                   13) cmake/3.23.1       24) netcdf-fortran/4.6.0    35) ip/4.3.0                46) wrf-io/1.2.0
  3) stack-intel/2021.5.0             14) libjpeg/2.1.0      25) parallel-netcdf/1.12.2  36) sp/2.3.3                47) wgrib2/2.0.8
  4) impi/2022.1.2                    15) jasper/2.0.32      26) parallelio/2.5.10       37) w3emc/2.10.0            48) srw_common
  5) stack-intel-oneapi-mpi/2021.5.1  16) zlib/1.2.13        27) esmf/8.4.2              38) gftl/1.9.0              49) nccmp/1.9.0.1
  6) gettext/0.19.8.1                 17) libpng/1.6.37      28) fms/2023.01             39) gftl-shared/1.5.0       50) nco/4.9.3
  7) libxcrypt/4.4.35                 18) snappy/1.1.10      29) bacio/2.4.1             40) fargparse/1.4.2         51) build_hera_intel
  8) sqlite/3.42.0                    19) zstd/1.5.2         30) crtm-fix/2.4.0_emc      41) mapl/2.35.2-esmf-8.4.2
  9) util-linux-uuid/2.38.1           20) c-blosc/1.21.4     31) git-lfs/2.10.0          42) nemsio/2.5.4
 10) python/3.10.8                    21) pkg-config/0.27.1  32) crtm/2.4.0              43) sfcio/1.4.1
 11) stack-python/3.10.8              22) hdf5/1.14.0        33) g2/3.4.5                44) sigio/2.3.2

... Generate CMAKE configuration ...
cmake: /lib64/libc.so.6: version `GLIBC_2.25' not found (required by /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.5.0/envs/unified-env-rocky8/install/intel/2021.5.0/openssl-1.1.1u-c7lzo3w/lib/libcrypto.so.1.1)
... Compile and install executables ...
make: *** No rule to make target `install'.  Stop.

This is the same error as I get when I try to build my branch (feature/metplus_conf_templates). It seems like there's a missing library on Hera. Is this expected? Or a bug? Should I work on a different machine? Thanks.

RatkoVasic-NOAA commented 5 months ago

@gsketefian Are you logged in to a Rocky8 login node? What does your hostname say? I followed your instructions:

[100%] Built target ufs-weather-model
Install the project...
-- Install configuration: "RELEASE"
-- Installing: /scratch2/NCEPDEV/fv3-cam/Ratko.Vasic/brisdi/ufs-srweather-app/exec/ufs_srweather_app.settings
Hera:/scratch2/NCEPDEV/fv3-cam/Ratko.Vasic/brisdi/ufs-srweather-app>hostname
hfe09
gsketefian commented 5 months ago

@RatkoVasic-NOAA Here's my login node info:

$ hostname
hfe01
$ uname -a
Linux hfe01 3.10.0-1160.105.1.el7.x86_64 #1 SMP Thu Dec 7 15:39:45 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

I'm on node 1, as you can see. I guess that's not a Rocky8 node. Here's the same info on node 8 (where you tested):

$ hostname
hfe08
$ uname -a
Linux hfe08 4.18.0-477.27.1.el8_8.x86_64 #1 SMP Wed Sep 20 15:55:39 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

I guess Linux 4.18.0 corresponds to Rocky8 while 3.10.0 doesn't. Do you know which nodes are Rocky8?

MichaelLueken commented 5 months ago

@gsketefian -

hfe05 - hfe12 should be the Rocky8 front end nodes on Hera.

gsketefian commented 5 months ago

@RatkoVasic-NOAA @MichaelLueken I tried again on hfe08 and the build seems to be working. And thanks for the info on the nodes that have Rocky8. I'm not up-to-date on the latest system developments. Do we need to put something in the quick-start guide about this, or will the other fe nodes get updated soon as well?

RatkoVasic-NOAA commented 5 months ago

Soon we are going to switch completely to Rocky8, so there's no need for new instructions. Rocky8 is already the default login node. CentOS is still there in case some things haven't moved correctly to Rocky8 yet.

gsketefian commented 5 months ago

@RatkoVasic-NOAA Ok, sounds good. I always log into the same node because I use X2Go. That way I can resume my instance of X2Go. But that node happened not to be one of the Rocky8 ones. But now I know, thanks!

gsketefian commented 5 months ago

@MichaelLueken @christinaholtNOAA I think the PR is now ready for review. Here's what I did:

1) Changed the uwtools command to include the --search-path "/" option. I did not need to do anything with PYTHONPATH. The new templating command in the vx ex-scripts looks like this:

   uw template render \
     -i ${metplus_config_tmpl_fp} \
     -o ${metplus_config_fp} \
     --verbose \
     --values-file "${tmpfile}" \
     --search-path "/"

2) I removed the directory ufs-srweather-app/parm/metplus/uw_tools containing the old version of uwtools that I was using.

3) I reran all the vx WE2E tests, and they all passed. These are:

   MET_ensemble_verification
   MET_ensemble_verification_only_vx
   MET_ensemble_verification_only_vx_time_lag
   MET_ensemble_verification_winter_wx
   MET_verification
   MET_verification_only_vx
   MET_verification_winter_wx

In the past I had been running regression tests for this PR on my own; I haven't done those this time. I might still run them, but I think the PR is now ready to be reviewed and Jenkins-tested. Thanks.
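
As an aside, here is a toy sketch of what the uw template rendering step above does, using plain jinja2 in Python rather than the uw CLI. The template text and the variable names (fields, name, level) are made up for illustration; they are not the actual SRW template or settings contents:

    import yaml
    from jinja2 import Template

    # Toy stand-in for the (much larger) METplus jinja2 config templates.
    tmpl = Template(
        "FCST_VAR1_NAME = {{ fields[0].name }}\n"
        "FCST_VAR1_LEVELS = {{ fields[0].level }}\n"
    )

    # Toy stand-in for the values file passed via --values-file.
    values = yaml.safe_load("fields:\n  - {name: TMP, level: Z2}\n")

    print(tmpl.render(**values))
    # FCST_VAR1_NAME = TMP
    # FCST_VAR1_LEVELS = Z2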

MichaelLueken commented 5 months ago

@gsketefian -

Sounds good! I'll go ahead and run the verification tests as well and will provide approval once they pass.

mkavulich commented 4 months ago

@gsketefian Sorry to throw in one last comment so late but this should be easy to address: the parm/ directory is meant for "parameter" files; I know that isn't well-defined, but to me that means it shouldn't include scripts. Can we put the script file parm/metplus/separate_fcst_obs_info.py into ush/ instead? If it makes more sense, we could put it in a new ush/metplus directory.

gsketefian commented 4 months ago

> @gsketefian Sorry to throw in one last comment so late but this should be easy to address: the parm/ directory is meant for "parameter" files; I know that isn't well-defined, but to me that means it shouldn't include scripts. Can we put the script file parm/metplus/separate_fcst_obs_info.py into ush/ instead? If it makes more sense, we could put it in a new ush/metplus directory.

@mkavulich No worries, yes I'll move it.

gsketefian commented 3 months ago

@mkavulich I cleaned up and renamed the script separate_fcst_obs_info.py; it is now at ush/metplus/decouple_fcst_obs_vx_config.py. Sorry, it wasn't ready for review before, and I had forgotten to go back and clean it up. It's quite different from the previous version, so feel free to take another look. To address some of your comments:

1) I've removed all the commented-out code.

2) The print() statements are now replaced with logging.debug().

3) I now use your code snippet to parse the ..._both values, and that code is now in a separate function to avoid repetition.

4) There is now an argument to the main function (decouple_fcst_obs_vx_config) that specifies the format in which to write the output file: txt for pretty-printed text output or yaml for yaml output (a rough sketch of this switch follows below). We need txt for now because the output file is read in by the (bash) ex-scripts for some of the vx tasks (e.g. exregional_run_met_gridstat_or_pointstat_vx.sh), which place its contents in the settings variable that gets passed to the uw templater tool. Changing the output of decouple_fcst_obs_vx_config.py to yaml format would require additional changes to those ex-scripts (e.g. to also use yaml for the remaining variables in settings), and I'd rather leave that to another PR.
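
The output-format switch could look roughly like the following. This is only a hedged sketch: the function name write_vx_config and its arguments are made up for illustration and are not the actual code in decouple_fcst_obs_vx_config.py.

    import yaml
    from pprint import pformat

    def write_vx_config(vx_dict, output_fp, output_fmt="txt"):
        # Write the decoupled vx configuration dictionary to output_fp, either as
        # pretty-printed text (currently read by the bash ex-scripts) or as yaml.
        with open(output_fp, "w") as f:
            if output_fmt == "txt":
                f.write(pformat(vx_dict))
            elif output_fmt == "yaml":
                yaml.safe_dump(vx_dict, f)
            else:
                raise ValueError(f"Unsupported output format: {output_fmt}")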

A final change I made is to call decouple_fcst_obs_vx_config.py only once, in a separate workflow task (well, once for deterministic vx and once more for ensemble vx). Previously, it was being called by most of the vx tasks and generating the same output each time (and I think all those tasks were writing to the same file, which is bad). The output files from these two calls are now placed in the main experiment directory and are named vx_config_det.txt and vx_config_ens.txt.

I've tested this latest code on only one WE2E vx test (MET_ensemble_verification_only_vx). I am now running the remaining six tests and will report back here with the results.

gsketefian commented 3 months ago

@MichaelLueken @mkavulich I tested the latest on the following vx WE2E tests on Hera, and they all passed:

MET_ensemble_verification
MET_ensemble_verification_only_vx
MET_ensemble_verification_only_vx_time_lag
MET_ensemble_verification_winter_wx
MET_verification
MET_verification_only_vx
MET_verification_winter_wx

gsketefian commented 3 months ago

@MichaelLueken Sorry, I forgot to add the j-job and ex-script for the new jobs. Both jobs use the same j-job and ex-script. They should be there now.

mkavulich commented 3 months ago

@gsketefian Thanks for addressing my earlier comments. I am a bit concerned about the amount of custom code that the latest commits are adding for writing this custom text format to be read in by metplus tasks rather than just using the original yaml. Is this because of the format that METplus is expecting for config files?

MichaelLueken commented 3 months ago

@gsketefian -

Thank you very much for adding the JREGIONAL_PARSE_VX_CONFIG j-job and the exregional_parse_vx_config.sh ex-script! The verification WE2E tests are now successfully passing:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
MET_ensemble_verification_only_vx_20240501151627                   COMPLETE               0.99
MET_ensemble_verification_only_vx_time_lag_20240501151629          COMPLETE               3.75
MET_ensemble_verification_winter_wx_20240501151632                 COMPLETE             109.74
MET_verification_20240501151634                                    COMPLETE               9.82
MET_verification_only_vx_20240501151635                            COMPLETE               0.23
MET_verification_winter_wx_20240501151637                          COMPLETE              15.99
grid_SUBCONUS_Ind_3km_ics_FV3GFS_lbcs_FV3GFS_suite_WoFS_v0_202405  COMPLETE              43.25
grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GFS_v16_2024050115164  COMPLETE              19.67
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE             203.44

Now preparing to kick off Jenkins testing.

gsketefian commented 3 months ago

> @gsketefian Thanks for addressing my earlier comments. I am a bit concerned about the amount of custom code that the latest commits are adding for writing this custom text format to be read in by metplus tasks rather than just using the original yaml. Is this because of the format that METplus is expecting for config files?

@mkavulich The files being read in by decouple_fcst_obs_vx_config.py are yaml (these files are vx_config_det.yaml and vx_config_ens.yaml; I'm not sure what you mean by "the original yaml"). Their contents are in the form

FIELD_GROUP1:
    FIELD1:
        LEVEL1: list_of_thresholds
        LEVEL2: list_of_thresholds
        ...
    FIELD2:
        LEVEL1: list_of_thresholds
        LEVEL2: list_of_thresholds
        ...
    ...
...

Aside: the FIELD_GROUPs are the groups of fields whose verification we've pooled into SRW workflow tasks. For example, for the field group ADPSFC we verify a number of fields such as TMP, DPT, RH, etc., whereas for the field group REFC we verify only the single field REFC.

The parsing by decouple_fcst_obs_vx_config.py is needed because each item in this file may contain both the forecast and observation value, e.g. the key FIELD1 may have the form FIELD1_fcst%%FIELD1_obs, where %% is the delimiter string. I do this to avoid needing another level of configuration at the top that distinguishes between forecast and observation values, e.g.

fcst:
    FIELD_GROUP1_fcst:
    ...

obs:
    FIELD_GROUP1_obs:
    ...

I know yaml has the &label anchor and <<: *label merge mechanism, but even with that it seems to me things would get messy. The main difficulty is that there are cases where different forecast field names correspond to the same observation field name. That's a problem because that observation field name can only appear once as a dictionary key, so you'd have to resort to some convoluted workaround. I also like this "coupled" approach because the forecast and obs values are always right next to each other; there's no need to introduce labels, and no need to search through the config file for the obs item further down that matches a fcst item up above. To me this was a much more compact (and more intuitive) solution for this situation.
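
To make that concrete, here is a minimal sketch of how a coupled key can be split back into separate forecast and observation dictionaries. It is illustrative only (it handles only the field-name keys, and the sample entries reuse the generic FIELD/LEVEL placeholders from above); it is not the actual code in decouple_fcst_obs_vx_config.py:

    DELIM = "%%"  # delimiter between the forecast and observation values in a coupled key

    def split_fcst_obs(coupled):
        # Return (fcst, obs); if no delimiter is present, use the same value for both.
        parts = coupled.split(DELIM)
        return (parts[0], parts[1]) if len(parts) == 2 else (parts[0], parts[0])

    # Toy coupled vx config entry (not the real contents of vx_config_det.yaml).
    coupled_cfg = {
        "FIELD_GROUP1": {
            "FIELD1_fcst%%FIELD1_obs": {"LEVEL1": ["thresh1", "thresh2"]},
            "FIELD2": {"LEVEL1": ["thresh1"]},  # no delimiter: same name for fcst and obs
        },
    }

    fcst_cfg, obs_cfg = {}, {}
    for group, fields in coupled_cfg.items():
        fcst_cfg[group], obs_cfg[group] = {}, {}
        for coupled_field, levels in fields.items():
            fcst_field, obs_field = split_fcst_obs(coupled_field)
            fcst_cfg[group][fcst_field] = levels
            obs_cfg[group][obs_field] = levels

    # fcst_cfg == {"FIELD_GROUP1": {"FIELD1_fcst": ..., "FIELD2": ...}}
    # obs_cfg  == {"FIELD_GROUP1": {"FIELD1_obs":  ..., "FIELD2": ...}}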

MichaelLueken commented 3 months ago

The Jenkins tests have all successfully passed. Moving forward with merging this work now.