ufs-community / ufs-srweather-app

UFS Short-Range Weather Application

[develop]: RUN_FCST failures when using Jinja-templated values in `predef_grid_params.yaml` #1006

Open gspetro-NOAA opened 7 months ago

gspetro-NOAA commented 7 months ago

Expected behavior

See GitHub Discussion #1000 for full context. ush/predef_grid_params.yaml expects hard-coded values for its grid parameters, not Jinja-templated YAML entries ({{...}}). Experiment generation should therefore fail with an appropriate error message when a grid parameter is set to a Jinja-templated value (e.g., WRTCMP_write_tasks_per_group: '{{ LAYOUT_Y }}'). Alternatively, the code should be refactored so that ush/predef_grid_params.yaml accepts Jinja-templated values.
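As an illustration of the first option, here is a minimal sketch of a check that could run during experiment generation. It is not code from the SRW App: the function names and the `CHECKED_KEYS` set are hypothetical, and the sketch assumes that only the keys feeding the task-count arithmetic need to be hard-coded (other entries, such as WRTCMP_cen_lon, are left alone).

```python
import re

# Matches any Jinja-style expression such as '{{ LAYOUT_Y }}'.
JINJA_EXPR = re.compile(r"\{\{.*?\}\}")

# Hypothetical set of keys assumed to require hard-coded values.
CHECKED_KEYS = {
    "LAYOUT_X",
    "LAYOUT_Y",
    "WRTCMP_write_groups",
    "WRTCMP_write_tasks_per_group",
}


def find_templated_values(params, prefix=""):
    """Return dotted paths of checked keys whose value contains a Jinja expression."""
    hits = []
    for key, value in params.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            hits.extend(find_templated_values(value, prefix=f"{path}."))
        elif key in CHECKED_KEYS and isinstance(value, str) and JINJA_EXPR.search(value):
            hits.append(path)
    return hits


def check_grid_params(grid_name, params):
    """Raise ValueError if a checked parameter is Jinja-templated."""
    hits = find_templated_values(params)
    if hits:
        raise ValueError(
            f"Predefined grid '{grid_name}' needs hard-coded values for: "
            + ", ".join(hits)
        )


grid = {
    "LAYOUT_X": 5,
    "LAYOUT_Y": 2,
    "QUILTING": {
        "WRTCMP_write_groups": 1,
        "WRTCMP_write_tasks_per_group": "{{ LAYOUT_Y }}",
        "WRTCMP_cen_lon": "{{ task_make_grid.ESGgrid_LON_CTR }}",
    },
}

try:
    check_grid_params("RRFS_CONUS_25km", grid)
except ValueError as err:
    print(err)
```

With a check like this, the failure would surface at generation time with the offending key named, rather than at job submission.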

Current behavior

If the user sets WRTCMP_write_tasks_per_group: '{{ LAYOUT_Y }}', the experiment is generated, but the value of NNODES_RUN_FCST cannot be properly calculated, and the experiment fails at run_fcst with an error message similar to the following:

Submission of run_fcst_mem000 failed! qsub: directive error: -l select={{ task_run_fcst.NNODES_RUN_FCST // 1 }}:mpiprocs=128:ncpus=128

var_defns.sh in the failed SRW run shows NNODES_RUN_FCST='{{ (PE_MEMBER01 + PPN_RUN_FCST - 1) // PPN_RUN_FCST }}'. Hard-coding WRTCMP_write_tasks_per_group allows the experiment to run.
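For reference, the unrendered expression is a ceiling division of the task count by the tasks per node. The sketch below works through it for the grid in this report. Two things are assumptions rather than facts from this issue: that PE_MEMBER01 is the compute tasks (LAYOUT_X * LAYOUT_Y) plus the write-component tasks when quilting is on, and that PPN_RUN_FCST is 128 (inferred from mpiprocs=128 in the qsub error).

```python
def nnodes_run_fcst(pe_member01, ppn_run_fcst):
    """Ceiling division, as in the NNODES_RUN_FCST template."""
    return (pe_member01 + ppn_run_fcst - 1) // ppn_run_fcst


layout_x, layout_y = 5, 2
write_groups = 1
write_tasks_per_group = 2  # hard-coded; equal to LAYOUT_Y for this grid
ppn_run_fcst = 128         # assumption, taken from mpiprocs=128

# Assumption: total tasks = compute tasks + write-component tasks.
pe_member01 = layout_x * layout_y + write_groups * write_tasks_per_group

print(pe_member01)                                 # 12
print(nnodes_run_fcst(pe_member01, ppn_run_fcst))  # 1
```

Under these assumptions the job needs a single node, so the expected directive would be -l select=1:mpiprocs=128:ncpus=128.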

Machines affected

Probably all, but certainly Derecho. See GitHub Discussion #1000 for full context.

Steps To Reproduce

Define the grid in ush/predef_grid_params.yaml with a Jinja-templated WRTCMP_write_tasks_per_group:

"RRFS_CONUS_25km":
  GRID_GEN_METHOD: "ESGgrid"
  ESGgrid_LON_CTR: -97.5
  ESGgrid_LAT_CTR: 38.5
  ESGgrid_DELX: 25000.0
  ESGgrid_DELY: 25000.0
  ESGgrid_NX: 219
  ESGgrid_NY: 131
  ESGgrid_PAZI: 0.0
  ESGgrid_WIDE_HALO_WIDTH: 6
  DT_ATMOS: 150
  LAYOUT_X: 5
  LAYOUT_Y: 2
  BLOCKSIZE: 40
  QUILTING:
    WRTCMP_write_groups: 1
    WRTCMP_write_tasks_per_group: '{{ LAYOUT_Y }}'
    WRTCMP_output_grid: "lambert_conformal"
    WRTCMP_cen_lon: '{{ task_make_grid.ESGgrid_LON_CTR }}'
    WRTCMP_cen_lat: '{{ task_make_grid.ESGgrid_LAT_CTR }}'
    WRTCMP_stdlat1: '{{ task_make_grid.ESGgrid_LAT_CTR }}'
    WRTCMP_stdlat2: '{{ task_make_grid.ESGgrid_LAT_CTR }}'
    WRTCMP_nx: 217
    WRTCMP_ny: 128
    WRTCMP_lon_lwr_left: -122.719528
    WRTCMP_lat_lwr_left: 21.138123
    WRTCMP_dx: 25000.0
    WRTCMP_dy: 25000.0

After generating the experiment, the var_defns.sh file shows:

NNODES_RUN_FCST='{{ (PE_MEMBER01 + PPN_RUN_FCST - 1) // PPN_RUN_FCST }}'

and the test fails with qsub: directive error: -l select={{ task_run_fcst.NNODES_RUN_FCST // 1 }}:mpiprocs=128:ncpus=128.

To work around this behavior, hard-code WRTCMP_write_tasks_per_group to a specific value in ush/predef_grid_params.yaml.
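The hard-coding could in principle be automated as a pre-processing step. The following is only a sketch of that idea, not SRW App code and not a Jinja renderer: the function name is hypothetical, and it handles only a bare '{{ NAME }}' reference to a top-level scalar in the same grid entry. Dotted references and expressions are left untouched.

```python
import re

# Matches a value that is exactly one bare reference, e.g. '{{ LAYOUT_Y }}'.
SIMPLE_REF = re.compile(r"^\{\{\s*([A-Za-z_][A-Za-z0-9_]*)\s*\}\}$")


def hard_code_simple_refs(params, scope=None):
    """Return a copy of params with bare '{{ NAME }}' values replaced
    by the top-level scalar value of NAME, when one exists."""
    scope = params if scope is None else scope
    resolved = {}
    for key, value in params.items():
        if isinstance(value, dict):
            resolved[key] = hard_code_simple_refs(value, scope)
            continue
        match = SIMPLE_REF.match(value) if isinstance(value, str) else None
        name = match.group(1) if match else None
        if name in scope and not isinstance(scope[name], (dict, str)):
            resolved[key] = scope[name]   # substitute the hard-coded value
        else:
            resolved[key] = value         # leave everything else as-is
    return resolved


grid = {
    "LAYOUT_X": 5,
    "LAYOUT_Y": 2,
    "QUILTING": {
        "WRTCMP_write_groups": 1,
        "WRTCMP_write_tasks_per_group": "{{ LAYOUT_Y }}",
        "WRTCMP_cen_lon": "{{ task_make_grid.ESGgrid_LON_CTR }}",
    },
}

resolved = hard_code_simple_refs(grid)
print(resolved["QUILTING"]["WRTCMP_write_tasks_per_group"])
```

Whether such a step belongs in the workflow, or whether the templates should instead be rendered by the existing configuration machinery, is a design question for the fix itself.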
