ufs-community / ufs-srweather-app

UFS Short-Range Weather Application

Wrapper scripts need improvement (they do not work properly on every machine) #909

Closed: RatkoVasic-NOAA closed this issue 10 months ago

RatkoVasic-NOAA commented 1 year ago

Expected behavior

Wrapper scripts are very important for running individual parts of the SRW workflow without invoking a workflow manager such as Rocoto or ecFlow. They are very useful for debugging purposes and should be maintained regularly.

Current behavior

On some machines the wrapper scripts fail at various places.

Machines affected

Not tested everywhere, but we have seen problems on Gaea and Hera.

Steps To Reproduce

  1. Clone ufs-srweather-app, check out the external dependencies, and compile.
  2. Create a new ush/config.yaml file for your experiment.
  3. Run launch_FV3LAM_wflow.sh to create the run directory.
  4. Copy the wrapper scripts to the run directory.
  5. Run the individual scripts: ./run_make_grid.sh, ./run_get_ics.sh, ./run_get_lbcs.sh, ./run_make_orog.sh, ./run_make_sfc_climo.sh, etc.
  6. Follow the documentation (which also needs to be updated). A rough shell sketch of these steps is given below.
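
A rough shell sketch of steps 1-5, following the Orion walk-through later in this thread (the platform name and experiment path are placeholders, and the wflow module and conda environment must be loaded first):

./devbuild.sh -p=<machine>                    # step 1: build after cloning and checking out externals
cp ush/config.community.yaml ush/config.yaml  # step 2: start from the community default, then edit
cd ush && ./generate_FV3LAM_wflow.py          # step 3: generate the experiment (run) directory
cp wrappers/*.sh $EXPTDIR/.                   # step 4: copy the wrapper scripts into it
cd $EXPTDIR && ./run_make_grid.sh             # step 5: run the individual task scripts in order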
natalie-perlin commented 1 year ago

@christinaholtNOAA @EdwardSnyder-NOAA - your help is really needed! How do we run the SRW using the wrapper scripts when starting from a default community case? For example, on Orion:

What needs to be configured by the user?

cd ./ufs-srweather-app
./devbuild.sh -p=orion
cd ./ush
cp -pv config.community.yaml config.yaml
vim config.yaml                    # set MACHINE: orion and ACCOUNT: epic
vim machine/orion.yaml      # EXTRN_MDL_DATA_STORES: disk aws nomads
# Any changes in ./parm/wflow or ./ush/predef_grid_param.yaml?
cd ../
module use $PWD/modulefiles
module load wflow_orion
conda activate workflow_tools
./generate_FV3LAM_wflow.py
export EXPTDIR=<EXPTDIR>  # as indicated after generating workflow
cp ./wrappers/*.sh $EXPTDIR/.
cd $EXPTDIR
module load build_orion_intel
./run_make_grid.sh

What are the next steps needed to submit jobs and prepare job scripts? Which tasks require job scripts, and where can the information on a task's job requirements be found? (run_get_ics and run_get_lbcs fail as they are, but they do not require job scripts.)

EdwardSnyder-NOAA commented 1 year ago

I was able to run the SRW wrapper scripts on the default community case on various RDHPCS machines using the wrapper script PR. The only file I changed was config.yaml; please see the 'Description of Changes' in my PR for the variables that were edited. The other files you mentioned were not modified, and the rest of your steps check out. The one thing that is missing is that these scripts need to run on a compute node. So before you run module use or module load, do salloc -N 1 -A epic -n 40 -t 00:30:00 -q batch on Orion. Running on a compute node eliminates the need for a job card.
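
On Orion, for example, the sequence might look like the following (the account, task count, and walltime are taken from the salloc line above; the repository path and experiment directory are placeholders, and the compute node is reached by ssh as described further down):

salloc -N 1 -A epic -n 40 -t 00:30:00 -q batch   # request an interactive allocation
ssh <allocated-node>                             # log into the node that salloc reports
module use /path/to/ufs-srweather-app/modulefiles
module load wflow_orion
conda activate workflow_tools
cd $EXPTDIR
module load build_orion_intel
./run_make_grid.sh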

natalie-perlin commented 1 year ago

@EdwardSnyder-NOAA - thank you for your comments. The tasks run_get_ics.sh and run_get_lbcs.sh are still failing. How do users find out which tasks need to be submitted to a compute node and which can be run as plain scripts?

EdwardSnyder-NOAA commented 1 year ago

Can you send me the path to the experiment you are running, so I can take a look at the log files?

All tasks run on a compute node. This is done by allocating a compute node via 'salloc' and then ssh-ing into it. Using a compute node is mentioned in the "Attention" section of the wrapper scripts documentation, which is something I forgot to mention during the meeting yesterday.

The wrapper scripts were tested by running them by hand on a compute node; no job cards were used. It is possible to add job cards, though. See wrapper_srw_ftest.sh in my PR for guidance.
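
If a batch submission is preferred over an interactive node, a minimal Slurm job card wrapping one of the scripts might look like the sketch below (the account, QOS, task count, and walltime are assumptions and must match your site and the task's actual requirements; wrapper_srw_ftest.sh in the PR is the authoritative example):

#!/bin/bash
#SBATCH --job-name=run_make_grid
#SBATCH --account=epic
#SBATCH --qos=batch
#SBATCH --nodes=1
#SBATCH --ntasks=40
#SBATCH --time=00:30:00
cd /path/to/EXPTDIR          # experiment directory holding the copied wrapper scripts
./run_make_grid.sh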

natalie-perlin commented 1 year ago

@EdwardSnyder-NOAA - The directory where the workflow was generated is /work/noaa/epic/nperlin/SRW/ufs-srweather-app/ush, and the experiment directory is /work/noaa/epic/nperlin/SRW/expt_dirs/test_community.

EdwardSnyder-NOAA commented 1 year ago

The 'disk' option doesn't need to be added to the EXTRN_MDL_DATA_STORES variable in the machine file. The retrieve script will check the disk if you provide the USE_USER_STAGED_EXTRN_FILES and EXTRN_MDL_SOURCE_BASEDIR_ICS/LBCS variables. See the suggestions below.

Add the lines that start with (**) to your config.yaml file before generating the experiment.

task_get_extrn_ics:
  EXTRN_MDL_NAME_ICS: FV3GFS
  FV3GFS_FILE_FMT_ICS: grib2
  **USE_USER_STAGED_EXTRN_FILES: true
  **EXTRN_MDL_SOURCE_BASEDIR_ICS: /work/noaa/epic/role-epic/contrib/UFS_SRW_data/develop/input_model_data/FV3GFS/grib2/2019061518
task_get_extrn_lbcs:
  EXTRN_MDL_NAME_LBCS: FV3GFS
  LBC_SPEC_INTVL_HRS: 6
  FV3GFS_FILE_FMT_LBCS: grib2
  **USE_USER_STAGED_EXTRN_FILES: true
  **EXTRN_MDL_SOURCE_BASEDIR_LBCS: /work/noaa/epic/role-epic/contrib/UFS_SRW_data/develop/input_model_data/FV3GFS/grib2/2019061518

Or you can add them to the current experiment by updating the variables marked (**) in the var_defn.sh file in your current experiment's working directory (EXPTDIR).

# [task_get_extrn_ics]
EXTRN_MDL_NAME_ICS='FV3GFS'
EXTRN_MDL_ICS_OFFSET_HRS='0'
FV3GFS_FILE_FMT_ICS='grib2'
EXTRN_MDL_SYSBASEDIR_ICS=''
**USE_USER_STAGED_EXTRN_FILES='TRUE'
**EXTRN_MDL_SOURCE_BASEDIR_ICS='/work/noaa/epic/role-epic/contrib/UFS_SRW_data/develop/input_model_data/FV3GFS/grib2/2019061518'
EXTRN_MDL_FILES_ICS=''

# [task_get_extrn_lbcs]
EXTRN_MDL_NAME_LBCS='FV3GFS'
LBC_SPEC_INTVL_HRS='6'
EXTRN_MDL_LBCS_OFFSET_HRS='0'
FV3GFS_FILE_FMT_LBCS='grib2'
LBCS_SEARCH_HRS='6'
EXTRN_MDL_LBCS_SEARCH_OFFSET_HRS='0'
EXTRN_MDL_SYSBASEDIR_LBCS=''
**USE_USER_STAGED_EXTRN_FILES='TRUE'
**EXTRN_MDL_SOURCE_BASEDIR_LBCS='/work/noaa/epic/role-epic/contrib/UFS_SRW_data/develop/input_model_data/FV3GFS/grib2/2019061518'
EXTRN_MDL_FILES_LBCS=''
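
If you prefer to make those edits from the command line, something like the following would update the staged-data variables in place (a sketch only; check the variable lines in your experiment's var_defn.sh before running it):

sed -i "s|^USE_USER_STAGED_EXTRN_FILES=.*|USE_USER_STAGED_EXTRN_FILES='TRUE'|" $EXPTDIR/var_defn.sh
sed -i "s|^EXTRN_MDL_SOURCE_BASEDIR_ICS=.*|EXTRN_MDL_SOURCE_BASEDIR_ICS='/work/noaa/epic/role-epic/contrib/UFS_SRW_data/develop/input_model_data/FV3GFS/grib2/2019061518'|" $EXPTDIR/var_defn.sh
sed -i "s|^EXTRN_MDL_SOURCE_BASEDIR_LBCS=.*|EXTRN_MDL_SOURCE_BASEDIR_LBCS='/work/noaa/epic/role-epic/contrib/UFS_SRW_data/develop/input_model_data/FV3GFS/grib2/2019061518'|" $EXPTDIR/var_defn.sh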
natalie-perlin commented 1 year ago

This issue may become a larger task of documenting the cases and the different ways these wrappers could potentially be used. In their present form, these scripts do not add much to the user's experience.

The need for individual submission scripts could be addressed by having the workflow generate the job cards and store them in the experiment directory. Currently, these job cards are generated on the fly as temporary files that are used to submit tasks via Rocoto.

A couple of suggestions on how to approach this:

Any of these steps would make it possible to retire the scripts in the ./ush/wrappers/ directory and instead generate them dynamically, with all the system requirements and task configuration that were already determined at an earlier stage.