ufs-community / ufs-weather-model

UFS Weather Model
Other
129 stars 238 forks source link

Rerunning regression test by submitting job_card in the run directory fails #2247

Closed DusanJovic-NOAA closed 2 weeks ago

DusanJovic-NOAA commented 3 weeks ago

Description

During the development and/or debugging I often need to keep rerunning the same test over and over again. The simplest way of doing this is to run the desired regression test once, save the run directory, move that directory to somewhere where it will not be purged (noscrub directory for example) and then submit the job card from that run directory. This workflow is currently broken after the last PR that cleaned up rt.sh.

To Reproduce:

  1. Run one regression test, for example ./rt.sh -n 'control_p8 intel' -k
  2. Move the run directory to different location, mv ../rt_xxxxxxxx/control_P8_intel /new/location/control_p8_intel
  3. Go to the run directory, cd /new/location/control_p8_intel
  4. Submit the job sbatch job_card

This will first fail with the error: MACHINE_ID: unbound variable. This happens because MACHINE_ID=hera is commented out in the job_card.

https://github.com/ufs-community/ufs-weather-model/blob/develop/tests/fv3_conf/fv3_slurm.IN_hera#L17-L23

After uncommenting that line and resubmitting job card it fails again with:

++ date +%s
+ date_s_start=1713801710
+ echo -n 1713801710,
+ set +x
Lmod has detected the following error: The following module(s) are unknown:
"modules.fv3"

Please check the spelling or version number. Also try "module spider ..."
It is also possible your cache file is out-of-date; it may help to try:
  $ module --ignore_cache load "modules.fv3"

Also make sure that all modulefiles written in TCL start with the string
#%Module

This error happens because in the job_card the module use location is hard-coded to the original run directory:

MACHINE_ID=hera
source ./module-setup.sh
module use "/scratch1/NCEPDEV/stmp2/Dusan.Jovic/FV3_RT/rt_1691993/control_p8_intel/modulefiles"
module load modules.fv3
module list

I am not submitting this job from that original location, I moved the entire run directory to a different location. I see these job_card template changes are made only on Hera, all other job cards are untouched. Why?

Additional context

Output

BrianCurtis-NOAA commented 3 weeks ago

@DusanJovic-NOAA looks like I made a change from $PWD/modulefiles to @[PWD]/modulefiles which should not have happened. If you change fv3_slurm.IN_hera to use $ instead of @[] for line 20 I think that would solve this issue. I have to fix an issue using -l and -n together too that I will try to bring this fix into.

BrianCurtis-NOAA commented 3 weeks ago

@DusanJovic-NOAA try this please: https://github.com/BrianCurtis-NOAA/ufs-weather-model/tree/fix_rtsh_bugs

DusanJovic-NOAA commented 3 weeks ago

Ok, that seems to work. Why is MACHINE_ID exportred, previously it wasn't? And it isn't in the other job card templates. Also srun like now looks like:

srun --label -n "160" ./fv3.exe

Why is number of tasks (-n option) quoted string, it should be an integer. In the job card template , this line:

srun --label -n @[TASKS] ./fv3.exe

has been changed to:

srun --label -n "@[TASKS]" ./fv3.exe
DusanJovic-NOAA commented 3 weeks ago

OMP_NUM_THREADS is now also quoted string, it should be an integer:

export OMP_NUM_THREADS="@[THRD]"

Why did you quote all these at-parse-placeholders? Only in hera templates, in all other templates they are as before.

BrianCurtis-NOAA commented 3 weeks ago

Actually, thanks for bringing that up. The quoted strings using the at-parse variables do not need the quotes, you're right. The hera job card is a test for switching all of those fv3/compile IN files to bash. Having one in development helps me see what mistakes this might be. Bash likes quoted variables but with using the at-parse it's not needed as bash doesn't see it as anything.

I'll get those quotes removed in that branch.

BrianCurtis-NOAA commented 3 weeks ago

Ok, that seems to work. Why is MACHINE_ID exportred, previously it wasn't? And it isn't in the other job card templates.

Bash likes variables that are not used directly in the script to be exported to make sure they are used externally properly. If the export is meaningless and the variable is not used in the script, then it should be removed.

DusanJovic-NOAA commented 3 weeks ago

Unrelated to the job card template issues, but now by default build.sh runs the make command with VERBOSE option turned on. And I can not turn it off now. The following change in build.sh, from:

OMP_NUM_THREADS=1 make -j ${BUILD_JOBS:-4} VERBOSE=${BUILD_VERBOSE:-}

to:

OMP_NUM_THREADS=1 make -j "${BUILD_JOBS:-4}" "VERBOSE=${BUILD_VERBOSE:-1}"

broke that. Please remove that 1 at the end, by default BUILD_VERBOSE should be empty.

And again -j make option accepts integers not quoted strings (BUILD_JOBS).

DusanJovic-NOAA commented 3 weeks ago

Ok, that seems to work. Why is MACHINE_ID exportred, previously it wasn't? And it isn't in the other job card templates.

Bash likes variables that are not used directly in the script to be exported to make sure they are used externally properly. If the export is meaningless and the variable is not used in the script, then it should be removed.

But in this case MACHINE_ID is used in this script, it is used by module-setup.sh, which is sourced. So MACHINE_ID does not have to be exported. Please remove export. After all it isn't exported in any other job card template, and everything works without any issues.

BrianCurtis-NOAA commented 3 weeks ago

I've removed the quoted at-parse variables. I've removed the BUILD_VERBOSE extra 1 make -j option does accept quoted strings For MACHINE_ID being exported in the job_card: What shellcheck expects is that if a variable is set in a shell script file, that it's also used in that same shell script file. It seems logical to me to export set variables into the environment if the script does not use that variable, and it's what shellcheck wants to see. This is consistent now across all shell scripts that shellcheck lints. It seems good practice. I am open to potential issues that might arise by doing this, if you have any.

DusanJovic-NOAA commented 3 weeks ago

Then why is MACHINE_ID not exported in other job card templates?

BrianCurtis-NOAA commented 3 weeks ago

Then why is MACHINE_ID not exported in other job card templates?

I haven't converted those to shellcheck, yet. Hera was a test to see if it had issues with the move from /bin/sh to /bin/bash and the changes that shellcheck wanted. I tested as much as I could on Hera but was not sure with using the at-parse if it would have any negative impacts for other projects. With Hera getting the most usage externally, it seemed a great place to introduce the change. The intent is to convert them all over if Hera does not see major issues.

BrianCurtis-NOAA commented 2 weeks ago

These were fixed in #2241