
Syncing and NetCDF conversion for ESM1.5 #463

Open blimlim opened 1 month ago

blimlim commented 1 month ago

The NetCDF conversion script for ESM1.5 will be run as an archive stage userscript, where it is submitted to the queue as a PBS job to avoid holding up the simulation. If syncing is enabled in config.yaml, there's no guarantee that the conversion job will be finished by the time the syncing job is run. E.g. when testing several month-long simulations with syncing enabled, not all of some months' NetCDF files were copied over to the syncing directory:

<sync-dir>/output000/atmosphere/NetCDF:
aiihca.paa1jan.nc  aiihca.pea1jan.nc  aiihca.pga1jan.nc

<sync-dir>/output005/atmosphere/NetCDF:
aiihca.pea1jun.nc

From what I can understand, the collation step gets around a similar issue by calling the sync step from within the collation job:

https://github.com/payu-org/payu/blob/2391485ac4af756f45ae6d4720cd263b30940ae5/payu/subcommands/collate_cmd.py#L110-L112

but a similar approach might not be possible with the external conversion userscript. Just flagging this now so that we can discuss ideas for handling it.

jo-basevi commented 1 month ago

Postscript?

An option could be to run the archive user-script as a postscript. Postscripts are always resubmitted via qsub and are run after collation, or after archive if collation is not enabled. E.g. in config.yaml:

postscript: my_script.sh

The command is run in payu here: https://github.com/payu-org/payu/blob/e9bd1f4c4adc223e818c71f8d45c36de98025dc4/payu/experiment.py#L834-L837

If a postscript is enabled, syncing the latest output is disabled, as payu has no idea when the postscript finishes, so only previous outputs are synced (e.g. outputN where N < current_run).

Specific environment variables, such as the current output directory, could be passed using -v to export them to the PBS job, e.g.

postscript: -v PAYU_CURRENT_OUTPUT_DIR my_script.sh

Sync Userscript?

The sync job has a userscript option which runs before any rsync commands. This means that nothing will be synced if the userscript fails. The sync job runs on copyq by default, but the queue and resources can be configured; see the payu postprocessing documentation. However, this may not work for this ESM1.5 script, as it might need to run every time, not just when sync is configured?

Adding Dependency Logic to Payu?

Using a PBS job dependency with -W depend=... in the sync call could be an improvement, as it would allow the latest outputs to be synced. This would require getting the run ID and then passing it to the sync submit code.

Getting the run_id from a postscript call would be straightforward, as that should be the only value returned. On the other hand, finding a run_id in userscript output might be more difficult unless it was assumed to be the last value in the output, and we would need to check that it's a valid run_id.

Passing it to the sync call would require adding an argument to the payu sync command. The -W depend=... could then be added to the pbs_flags in pbs_config: https://github.com/payu-org/payu/blob/e9bd1f4c4adc223e818c71f8d45c36de98025dc4/payu/subcommands/sync_cmd.py#L61

Or add it somewhere in pbs_config, and then build the -W depend= flag in the PBS-specific submit(): https://github.com/payu-org/payu/blob/e9bd1f4c4adc223e818c71f8d45c36de98025dc4/payu/schedulers/pbs.py#L26
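A very rough sketch of that flow (hypothetical function names, not existing payu code; it assumes qsub prints just the job ID, e.g. 123456789.gadi-pbs, on success):

import re
import shlex
import subprocess as sp

def submit_postscript(postscript_cmd):
    """Submit the postscript via qsub and return its PBS job ID (or None)."""
    output = sp.check_output(shlex.split('qsub ' + postscript_cmd), text=True)
    job_id = output.strip().splitlines()[-1]
    # A Gadi job ID looks like '123456789.gadi-pbs'
    return job_id if re.fullmatch(r'\d+\.\S+', job_id) else None

def add_sync_dependency(pbs_flags, job_id):
    """Append a -W depend flag so the sync job waits on the postscript job."""
    if job_id is None:
        return pbs_flags
    return pbs_flags + ' -W depend=afterany:{}'.format(job_id)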

One thing I am not too sure about is what dependency type to use: afterok:<run_id> will only run after the job has terminated with no errors, while afterany:<run_id> will run after the job has terminated regardless of its exit status.

I think a big issue could be that if a dependency is not met, the job stays in the queue indefinitely. If afterok:<run_id> is used and <run_id> is a job that exits with an error, then the dependency condition will never be met. So dependency types might have to be after/before rather than afterok/beforeok, to prevent jobs being held indefinitely. I can't see a good way of preventing the sync job from running if the prerequisite PBS job exits with an error.

There are also caveats in the NCI docs for job dependencies. Say job2 was submitted with afterok:<job1>, and at the time of submission job1 had already finished successfully. The job scheduler then has no knowledge of job1, so job2 would be held. The docs suggest submitting job2 first with depend=on:1, then submitting job1 with depend=beforeok:<job2>. My initial thought is that I don't want to submit a sync job first with depend=on, in case there are errors before the other jobs are even submitted and the sync job is left stuck on the queue. There's also a suggestion in this openpbs forum post to hold job1, submit job2 with afterok, then release job1.
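For reference, a minimal sketch of that hold-and-release workaround as subprocess calls (the job script names are placeholders):

import subprocess as sp

# Submit job1 in a user-held state so it can't finish before job2 is submitted
job1 = sp.check_output(['qsub', '-h', 'postscript_job.sh'], text=True).strip()

# Submit job2 with its dependency while job1 is still known to the scheduler
job2 = sp.check_output(
    ['qsub', '-W', 'depend=afterok:' + job1, 'sync_job.sh'],
    text=True).strip()

# Release job1 so the pair can run in order
sp.check_call(['qrls', job1])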

I think it'll be more straightforward to implement dependency logic with a postscript rather than with userscripts. As the postscript is submitted in code just before a sync job is submitted, it's unlikely that the postscript would complete before sync is submitted. Userscripts would need logic to check whether a qsub call was even run, then extract the run ID, and by the time sync is called (which could be in a collate job) would have to check whether any userscript qsub run IDs are still queued or running, and if so add a depend= to the sync submission.
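To illustrate the extra bookkeeping the userscript route would need, a sketch of the kind of check that would have to run before sync is submitted (assuming PBS Pro's qstat, where a finished job eventually disappears from the default qstat output):

import subprocess as sp

def job_still_active(job_id):
    """Return True if qstat reports the job as queued, held or running."""
    result = sp.run(['qstat', '-f', job_id], capture_output=True, text=True)
    if result.returncode != 0:
        # qstat no longer knows the job; adding depend= now could hold sync forever
        return False
    for line in result.stdout.splitlines():
        if line.strip().startswith('job_state'):
            return line.split('=')[-1].strip() in ('Q', 'H', 'R')
    return False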

Hope this makes sense... I have only really learnt about configuring PBS dependencies today, so I could well be wrong.

blimlim commented 1 month ago

Thanks for writing these explanations up, these look like great ideas.

An option could be to run the archive user-script as a postscript

I think this sounds like a good option, especially if it makes it viable down the line to add in dependency logic.

(sync userscript) However, this may not work for this ESM1.5 script, as it might need to run every time, not just when sync is configured?

We'll want the ESM1.5 NetCDF conversion to run both when syncing is and isn't enabled, so I agree that this option wouldn't work in every case.

I think a big issue could be that if a dependency is not met, the job stays in the queue indefinitely. If afterok:<run_id> is used and <run_id> is a job that exits with an error, then the dependency condition will never be met. So dependency types might have to be after/before rather than afterok/beforeok, to prevent jobs being held indefinitely. I can't see a good way of preventing the sync job from running if the prerequisite PBS job exits with an error.

I found the online documentation about the different dependency types a bit unclear about how they actually behave. I ran a quick test on gadi for the afterok condition, setting off a job which would fail after waiting 200 seconds:

$ qsub fail_job.sh
121578665.gadi-pbs

fail_job.sh
----------------------
#PBS -l ncpus=1 
...
sleep 200s
exit 1

and then a dependent job which just prints out a message:

$  qsub -W depend=afterok:121578665 dependent-job.sh 

The dependent job was held in the queue:


gadi-pbs: 
                                                                 Req'd  Req'd   Elap
Job ID               Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
-------------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
121578665.gadi-pbs   sw6175   normal-* fail-job.* 24520*   1   1  1024m 00:05 R 00:00
121578679.gadi-pbs   sw6175   normal-* dependent*    --    1   1  1024m 00:05 H   -- 

but it looks like it was deleted once the first job failed:

$ qstat -u sw6175
$

and so I'm wondering if this condition could work for holding the sync jobs? We'd probably need some way of reporting to the user if the sync job didn't run though, but I'm not sure of any good ways to do that.

I think it'll be more straightforward to implement dependency logic with a postscript rather than with userscripts. As the postscript is submitted in code just before a sync job is submitted, it's unlikely that the postscript would complete before sync is submitted. Userscripts would need logic to check whether a qsub call was even run, then extract the run ID, and by the time sync is called (which could be in a collate job) would have to check whether any userscript qsub run IDs are still queued or running, and if so add a depend= to the sync submission.

This makes a lot of sense. The postscript option sounds a lot easier and so I'll test out running the conversion as a postscript.

blimlim commented 1 month ago

I've just tried running the conversion as a postscript. It looks like it works well when the storage and project settings are hardcoded into the job submission script and set_userscript_env_vars() is called first in the postprocess step. We'll want to use the project and storage values from the simulation for the conversion job though, which I think might require the shell to be allowed when submitting the postscript.

E.g. adding the needs_subprocess_shell check to the postprocess step:

    def postprocess(self):
        """Submit any postprocessing scripts or remote syncing if enabled"""
        self.set_userscript_env_vars()
        # First submit postprocessing script
        if self.postscript:
            envmod.setup()
            envmod.module('load', 'pbs')

            cmd = 'qsub {script}'.format(script=self.postscript)

            if needs_subprocess_shell(cmd):
                # Shell syntax in the postscript (e.g. ${PROJECT}) needs a shell
                sp.check_call(cmd, shell=True)
            else:
                sp.check_call(shlex.split(cmd))
...

This lets the following run OK:

postscript: -v PAYU_CURRENT_OUTPUT_DIR -P ${PROJECT} -lstorage=${PBS_NCI_STORAGE}+gdata/access+gdata/hh5  ./scripts/NetCDF-conversion/UM_conversion_job.sh

Would these changes be ok to add? There could be a better way of passing the project and storage values to the postscript which I might have missed though.

jo-basevi commented 1 month ago

so I'm wondering if this (afterok) condition could work for holding the sync jobs? We'd probably need some way of reporting to the user if the sync job didn't run though, but I'm not sure of any good ways to do that.

Oh that's awesome! There are log files that are created if sync is run, but that requires the user to check that those sync logs exist. Sync could still use after, and if a postscript was configured, maybe read the postscript logs for an exit status and abort with an error message. I guess it still has a similar issue in that the only way you know a sync failed is by reading the log files. So in a way, not having sync log files with afterok actually makes it more visible that something went wrong? Maybe there should be a payu command that parses the log files from the various run/collate/postprocess/sync jobs and displays a graph of their exit status values.

Would these changes be ok to add? There could be a better way of passing the project and storage values to the postscript which I might have missed though.

Yay, that postscript almost works! I reckon those changes will be good to add.

aidanheerdegen commented 1 month ago

Great discussion of the issues!

It's good that postscript works, but I'm not sold on having this in a released config:

postscript: -v PAYU_CURRENT_OUTPUT_DIR -P ${PROJECT} -lstorage=${PBS_NCI_STORAGE}+gdata/access+gdata/hh5  ./scripts/NetCDF-conversion/UM_conversion_job.sh

That is some code spaghetti.

An option that has not been suggested so far is to add a collate method for the UM driver that does the um2nc conversion. It's not collation in the true sense, but it is a post-processing operation that needs to be done at a very similar time in the model life-cycle.
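A minimal sketch of the idea (the class name and the call are illustrative only, not the actual payu driver code; it assumes PAYU_CURRENT_OUTPUT_DIR has been set, e.g. by set_userscript_env_vars(), and that the conversion is available as the esm1p5_convert_nc command used elsewhere in this thread):

import os
import shlex
import subprocess as sp

class UnifiedModel:
    """Stand-in for payu's UM model driver."""

    def collate(self):
        # Not collation in the true sense: run the um2nc conversion over the
        # current output directory at the collate stage instead.
        output_dir = os.environ['PAYU_CURRENT_OUTPUT_DIR']
        sp.check_call(shlex.split('esm1p5_convert_nc ' + output_dir))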

We should definitely discuss before trying to implement it, but thought I'd chuck it in as a possibility.

jo-basevi commented 1 month ago

That is a good point. A similar option to having it run in the collate job would be to add a collate-stage userscript to payu?

aidanheerdegen commented 2 weeks ago

After discussing, we think it can be reduced to this:

postscript: -V -lstorage=${PBS_NCI_STORAGE} ./scripts/NetCDF-conversion/UM_conversion_job.sh

Maybe payu could also have an option to automatically add -V -lstorage=${PBS_NCI_STORAGE} as that seems like a pretty common and useful thing: to access all the same projects as the calling job, and have access to the conda environment payu itself is running in.
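A sketch of how payu could build that automatically when submitting the postscript (the config key here is made up; PBS_NCI_STORAGE is the environment variable already used in the config above):

import os

def default_postscript_flags(config):
    """Prepend -V and the calling job's storage flag when enabled in config."""
    if not config.get('postscript_inherit_env', False):  # hypothetical key
        return ''
    storage = os.environ.get('PBS_NCI_STORAGE')
    return '-V -lstorage={}'.format(storage) if storage else '-V'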

aidanheerdegen commented 2 weeks ago

I'm actually starting to think we should turn the delete option on by default, because most users wouldn't really have the ability to do detailed checks, and then the instructions for turning it off are clearer because it just means commenting out an option that is already there.

Otherwise we risk users naively running and filling the disk with a large amount of useless data.

If we didn't delete them, would the UM fields files be synced if syncing was turned on? I think they probably would, because we're not specifying any wildcards to ignore (@jo-basevi can advise on how to do this).

jo-basevi commented 2 weeks ago

Maybe payu could also have an option to automatically add -V -lstorage=${PBS_NCI_STORAGE} as that seems like a pretty common and useful thing: to access all the same projects as the calling job, and have access to the conda environment payu itself is running in.

Sorry, I think I made a mistake yesterday: it would not have access to the conda environment. During the run PBS job, payu is run using the path to the python executable and the path to where payu is installed:

qsub .... /g/data/vk83/prerelease/apps/payu/dev/bin/python /g/data/vk83/prerelease/apps/payu/dev/bin/payu-run

As it never actually loads the payu environment in the payu-run call, the environment variables would not be set.

We could add the path to the python executable to the environment of the postscript call, similar to this PR: https://github.com/payu-org/payu/pull/491. In a way it could be useful to load the payu module in the PBS payu-run job to set the environment variables, but I am unsure of a clean way to do that without specifying a payu modulepath and filename in config.yaml.
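For example, something along these lines before the postscript qsub call (a sketch only; the PAYU_PATH variable name is an assumption, chosen to match the -v PAYU_PATH suggestion further down this thread):

import os
import sys

def export_payu_path():
    """Expose the bin directory of the running python/payu install so a
    postscript submitted with -v PAYU_PATH can find the same executables."""
    os.environ['PAYU_PATH'] = os.path.dirname(sys.executable)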

blimlim commented 2 weeks ago

Some related updates on using the -V flag with qsub: it doesn't seem to pass forward the whole environment, especially the PATH variables. I tried loading payu on the login node, running a job with the -V flag, and comparing the two environments:

Outside of PBS job:

PATH=/g/data/vk83/prerelease/apps/payu/dev/bin:/home/565/sw6175/perl5/bin:/home/565/sw6175/.local/bin:/home/565/sw6175/bin:/opt/pbs/default/bin:/opt/nci/bin:/opt/bin:/opt/Modules/v4.3.0/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/pbs/default/bin

MODULEPATH=/g/data/vk83/prerelease/modules:/etc/scl/modulefiles:/apps/Modules/restricted-modulefiles/matlab_monash:/opt/Modules/modulefiles:/opt/Modules/v4.3.0/modulefiles:/apps/Modules/modulefiles

Inside the PBS job:

PATH=/home/565/sw6175/perl5/bin:/opt/pbs/default/bin:/opt/nci/bin:/opt/bin:/opt/Modules/v4.3.0/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/local/pbs/bin

MODULEPATH=/etc/scl/modulefiles:/apps/Modules/restricted-modulefiles/matlab_monash:/opt/Modules/modulefiles:/opt/Modules/v4.3.0/modulefiles:/apps/Modules/modulefiles

When the -V option is included, it seems to interfere with loading the payu module in the PBS job. I tested two job scripts, one which uses/loads the payu module and one which doesn't, both with and without the -V flag:

load_payu.sh
---------------
#!/bin/bash

module list
module use /g/data/vk83/prerelease/modules
module load payu
which payu

echo "myvar: $myvar"
esm1p5_convert_nc -h

noload_payu.sh
----------------
#!/bin/bash
module list
which payu
echo "myvar: $myvar"

esm1p5_convert_nc -h

STDERR files in each case are below. qsub -V load_payu.sh:

Currently Loaded Modulefiles:
 1) pbs   2) payu/dev  
/usr/bin/which: no payu in (/home/565/sw6175/perl5/bin:/opt/Modules/v4.3.0/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/local/pbs/bin)
/local/spool/pbs/mom_priv/jobs/123535948.gadi-pbs.SC: line 14: esm1p5_convert_nc: command not found

qsub -V noload_payu.sh:

Currently Loaded Modulefiles:
 1) pbs   2) payu/dev  
/usr/bin/which: no payu in (/home/565/sw6175/perl5/bin:/opt/Modules/v4.3.0/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/local/pbs/bin)
/local/spool/pbs/mom_priv/jobs/123535949.gadi-pbs.SC: line 12: esm1p5_convert_nc: command not found

The only case that worked was omitting the -V flag and using/loading payu within the job script. qsub load_payu.sh STDOUT:

/g/data/vk83/prerelease/apps/payu/dev/bin/payu
myvar: 
usage: esm1p5_convert_nc [-h] [--quiet] [--delete-ff] current_output_dir

positional arguments:
  current_output_dir  ESM1.5 output directory to be converted
...
blimlim commented 2 weeks ago

I thought one way to simplify the script call would be to move PAYU_CURRENT_OUTPUT_DIR to a command-line argument for the job script, and add a second argument with the --delete-ff flag. I.e. in config.yaml:

postscript: -lstorage=${PBS_NCI_STORAGE}   ./scripts/NetCDF-conversion/UM_conversion_job.sh $PAYU_CURRENT_OUTPUT_DIR  # --delete-ff

and then in scripts/NetCDF-conversion/UM_conversion_job.sh:

module use /g/data/vk83/prerelease/modules
module load payu

esm1p5_convert_nc "$@" 

which would have the benefit of controlling the --delete-ff flag directly in config.yaml.

However, this gives a qsub usage error:

usage: qsub ...
    [-v variable_list] [-V ] [-z] [script | -- command [arg1 ...]]
       qsub --version

Not sure if I'm interpreting this correctly, but does it look like command line arguments are only allowed when submitting a command instead of a script?

If we had the payu module loaded into the payu-run job, do you think it would be possible to omit the scripts/NetCDF-conversion/UM_conversion_job.sh script from the configuration completely, and instead call the conversion via

postscript: -lstorage=${PBS_NCI_STORAGE}   --  esm1p5_convert_nc $PAYU_CURRENT_OUTPUT_DIR  # --delete-ff

With the above, we might have to add the walltime and memory settings and so I'm not sure what's simplest!

I might be incorrect about quite a bit of this though!

jo-basevi commented 2 weeks ago

Some related updates on using the -V flag with qsub: it doesn't seem to pass forward the whole environment, especially the PATH variables. I tried loading payu on the login node, running a job with the -V flag

That's really interesting; maybe module-loading payu as part of payu-run isn't the best idea.

If we had the payu module loaded into the payu-run job, do you think it would be possible to omit the scripts/NetCDF-conversion/UM_conversion_job.sh script from the configuration completely?

I guess one way to test it would be to add payu to the modules loaded in config.yaml, e.g.

modules:
  use: 
     - /g/data/vk83/prerelease/modules
  load:
    - payu/dev

Though I don't think having to specify a payu version is a great idea.

Another way to avoid loading the payu module in the postscript could be to pass -v PAYU_PATH to the script (which should be /g/data/vk83/prerelease/apps/payu/dev/bin for payu/dev). So to run the script, use:

 $PAYU_PATH/python $PAYU_PATH/esm1p5_convert_nc
aidanheerdegen commented 2 weeks ago

This PR also kind of solves the same problem. @blimlim is manually testing it. If it works we should merge and test that.

https://github.com/payu-org/payu/pull/491