payu-org / payu

A workflow management tool for numerical models on the NCI computing systems
Apache License 2.0
18 stars 25 forks source link

Propagating command line flags PBS is unnecessarily complex #258

Open aidanheerdegen opened 4 years ago

aidanheerdegen commented 4 years ago

The bug fixed by PR #257 prompted me to think about how command line arguments are propagated to PBS jobs.

Currently command line arguments must converted into environment variables, exported to the submitted command, and then checked again in the called process.

This could be dispensed with as command line arguments can be passed directly on the command line with qsub.

There might still be some requirement for environment variables to pass information between processes, but it would be cleaner if all command line arguments are just propagated into the payu-run call (and payu-collate and payu-profile).

Not sure how this plays with slurm though.

marshallward commented 4 years ago

I agree that passing environment variables is bulky and volatile. It is easy to get this wrong: pass too many or two few and major chaos can ensue.

It's hard for me to remember the reasoning here, but I recall having very few options. It could be that those qsub flags did not exist at the time, for example.

I think we ought to look more into this, it may be a platform-dependent solution as you mention. At the very least, this "encode/decode" probably should be unseen by the rest of the code.

aidanheerdegen commented 4 years ago

I think the ability to pass command line arguments post dates the development of payu, but not sure when it came in TBH.

It looks like sbatch (slurm) does support it

https://stackoverflow.com/a/46328955

To be specific, I am suggesting that calls like this one:

https://github.com/payu-org/payu/blob/master/payu/subcommands/run_cmd.py#L105

also pass in the command line arguments and they are literally appended to payu-run command.

The current method sets up different code path/variable setting depending how it is called. This makes it trickier to test as well.

Given sbatch appears to support this I'll work up a PR, but it would be good if you could confirm that your slurm install will work @marshallward, using a simple test script. I used this on gadi:

qsub -q express -l wd -l ncpus=1 -l mem=1gb -l walltime=00:01:00 -- ls -la
marshallward commented 4 years ago

Don't know how I missed this message... don't worry about slurm for now, I've stepped away from that goal for the moment.