payu-org / payu

A workflow management tool for numerical models on the NCI computing systems
Apache License 2.0

Should we change the runspersub wall time logic? #479

Open aidanheerdegen opened 3 months ago

aidanheerdegen commented 3 months ago

Currently the runspersub option requires the user to make a compensating change to the requested walltime, so that all of the runs can complete within a single PBS submission.

This has been a source of confusion in the past.

ACCESS-NRI is working up configurations for ACCESS-ESM1.5, and this model has a maximum run-time of 1 year. However, it is a low-resolution ESM that typically requires very long runs to equilibrate slow carbon cycling. It would be convenient to set runspersub: 20 to minimise PBS queue time and avoid a proliferation of PBS logs.

However, this would mean the default configuration would have a PBS walltime of 48 hours. For users doing short test runs, such a long request would slow their movement through the queue.

The proposal is to alter the logic so that walltime is set by the user to reflect how long it takes for a single run of the model. Then runspersub and the number of runs requested could be used to modify the requested walltime to make sure the job can complete (basically submit_walltime = min(runs, runspersub) * walltime).

A nice feature of this is that runspersub can be left set to a large number: however many runs a user selects, the submitted walltime is adjusted to match, up to a maximum of runspersub * walltime.
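To make the proposed arithmetic concrete, here is a minimal Python sketch. It is not payu's implementation: the function names are hypothetical, walltimes are assumed to be "HH:MM:SS" strings, and the 02:24:00 per-run figure is illustrative, chosen so that 20 runs gives the 48 hours mentioned above.

```python
from datetime import timedelta


def parse_walltime(text):
    """Parse an 'HH:MM:SS' string into a timedelta (assumed format)."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def format_walltime(delta):
    """Format a timedelta as 'HH:MM:SS'; hours may exceed 24."""
    total = int(delta.total_seconds())
    hours, remainder = divmod(total, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


def submit_walltime(walltime, runs, runspersub):
    """Proposed logic: scale the single-run walltime by the number of
    runs that will actually execute in this submission."""
    runs_in_submission = min(runs, runspersub)
    return format_walltime(parse_walltime(walltime) * runs_in_submission)


# Hypothetical per-run walltime of 02:24:00 with runspersub: 20
print(submit_walltime("02:24:00", runs=3, runspersub=20))   # short test
print(submit_walltime("02:24:00", runs=50, runspersub=20))  # capped at 20 runs
```

With these numbers a three-run test requests 07:12:00 rather than the full 48:00:00, which is the queue-friendliness the proposal is after.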

Clearly this would require informative messages to let users know how the PBS submission is being altered.

There is some precedent here in the way payu pads CPU requests to be a multiple of nodes, or sets memory limits if no memory is set.

If backwards compatibility was required, or if it was clearer for users, there could be a new config option runtime which is then used to calculate walltime if walltime isn't specified.

aidanheerdegen commented 3 months ago

> If backwards compatibility was required, or if it was clearer for users, there could be a new config option runtime which is then used to calculate walltime if walltime isn't specified.

If we did this, it might make sense to have runtime and walltime mutually exclusive, so a user sets either one or the other. With the default of runspersub: 1, both would have identical practical outcomes.
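A rough sketch of how mutually exclusive options might be resolved is below. This is an assumption-laden illustration, not payu code: the function name is hypothetical, the config is a plain dict, and times are given as integer seconds to keep the example short.

```python
def resolve_walltime(config, runs):
    """Return the walltime (in seconds) to request from PBS.

    Hypothetical rule: 'walltime' is used verbatim (current behaviour),
    'runtime' is scaled by the runs in this submission, and setting
    both is an error.
    """
    has_runtime = "runtime" in config
    has_walltime = "walltime" in config

    if has_runtime and has_walltime:
        raise ValueError("Set either 'runtime' or 'walltime', not both")

    if has_walltime:
        return config["walltime"]

    if has_runtime:
        runspersub = config.get("runspersub", 1)
        return min(runs, runspersub) * config["runtime"]

    return None  # neither set: leave it to whatever default applies


# With runspersub: 1 the two options give the same request
print(resolve_walltime({"runtime": 3600, "runspersub": 1}, runs=5))
print(resolve_walltime({"walltime": 3600, "runspersub": 1}, runs=5))
```

Raising an error when both are present is only one possible design; an alternative would be to let walltime silently take priority, at the cost of being less obvious to the user.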

Calculating the final walltime for users still requires them to be aware of the maximum walltime of the queue they're using. If they modify the model config so that a single run takes longer, they would need to change both runtime and runspersub, otherwise they may exceed the maximum walltime of the queue, which would waste a lot of resources.

If maxwalltime were defined in the platform config, with the known queue defaults set in payu, then users would only need to change runtime, and payu could check that the current settings are consistent.
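The consistency check could look something like the following sketch. The function name, the use of integer seconds, and the 48-hour limit are assumptions for illustration; real queue limits would come from the platform config.

```python
def check_queue_limit(runtime, runspersub, maxwalltime):
    """Check that runspersub runs of length runtime fit in the queue.

    All times are in seconds. Returns the maximum walltime a
    submission could request, or raises ValueError with a suggestion.
    """
    if runtime <= 0:
        raise ValueError("runtime must be positive")

    requested = runtime * runspersub
    if requested > maxwalltime:
        # Largest number of runs that still fits in the queue limit
        largest_fit = maxwalltime // runtime
        raise ValueError(
            f"runspersub={runspersub} with runtime={runtime}s needs "
            f"{requested}s, but the queue allows {maxwalltime}s; "
            f"at most runspersub={largest_fit} would fit"
        )
    return requested


MAX_WALLTIME = 48 * 3600  # assumed 48-hour queue limit

# 02:24:00 per run (8640 s) with 20 runs exactly fills 48 hours
print(check_queue_limit(8640, 20, MAX_WALLTIME))

# A slower 3-hour run no longer fits 20 times
try:
    check_queue_limit(3 * 3600, 20, MAX_WALLTIME)
except ValueError as err:
    print(err)
```

Failing at setup or submit time, before any job is queued, is what avoids the wasted resources described above.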