aidanheerdegen opened 5 years ago
Ping @kinow @marshallward
Sorry for the lack of a reply (I have been relocating), though I of course support this completely!
The only thing worth mentioning here is that GFDL machines have just dropped Moab for Slurm, and NCI is somewhat likely to adopt Slurm (with a possible PBS wrapper), so Slurm is an obvious target to look into for this.
One design goal to consider is a somewhat interactive `Scheduler` class, and how it might behave within, say, the Python shell. Currently `Experiment` is very procedural, i.e. `run()` = "do a sequence of steps and exit", and it would be good to start replacing these with objects that have some sense of state and can be controlled by a user. A `Scheduler` seems like a good place to start thinking about this.
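To make the idea concrete, here is one minimal sketch of what such a `Scheduler` hierarchy could look like. All class and method names here are hypothetical (nothing in payu is called this today), and only a couple of common resource flags are shown:

```python
import subprocess
from abc import ABC, abstractmethod


class Scheduler(ABC):
    """Hypothetical batch-scheduler interface for payu."""

    @abstractmethod
    def submit_command(self, script, ncpus, walltime):
        """Return the argv list that would submit `script`."""

    def submit(self, script, ncpus=1, walltime="01:00:00"):
        """Run the submit command and return its raw output.

        A real backend would parse the job ID out of this output and
        keep it as state, so the object can later report job status.
        """
        result = subprocess.run(
            self.submit_command(script, ncpus, walltime),
            check=True, capture_output=True, text=True,
        )
        return result.stdout.strip()


class Pbs(Scheduler):
    """PBS backend: builds a qsub invocation."""

    def submit_command(self, script, ncpus, walltime):
        return ["qsub", "-l", f"ncpus={ncpus}",
                "-l", f"walltime={walltime}", script]


class Slurm(Scheduler):
    """Slurm backend: builds an sbatch invocation."""

    def submit_command(self, script, ncpus, walltime):
        return ["sbatch", f"--ntasks={ncpus}",
                f"--time={walltime}", script]
```

The point of splitting `submit_command` from `submit` is that the command-building logic is pure and testable without a cluster, while the stateful, interactive behaviour lives in the shared base class.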
Some useful links
https://arc-ts.umich.edu/migrating-from-torque-to-slurm/
Managing HPC workflows using Apache Airflow:
https://www.astro.caltech.edu/ai19/talks/Nourbakhsh.pdf
A survey of workflow management systems
Another command execution tool with "flow" in the name, which supports Torque/PBS out of the box
Python wrapper to C PBS libraries
https://oss.trac.surfsara.nl/pbs_python
Galaxy (python bioinformatics) job runner support
https://github.com/galaxyproject/galaxy/tree/dev/lib/galaxy/jobs/runners
Distributed Resource Management Application API (DRMAA)
python bindings to DRMAA
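On the Torque-to-Slurm migration guide above: the user-facing change largely reduces to a command mapping, which payu could carry as a small table. An illustrative sketch (this subset is from common migration guides and is not exhaustive):

```python
# Common Torque/PBS commands and their rough Slurm equivalents.
PBS_TO_SLURM = {
    "qsub": "sbatch",             # submit a job script
    "qstat": "squeue",            # inspect the queue
    "qdel": "scancel",            # remove a job
    "qhold": "scontrol hold",     # hold a queued job
    "qrls": "scontrol release",   # release a held job
}


def slurm_equivalent(pbs_command):
    """Look up the Slurm counterpart of a PBS command, if known."""
    return PBS_TO_SLURM.get(pbs_command)
```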
Other Python projects that solve the same/similar issue:
Those are interesting, and very mature, projects.
As a general rule I would steer clear of anything that requires a central DB or persistent services. They sure are nice to have, but getting NCI to run that stuff is not straightforward.
The coupled models use cylc for running their models. It is a workflow engine that uses a DAG to define the work to be done, with dependencies etc. However it requires a persistent daemon, and web services, which require resources from NCI, and resources to support.
We could have transitioned to use `cylc`, but it is a lot more complex, and locks users into a certain way of working. We generally preferred the lower-overhead approach of `payu`.
I mentioned these projects because they include code to handle different types of schedulers. My suggestion would be to copy some of that code, as the licenses are compatible.
For example, here is how Fireworks handles different schedulers:
https://github.com/materialsproject/fireworks/tree/main/fireworks/user_objects/queue_adapters
and here is the corresponding code for Aiida:
https://github.com/aiidateam/aiida-core/tree/main/aiida/schedulers
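Both projects follow the same basic pattern: a generic resource request is rendered into scheduler-specific job-script directives through a per-scheduler template. A toy version of that adapter idea (the templates and keys below are my own illustration, not Fireworks or Aiida code):

```python
# Per-scheduler templates for job-script header directives.
TEMPLATES = {
    "pbs": (
        "#PBS -N {name}\n"
        "#PBS -l ncpus={ncpus}\n"
        "#PBS -l walltime={walltime}\n"
    ),
    "slurm": (
        "#SBATCH --job-name={name}\n"
        "#SBATCH --ntasks={ncpus}\n"
        "#SBATCH --time={walltime}\n"
    ),
}


def render_header(scheduler, **resources):
    """Fill the chosen scheduler's template with a generic resource dict."""
    return TEMPLATES[scheduler].format(**resources)
```

With something like this, the rest of payu only ever deals with the generic resource dictionary, and adding a new batch system means adding one template.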
I agree that requiring a database is too much for the users of payu, and I wasn't suggesting following that route. But since you mention it, the Aiida or Fireworks databases are not normally set up on the HPC clusters themselves. They are usually set up on a personal computer or on a server, because both projects allow you to submit jobs to more than one cluster at a time.
PS: thanks for mentioning cylc, I didn't know about it. Very interesting!
I think both Aiida and Fireworks look really promising! It'd be grand to have more workflow managers using the same code for batch schedulers. And I agree that adding cylc would bring some overhead (*).
I'm working on CWL & cwltool at the moment, not using any job schedulers, but will continue lurking here as I may need to use a job scheduler with Python again in another project soon.
-Bruno
(*) the only minor point is that cylc also works with cyclic graphs
Another option to avoid reinventing the wheel: https://github.com/goerz/clusterjob
It could be used as a dependency of payu.
It would be good to create a scheduler class to make it easier to support more batch queue systems than just PBS.
There was some discussion around this in a PR https://github.com/payu-org/payu/pull/181
These issues are also related:
https://github.com/payu-org/payu/issues/66
https://github.com/payu-org/payu/issues/43
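If payu did grow such a scheduler class, the backend could plausibly be selected per experiment from `config.yaml`. A hypothetical fragment (the `scheduler` key does not exist today; `queue`, `ncpus`, and `walltime` are existing payu options):

```yaml
# Hypothetical: select the scheduler backend per experiment
scheduler: slurm   # or: pbs
queue: normal
ncpus: 48
walltime: 02:00:00
```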