payu-org / payu

A workflow management tool for numerical models on the NCI computing systems
Apache License 2.0

Create a Scheduler Class #182

Open aidanheerdegen opened 5 years ago

aidanheerdegen commented 5 years ago

It would be good to create a scheduler class to make it easier to support more batch queue systems than just PBS.

There was some discussion around this in a PR https://github.com/payu-org/payu/pull/181

These related issues are also relevant:

https://github.com/payu-org/payu/issues/66

https://github.com/payu-org/payu/issues/43
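
For illustration only, here is a minimal sketch of what such an abstract scheduler interface could look like. The class and method names are hypothetical, not existing payu code:

    import abc
    import subprocess


    class Scheduler(abc.ABC):
        """Abstract interface for a batch queue system (hypothetical sketch)."""

        @abc.abstractmethod
        def submit(self, script, **flags):
            """Submit a job script and return the scheduler's job ID."""

        @abc.abstractmethod
        def status(self, job_id):
            """Return the raw status output for a submitted job."""


    class PBSScheduler(Scheduler):
        """PBS backend wrapping the qsub/qstat command-line tools."""

        def submit(self, script, **flags):
            cmd = ['qsub']
            for key, value in flags.items():
                cmd.extend(['-' + key, str(value)])
            cmd.append(script)
            return subprocess.check_output(cmd, text=True).strip()

        def status(self, job_id):
            return subprocess.check_output(['qstat', job_id], text=True)

A Slurm backend would then only need to reimplement submit() and status() on top of sbatch/squeue, leaving the rest of payu untouched.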

aidanheerdegen commented 5 years ago

Ping @kinow @marshallward

marshallward commented 5 years ago

Sorry for the lack of reply due to my relocation, but I of course support this completely!

The only thing worth mentioning here is that the GFDL machines have just dropped Moab for Slurm, and NCI is somewhat likely to adopt Slurm (with a possible PBS wrapper), so Slurm is an obvious target to support.

One design goal to consider is a somewhat interactive Scheduler class, and how it might behave within, say, the Python shell. Currently Experiment is very procedural, i.e. run() means "do a sequence of steps and exit", and it would be good to start replacing this with objects that have some sense of state and can be controlled by a user. A Scheduler seems like a good place to start thinking about this.
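
As a purely hypothetical example of that kind of interactive, stateful use, a richer variant of the sketch above where submit() returns a job object carrying state (none of these names exist in payu today):

    >>> from payu.schedulers import PBSScheduler   # hypothetical module
    >>> sched = PBSScheduler(project='a12', queue='normal')
    >>> job = sched.submit('run_model.sh', walltime='02:00:00')
    >>> job.id
    '12345678.pbsserver'
    >>> job.state        # the object keeps state and can be re-queried
    'Q'
    >>> job.cancel()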

aidanheerdegen commented 4 years ago

Some useful links

https://arc-ts.umich.edu/migrating-from-torque-to-slurm/

https://parsl-project.org

Managing HPC workflows using Apache Airflow:

https://www.astro.caltech.edu/ai19/talks/Nourbakhsh.pdf

A survey of workflow management systems

https://dmtn-025.lsst.io

Another command execution tool with "flow" in the name, which supports Torque/PBS out of the box

https://www.nextflow.io

Python wrapper to C PBS libraries

https://oss.trac.surfsara.nl/pbs_python

Galaxy (Python bioinformatics platform) job runner support

https://github.com/galaxyproject/galaxy/tree/dev/lib/galaxy/jobs/runners

Distributed Resource Management Application API (DRMAA)

https://www.drmaa.org

Python bindings to DRMAA

https://github.com/pygridtools/drmaa-python
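
For reference, a rough sketch of submitting a job through drmaa-python, assuming a DRMAA implementation for the local scheduler is installed and configured:

    import drmaa

    # The DRMAA library talks to whichever scheduler (PBS, Slurm, SGE, ...)
    # the site has configured behind the standard API.
    with drmaa.Session() as session:
        template = session.createJobTemplate()
        template.remoteCommand = '/bin/echo'
        template.args = ['hello from DRMAA']

        job_id = session.runJob(template)
        print('Submitted job', job_id)

        # Block until the job finishes and report its exit status.
        info = session.wait(job_id, drmaa.Session.TIMEOUT_WAIT_FOREVER)
        print('Job', info.jobId, 'exited with status', info.exitStatus)

        session.deleteJobTemplate(template)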

micaeljtoliveira commented 2 years ago

Other Python projects that solve the same/similar issue:

https://github.com/materialsproject/fireworks/

https://github.com/aiidateam/aiida-core

aidanheerdegen commented 2 years ago

Those are interesting, and very mature, projects.

As a general rule I would steer clear of anything that requires a central DB or persistent services. They sure are nice to have, but getting NCI to run that stuff is not straightforward.

The coupled models use cylc to run their models. It is a workflow engine that uses a DAG to define the work to be done, with dependencies etc. However, it requires a persistent daemon and web services, which in turn require resources from NCI and resources to support.

We could have transitioned to cylc, but it is a lot more complex and locks users into a certain way of working. We generally preferred the lower-overhead approach of payu.

micaeljtoliveira commented 2 years ago

I mentioned these projects because they include code to handle different types of schedulers. My suggestion would be to copy some of that code, as the licenses are compatible.

For example, here is how Fireworks handles different schedulers:

https://github.com/materialsproject/fireworks/tree/main/fireworks/user_objects/queue_adapters

and here is the corresponding code for Aiida:

https://github.com/aiidateam/aiida-core/tree/main/aiida/schedulers

I agree that requiring a database is too much for the users of payu, and I wasn't suggesting we follow that route. But since you mention it, one does not normally set up the Aiida or Fireworks databases on the HPC clusters themselves. They are normally set up on a personal computer or on a server, because both projects allow you to submit jobs to more than one cluster at a time.

micaeljtoliveira commented 2 years ago

PS: thanks for mentioning cylc, I didn't know about it. Very interesting!

kinow commented 2 years ago

I think both Aiida and Fireworks look really promising! It'd be grand to have more workflow managers using the same code for batch schedulers. And I agree that adding cylc would bring some overhead (*).

I'm working on CWL & cwltool at the moment, not using any job schedulers, but will continue lurking here as I may need to use a job scheduler with Python again in another project soon.

-Bruno

(*) The only minor point is that cylc works with cyclic graphs.

micaeljtoliveira commented 2 years ago

Another option to avoid reinventing the wheel: https://github.com/goerz/clusterjob

It could be used as a dependency of payu.