payu-org / payu

A workflow management tool for numerical models on the NCI computing systems
Apache License 2.0
18 stars 25 forks source link

Proto-scheduler class, pre-support for slurm #271

Closed marshallward closed 4 years ago

marshallward commented 4 years ago

This is a very minimal implementation of a Scheduler class, in order to handle scheduler-specific operations.

It includes a basic Factory for controlling the Scheduler instantiation, and a single method for returning the submission command line.

I do not expect things to keep working like this, it is only meant to be the most minimal implementation to support both PBS and Slurm.

Hopefully more changes to come.

pep8speaks commented 4 years ago

Hello @marshallward! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! :beers:

Comment last updated at 2020-06-17 03:25:28 UTC
marshallward commented 4 years ago

I have not yet tested this on Gadi, so it may fail spectacularly on there. Will try to get it going...

coveralls commented 4 years ago

Coverage Status

Coverage decreased (-0.8%) to 41.274% when pulling 98de47d4c0e9f35b9853cf8421b0bf0d7a400e6b on proto_scheduler into bb72250baa6411cfc48053a26dec5a2f1c4c9630 on master.

marshallward commented 4 years ago

Already found one bug:

 539                 #model_prog.append('-np {0}'.format(model_ncpus))                                                                                                                                       
 540                 model_prog.append('-n {0}'.format(model_ncpus))                 

(or maybe that works? mpirun appears to support both...)

aidanheerdegen commented 4 years ago

I don't understand why coveralls says coverage reduced by 42%.

A few thoughts:

  1. With the site specific hacks required for both GFDL and NCI perhaps it would be best to create clean scheduler classes and put the site specific stuff into some site/platform class. It could contain methods to modify the mpirun command, as well as queue information maybe?

  2. This is creating code that I can't test, as I don't have access to a slurm machine, and is more difficult for you. Might be time for me to resuscitate my Pawsey account ...

  3. Some of the recent code reorganisation was to make it easier to write tests for the code. So for example have a functions that returns a command string that can be checked for correctness rather than submitting that directly. In case it wasn't obvious why I'd made those changes.

marshallward commented 4 years ago
  1. I don't understand either what's going on with code coverage, but will try to see what's happening. (So much has changed, I don't recognize the output much anymore...)

  2. I agree, we should try to nail down these site-specific hacks. An NCI scheduler which subclasses PBS is not a bad idea IMO, for example. (Maybe revive the old "ANUPBS" name?)

    For now, I think it's probably OK to just say "PBS" is synonymous with "NCI" and "Slurm" is synonymous with "GFDL".

  3. I agree (again) and think the safest thing to do is for me not to merge anything until you've tested it out, or at least signed off on it. And no need to return the favor; at this time I'm the only GFDL user.

  4. Yeah, the testing changes are a bit confusing to me at the moment (see first point), but hopefully I'll get my head around them soon.

aidanheerdegen commented 4 years ago

I have an idea for making the site specific stuff a little tidier. How about moving to a system where all arguments to stuff like mpirun and qsub is a dict or ordereddict. Then that can be returned, and checked, really easily, and also modified in a very modular way. You can delete and replace options for your MPICH wrapper, I can add storage options for our PBS system.

Rather than parsing stupid effing stringsI think this could be a winner.

marshallward commented 4 years ago

I agree with much of this, and consider most of those string checks to be stubs.

I think that a lot of the proposed structure will emerge as I get the GFDL version working correctly.