Open pflarr opened 6 years ago
Looking at 3 main points of interaction with the scheduler plugins:
1. After having collected the inputs and populated all of the other variables, the main program provides a desired partition, state, minimum and maximum number of nodes, minimum and maximum number of processors per node, and whether the job needs to run immediately or can wait. An exception is thrown if not enough nodes are available. The maximum number of nodes requested can be 'all'. The checks are performed by collecting the data provided by `scontrol show node` for all nodes and iterating through that information. If all of the checks pass, a tuple of the number of nodes and the number of processors per node to use in the batch script is returned.
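A rough Python sketch of how such a check might work for slurm follows. The function names, the state filter, and the simplified `scontrol` parsing are all illustrative assumptions, not the actual plugin interface:

```python
import subprocess

def _scontrol_nodes():
    """Yield a dict of key=value fields for each node reported by
    `scontrol show node`. Simplified: real output can contain values
    with spaces and many more fields."""
    out = subprocess.check_output(['scontrol', 'show', 'node'],
                                  universal_newlines=True)
    for record in out.split('\n\n'):
        fields = dict(item.split('=', 1) for item in record.split()
                      if '=' in item)
        if fields:
            yield fields

def check_request(partition, min_nodes, max_nodes, min_ppn, max_ppn):
    """Return a (num_nodes, procs_per_node) tuple to use in the batch
    script, or raise if the request cannot be satisfied."""
    usable = []
    for node in _scontrol_nodes():
        if partition not in node.get('Partitions', ''):
            continue
        # Simplified state check; real slurm states are more varied.
        if node.get('State', '').rstrip('*') not in ('IDLE', 'MIXED'):
            continue
        cpus = int(node.get('CPUTot', 0))
        if cpus >= min_ppn:
            usable.append(cpus)

    if max_nodes == 'all':
        max_nodes = len(usable)
    if len(usable) < min_nodes:
        raise RuntimeError(
            "Not enough usable nodes in partition '{}'.".format(partition))

    usable.sort(reverse=True)
    num_nodes = min(len(usable), max_nodes)
    ppn = min(max_ppn, min(usable[:num_nodes]))
    return num_nodes, ppn
```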
2. When the main program has decided what resources it wants to request, the partition, reservation, qos, account, number of nodes, processors per node, and time limit are passed to this function, and it returns a list of strings that should follow immediately after the shebang to specify the resources. The main program can then use that list to compose the submission script.
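For slurm those header lines would presumably be `#SBATCH` directives. A minimal sketch, assuming a function name and defaults that are purely illustrative:

```python
def get_script_headers(partition=None, reservation=None, qos=None,
                       account=None, num_nodes=1, procs_per_node=1,
                       time_limit=None):
    """Return a list of '#SBATCH ...' lines to place immediately after
    the shebang in the submission script (slurm-flavored sketch)."""
    lines = ['#SBATCH -N {}'.format(num_nodes),
             '#SBATCH --ntasks-per-node={}'.format(procs_per_node)]
    if partition:
        lines.append('#SBATCH -p {}'.format(partition))
    if reservation:
        lines.append('#SBATCH --reservation={}'.format(reservation))
    if qos:
        lines.append('#SBATCH --qos={}'.format(qos))
    if account:
        lines.append('#SBATCH --account={}'.format(account))
    if time_limit:
        lines.append('#SBATCH --time={}'.format(time_limit))
    return lines
```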
3. Finally, when the main program has written the script and is ready to submit it, the class can return the submission invocation (e.g. `sbatch` for slurm). The call is returned as a single string to pass to the subprocess call.
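Per scheduler, this could be as simple as the following sketch (the function name is illustrative):

```python
def get_submit_command(script_path):
    """Return the submission invocation as a single string to hand to a
    subprocess call, e.g. 'sbatch /path/to/kickoff.sh' for slurm."""
    return 'sbatch {}'.format(script_path)
```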
These are the main parts that differentiate one scheduler from another, and therefore what the scheduler plugin is responsible for. If further functions are required for querying the queue and the status of the job, those should be fleshed out as well.
Another commit has changed this slightly.
*3.* The scheduler class and subclasses now have a `submit_job` function that takes a path to the submission script and submits the job to the scheduler. It also returns the job ID.
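A sketch of what such a `submit_job` could look like for slurm, assuming slurm's usual "Submitted batch job <id>" output and omitting error handling:

```python
import subprocess

def submit_job(script_path):
    """Submit the script with sbatch and return the job ID as a string."""
    output = subprocess.check_output(['sbatch', script_path],
                                     universal_newlines=True)
    # sbatch normally prints: 'Submitted batch job 123456'
    return output.strip().split()[-1]
```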
Scheduler Plugins
The process of running and scheduling jobs is as follows. Steps that actually involve the scheduler are in bold:
For schedulers that actually schedule jobs on a cluster, the kickoff script is expected to run on an allocation sized to the largest test it expects to run. The tests themselves should run on pieces of that allocation scheduled within itself. This may not be possible on all schedulers, but is for slurm (and probably Moab). The kickoff script does the following:
a. Issues the `pav do_build` command for the test.
b. Issues the `pav do_run` command to run the test.
The test status is updated (viewable via the `pav status` command) before and after each step.
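Putting the pieces together, the main program might compose the kickoff script roughly as in this sketch. The function name, the bash shebang, and the `<test_id>` argument handling are assumptions for illustration; the header lines would come from the function described in point 2 above:

```python
def write_kickoff_script(path, header_lines, test_id):
    """Compose a kickoff script: shebang, the scheduler header lines,
    then the pav commands to run inside the allocation."""
    with open(path, 'w') as script:
        script.write('#!/bin/bash\n')
        for line in header_lines:
            script.write(line + '\n')
        # Build and then run the test within the allocation; the test
        # status is updated before and after each step (see `pav status`).
        script.write('pav do_build {}\n'.format(test_id))
        script.write('pav do_run {}\n'.format(test_id))
```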
The `pav do_run` command does the following.