glotzerlab / signac-flow

Workflow management for signac-managed data spaces.
https://signac.io/
BSD 3-Clause "New" or "Revised" License

Return scheduler job ids from FlowProject.submit #543

Open bdice opened 3 years ago

bdice commented 3 years ago

Feature description

Requested by user @salazardetroya: https://signac.slack.com/archives/CVC04S9TN/p1623794700095400

Whenever I submit a job with sbatch ... I typically obtain the job ID as output. I'd like to obtain that job ID using signac-flow.

This would enable complex submission workflows through something like the following snippet:

from flow import FlowProject

class Project(FlowProject):
    pass

project = Project()
scheduler_job_ids = project.submit(...)

# Wait until the last of the previous jobs has completed
more_job_ids = project.submit(..., after=scheduler_job_ids[-1])

Proposed solution

We used to (partially) support this kind of behavior for PBS/Torque clusters but we did not implement it for SLURM. If we choose to support this feature, we would need to implement it for all schedulers so that we have a consistent API. See here for the past implementation (removed in 0.12): https://github.com/glotzerlab/signac-flow/blob/29afbe3748019abd6a220a0b177e0ee1e853e8e6/flow/scheduling/torque.py#L149-L155

One possible issue with this approach is that not all clusters may behave the same way. Some clusters might print other messages or info to stdout/stderr that would break the parsing.
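
To make the parsing concern concrete, here is a minimal sketch of a parser that tolerates extra lines around the job id. It assumes the usual sbatch message "Submitted batch job <id>"; the function name parse_sbatch_job_id is hypothetical and not part of signac-flow.

```python
import re
from typing import Optional

# Assumption: sbatch reports success with a line starting "Submitted batch job <id>".
# Clusters that customize this message would need a different pattern.
_SBATCH_JOB_ID = re.compile(r"^Submitted batch job (\d+)", re.MULTILINE)


def parse_sbatch_job_id(output: str) -> Optional[str]:
    """Return the job id found in sbatch output, or None if no id line is present."""
    match = _SBATCH_JOB_ID.search(output)
    return match.group(1) if match else None
```

Searching line by line rather than matching the whole output means that site-specific banners or warnings printed before the id line do not break the parse, and returning None lets the caller decide how to handle output it does not recognize.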

The return value of the scheduler class (the part I linked above) would need to be forwarded through a series of calling functions to the return value of FlowProject.submit. I think it might be appropriate to return a list of job ids as strings, since FlowProject.submit can call sbatch (or a different scheduler command) multiple times.

To add this feature, here are the steps I would suggest:

  1. Make the internal function _call_submit return the captured output. This applies to all schedulers. https://github.com/glotzerlab/signac-flow/blob/9d4f1b459a1ef484852e040691da78c3ba7dee32/flow/scheduling/base.py#L162
  2. Parse the output and extract the scheduler job id according to each scheduler class. Here's the line to edit for the SLURM scheduler. Ask for help if you need someone else to test schedulers for which you don't have access to a test cluster. https://github.com/glotzerlab/signac-flow/blob/9d4f1b459a1ef484852e040691da78c3ba7dee32/flow/scheduling/slurm.py#L150
  3. Change the behavior of the ComputeEnvironment class to pass through the captured scheduler job id if submission occurs (instead of JobStatus.submitted, which could be inferred by the calling functions) and None if submission didn't run or failed. https://github.com/glotzerlab/signac-flow/blob/9d4f1b459a1ef484852e040691da78c3ba7dee32/flow/environment.py#L215-L217
  4. Refactor FlowProject._submit_operations to pass through scheduler job ids, just like in the previous step. https://github.com/glotzerlab/signac-flow/blob/9d4f1b459a1ef484852e040691da78c3ba7dee32/flow/project.py#L3691-L3693
  5. Finally, change the behavior of FlowProject.submit to return job ids (and continue to update the job/operation status on success, as interpreted by the result of the above method calls). https://github.com/glotzerlab/signac-flow/blob/9d4f1b459a1ef484852e040691da78c3ba7dee32/flow/project.py#L3782-L3791
  6. Test on a system with a scheduler.
  7. Update docs.
  8. Decide whether the FlowProject CLI (python project.py submit) should print the ids returned by the FlowProject.submit method.
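
Steps 1 through 5 could be sketched roughly as below. This is not the signac-flow implementation: call_submit, submit_script, and submit_all are hypothetical stand-ins for _call_submit, the environment/scheduler layer, and FlowProject.submit, and the sketch assumes the scheduler command reads the script from stdin.

```python
import subprocess
from typing import Callable, List, Optional, Sequence


def call_submit(cmd: Sequence[str], script: str) -> str:
    """Step 1: run the scheduler command and return its captured stdout."""
    result = subprocess.run(
        cmd, input=script, capture_output=True, text=True, check=True
    )
    return result.stdout


def submit_script(
    script: str,
    parse: Callable[[str], Optional[str]],
    runner: Callable[[Sequence[str], str], str] = call_submit,
    cmd: Sequence[str] = ("sbatch",),
) -> Optional[str]:
    """Steps 2-3: return the parsed job id, or None if submission failed."""
    try:
        output = runner(cmd, script)
    except (subprocess.CalledProcessError, OSError):
        return None
    return parse(output)


def submit_all(
    scripts: Sequence[str],
    parse: Callable[[str], Optional[str]],
    runner: Callable[[Sequence[str], str], str] = call_submit,
) -> List[str]:
    """Steps 4-5: collect one job id per successful scheduler call."""
    job_ids = []
    for script in scripts:
        job_id = submit_script(script, parse, runner)
        if job_id is not None:
            # A non-None id doubles as the "submitted" signal for status updates.
            job_ids.append(job_id)
    return job_ids
```

Passing the runner as an argument makes the pass-through testable without a cluster: a fake runner that returns canned scheduler output exercises the same code path that a real sbatch call would.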

Additional context

Another alternative would be to just return the raw captured stdout and leave it to the user to parse that information. In that case, FlowProject.submit would return a list of strings, each containing the raw output of one call to sbatch (instead of a list of strings of parsed job ids).
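
With the raw-output alternative, the parsing would live in user code, where it can be adapted to each cluster. A minimal sketch, assuming the user knows their scheduler's output format; job_ids_from_raw is a hypothetical helper and the patterns are examples only.

```python
import re
from typing import List, Optional


def job_ids_from_raw(raw_outputs: List[str], pattern: str) -> List[Optional[str]]:
    """Apply a user-chosen regex (with one capture group) to each raw output.

    Entries are None where the pattern did not match.
    """
    compiled = re.compile(pattern)
    ids = []
    for output in raw_outputs:
        match = compiled.search(output)
        ids.append(match.group(1) if match else None)
    return ids


# Hypothetical raw outputs, as FlowProject.submit might return them.
raw = ["Submitted batch job 7\n", "unexpected banner text\n"]
print(job_ids_from_raw(raw, r"Submitted batch job (\d+)"))
```

The trade-off is that every user has to write this kind of pattern themselves, whereas parsing inside each scheduler class would give one consistent return type across clusters.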

vyasr commented 3 years ago

3 is partially related