lanl / BEE

Task Manager: Cancelling a workflow should kill all associated slurm jobs #689

Open rstyd opened 1 year ago

rstyd commented 1 year ago

Our initial plan was that cancelling a workflow would leave all of its tasks running on the task manager side. This requires much less work on our part, but it can cause problems for users. For example, a long-running task might eat up the time they have available on a partition, and a workflow with many small independent tasks is annoying to stop one task at a time. Additionally, a job can fail without the user being aware of it (e.g. one of its tasks fails), and it could take the user quite a while to notice and manually cancel the Slurm jobs.

I propose we modify the workflow manager to add a cancel-workflow endpoint. It would remove all jobs currently in the submit queue for the specified workflow and call the Slurm worker's cancel_task function on each task in the submit queue that belongs to the cancelled workflow.
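A minimal sketch of that cancel logic, for discussion only. Apart from cancel_task, every name here (Task, FakeSlurmWorker, cancel_workflow, the list-based submit queue) is a hypothetical stand-in and not BEE's actual code; the fake worker only records which tasks it was asked to cancel instead of talking to Slurm.

```python
from dataclasses import dataclass


@dataclass
class Task:
    # Hypothetical task record: just enough to tie a task to its workflow.
    task_id: str
    workflow_id: str


class FakeSlurmWorker:
    """Stand-in for the Slurm worker; it only records what was cancelled."""

    def __init__(self):
        self.cancelled = []

    def cancel_task(self, task_id):
        # A real worker would ask Slurm to cancel the job here.
        self.cancelled.append(task_id)


def cancel_workflow(workflow_id, submit_queue, worker):
    """Remove a workflow's tasks from submit_queue and cancel each one.

    Returns the IDs of the tasks that were cancelled, in queue order.
    """
    to_cancel = [t for t in submit_queue if t.workflow_id == workflow_id]
    # Rebuild the list in place so other references to the queue see the change.
    submit_queue[:] = [t for t in submit_queue if t.workflow_id != workflow_id]
    for task in to_cancel:
        worker.cancel_task(task.task_id)
    return [t.task_id for t in to_cancel]


# Example: two workflows share one submit queue; only wf-a is cancelled.
queue = [Task("t1", "wf-a"), Task("t2", "wf-b"), Task("t3", "wf-a")]
worker = FakeSlurmWorker()
cancelled = cancel_workflow("wf-a", queue, worker)
```

Tasks from other workflows stay in the queue untouched, which is the property the endpoint would need to guarantee.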

pagrubel commented 1 year ago

@rstyd We should give an option for the user to let running jobs continue.

pagrubel commented 9 months ago

We need a definite plan for cancelling workflows. Right now, if a workflow is cancelled, its jobs continue to run, but the states remain at whatever point they were at when the workflow was cancelled. If we allow jobs to continue, we need a workflow "Cancelling" state that lasts until they complete, and we should probably archive the workflow afterwards, since some of it may have completed.
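One way the "Cancelling" state could behave, as a rough sketch. The state names other than "Cancelling", the Workflow class, and its methods are all assumptions made for illustration and do not reflect BEE's real workflow states or API.

```python
from enum import Enum


class WorkflowState(Enum):
    # Hypothetical states; only "Cancelling" comes from the discussion above.
    RUNNING = "Running"
    CANCELLING = "Cancelling"
    ARCHIVED = "Archived"


class Workflow:
    def __init__(self, running_tasks):
        self.state = WorkflowState.RUNNING
        self.running_tasks = set(running_tasks)

    def cancel(self):
        # Jobs are allowed to continue, so the workflow only starts cancelling.
        self.state = WorkflowState.CANCELLING
        self._settle()

    def task_finished(self, task_id):
        self.running_tasks.discard(task_id)
        self._settle()

    def _settle(self):
        # Archive once a cancelling workflow has no running tasks left.
        if self.state is WorkflowState.CANCELLING and not self.running_tasks:
            self.state = WorkflowState.ARCHIVED


wf = Workflow(["t1", "t2"])
wf.cancel()
state_after_cancel = wf.state
wf.task_finished("t1")
wf.task_finished("t2")
state_after_drain = wf.state
```

The workflow stays in "Cancelling" while any task is still running and is archived only after the last one finishes, so partial results are kept.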