lanl / BEE

Pause running workflows when stopping beeflow with 'beeflow core stop' #830

Open pagrubel opened 2 months ago

pagrubel commented 2 months ago

This PR pauses any running or waiting workflows when stopping beeflow. Tasks/jobs that are running or pending when beeflow is stopped will be updated, if possible, once beeflow starts up again. Addresses #783

pagrubel commented 2 months ago

I need to do some overnight testing on this branch.

pagrubel commented 1 month ago

To test this for CLI (on darwin) you may want two screens both in the poetry env for this branch:

To test for slurmrestd:

pagrubel commented 1 month ago

Adding this thought here to implement later: the output from "sacct -j " is in column form, where the job_id is a row heading and Job State is a column heading. Those headings should be used to locate the state, instead of its position in the output, in case another column or row is added in the future.
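
A minimal sketch of that idea, not the code in this branch. It assumes the output is a fixed-width table with a header line followed by a line of dashes, and that the relevant headers are named `JobID` and `State`. The function names `parse_sacct_table` and `job_state` and the sample table are hypothetical, made up for illustration.

```python
import re


def parse_sacct_table(output):
    """Parse a fixed-width table into a list of dicts keyed by column header."""
    lines = [ln for ln in output.splitlines() if ln.strip()]
    if len(lines) < 2:
        raise ValueError("expected a header line and a separator line")
    header, separator, rows = lines[0], lines[1], lines[2:]
    # Each run of dashes in the separator line marks one column's character span.
    spans = [(m.start(), m.end()) for m in re.finditer(r"-+", separator)]
    names = [header[start:end].strip() for start, end in spans]
    return [
        {name: row[start:end].strip() for name, (start, end) in zip(names, spans)}
        for row in rows
    ]


def job_state(output, job_id):
    """Return the State cell of the row whose JobID cell equals job_id."""
    for row in parse_sacct_table(output):
        if row.get("JobID") == str(job_id):
            return row["State"]
    raise KeyError(f"job {job_id} not found in output")


# Made-up sample table for illustration only.
SAMPLE = "\n".join([
    "       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode",
    "------------ ---------- ---------- ---------- ---------- ---------- --------",
    "1234             my_job      debug       acct          1    RUNNING      0:0",
    "1234.batch        batch                  acct          1    RUNNING      0:0",
])

state = job_state(SAMPLE, 1234)
```

Because the lookup goes through the header names, an extra column or an extra row (such as the `.batch` step) does not change which cell is read. If a delimiter-based format is preferred, sacct's `--parsable2` option could be parsed with a plain split on `|` instead of character spans.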

pagrubel commented 4 weeks ago

@jtronge What happens if sacct throws an exception with job_state? I was hoping to get it out of the queue but not sure it will be.

pagrubel commented 3 weeks ago

@jtronge I tested this on the prod system and it works. The only additional feature we may want to add at some point is for the user to be able to resume all paused workflows with one command.

jtronge commented 3 weeks ago

> @jtronge What happens if sacct throws an exception with job_state? I was hoping to get it out of the queue but not sure it will be.

Oh sorry, just saw your question. I think this depends on what the calling code is doing. The TM background code could try to catch WorkerError and remove the job in that case.
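
A rough sketch of what that could look like, under assumptions: the `WorkerError` class below is a local stand-in rather than BEE's own, and `update_jobs`, `job_queue`, and `query_state` are hypothetical names, not the actual TM background code.

```python
class WorkerError(Exception):
    """Stand-in for the error raised when a job state query fails."""


def update_jobs(job_queue, query_state):
    """Refresh each job's state; drop jobs whose state cannot be queried.

    job_queue maps job id -> last known state (hypothetical structure).
    Returns the list of job ids that were removed.
    """
    removed = []
    # Iterate over a copy of the keys so entries can be deleted safely.
    for job_id in list(job_queue):
        try:
            job_queue[job_id] = query_state(job_id)
        except WorkerError:
            del job_queue[job_id]
            removed.append(job_id)
    return removed


def fake_query(job_id):
    """Pretend backend: the query for job 2 fails, all others complete."""
    if job_id == 2:
        raise WorkerError("state query failed")
    return "COMPLETED"


queue = {1: "RUNNING", 2: "RUNNING"}
removed = update_jobs(queue, fake_query)
```

Passing the query function in as an argument is one way to exercise the failure path without a real scheduler, since a fake that raises `WorkerError` can stand in for a failing sacct call.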

pagrubel commented 3 weeks ago

> > @jtronge What happens if sacct throws an exception with job_state? I was hoping to get it out of the queue but not sure it will be.
>
> Oh sorry, just saw your question. I think this depends on what the calling code is doing. The TM background code could try to catch WorkerError and remove the job in that case.

Should we put WIP back on to do this? I'm not sure there is any way to test it.