Open pagrubel opened 2 months ago
I need to do some overnight testing on this branch.
To test this for CLI (on darwin) you may want two screens both in the poetry env for this branch:
- `watch -n 2 query <wf_id>` to make sure the task is pending or running. It is helpful to do this in a separate screen and keep it up even when you stop beeflow, or when you restart it.
- `squeue -u <username>` or the task manager log
- `beeflow core stop` while clamr step is running
- `watch -n 5 show job <job_id>` Wait until this gives a no job type error.
- `beeflow core start`. The query screen will show that clamr has completed.
- `beeflow resume <wf_id>`
To test for slurmrestd:
- `watch -n 2 query <wf_id>` to make sure the task is pending or running. It is helpful to do this in a separate screen and keep it up even when you stop beeflow, or when you restart it.
- `squeue -u <username>` or the task manager log
- `beeflow core stop` while clamr step is running
- `watch -n 5 squeue -u <userid>` Wait until the clamr job is off the screen.
- `beeflow core start`. The query screen will show that clamr has completed.
- `beeflow resume <wf_id>`
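The "wait until the clamr job is off the screen" step could also be scripted instead of watched by hand. The following is only a rough sketch: `slurm_job_ids` and `wait_until_job_gone` are hypothetical helper names, not part of beeflow, and the only assumption about the system is that `squeue` is on the path.

```python
import subprocess
import time


def slurm_job_ids(user):
    """Return the job IDs that squeue currently lists for `user`."""
    # -h drops the header line, -o %i prints only the job ID column.
    result = subprocess.run(
        ["squeue", "-u", user, "-h", "-o", "%i"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.split()


def wait_until_job_gone(job_id, list_jobs, interval=5.0, timeout=3600.0,
                        sleep=time.sleep):
    """Poll list_jobs() until job_id is no longer listed.

    Returns True once the job has left the queue, False if the timeout
    is reached first. list_jobs and sleep are injectable for testing.
    """
    waited = 0.0
    while waited <= timeout:
        if str(job_id) not in list_jobs():
            return True
        sleep(interval)
        waited += interval
    return False
```

Something like `wait_until_job_gone(job_id, lambda: slurm_job_ids("<username>"))` would then return once the job has left the queue, at which point `beeflow core start` can be run.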
Adding this thought here to implement later: The output from "sacct -j
@jtronge What happens if sacct throws an exception with job_state? I was hoping to get it out of the queue but not sure it will be.
@jtronge I tested this on the prod system and it works. The only additional feature we may want to add at some point is for the user to be able to resume all paused workflows with one command.
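As a rough idea of what a "resume everything" helper might look like, here is a sketch that simply loops over `beeflow resume <wf_id>`. The `resume_all` name, the `(wf_id, state)` input, and the `"paused"` state string are all assumptions for illustration; how the list of workflows is obtained is left open.

```python
import subprocess


def resume_all(workflows, run=subprocess.run):
    """Call `beeflow resume <wf_id>` for every workflow marked as paused.

    workflows: iterable of (wf_id, state) pairs (hypothetical input format).
    run: injectable command runner, defaults to subprocess.run.
    Returns the list of workflow IDs that were resumed.
    """
    resumed = []
    for wf_id, state in workflows:
        # "paused" is an assumed state name, compared case-insensitively.
        if state.lower() == "paused":
            run(["beeflow", "resume", wf_id], check=True)
            resumed.append(wf_id)
    return resumed
```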
@jtronge What happens if sacct throws an exception with job_state? I was hoping to get it out of the queue but not sure it will be.
Oh sorry, just saw your question. I think this depends on what the calling code is doing. The TM background code could try to catch WorkerError and remove the job in that case.
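A minimal sketch of that idea, not the actual TM code: the `WorkerError` class below is a local stand-in for the real one, and `refresh_job_states`, the job dictionary layout, and the `query_state` callable are hypothetical names chosen for illustration.

```python
class WorkerError(Exception):
    """Stand-in for the WorkerError mentioned above (the real class is in BEE)."""


def refresh_job_states(job_queue, query_state):
    """Update each queued job's state, dropping jobs whose lookup fails.

    job_queue: list of dicts with 'job_id' and 'job_state' keys (assumed layout).
    query_state: callable(job_id) -> state string, raising WorkerError on failure.
    Returns the list of jobs that were removed from the queue.
    """
    removed = []
    for job in list(job_queue):  # iterate over a copy, since job_queue is mutated
        try:
            job["job_state"] = query_state(job["job_id"])
        except WorkerError:
            # The state lookup failed, so stop tracking this job.
            job_queue.remove(job)
            removed.append(job)
    return removed
```

Returning the removed jobs leaves it to the caller to decide whether to log them or report a final state for them.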
Should we put WIP back on to do this? I'm not sure there is any way to test it.
This PR pauses any running or waiting workflows when stopping beeflow. Tasks/jobs that are running or pending when beeflow is stopped will be updated, if possible, once beeflow starts up again. Addresses #783