Steven's fixes for checkpoint/restart, addressing issue #750.
After multiple restarts of a task that exceeds num_tries, the workflow should be put in a Failed state. I'm not sure if this should be the Archived state instead, which I think happens with other workflows?
The checkpoint-restart tests can be run locally, after starting beeflow, with ./ci/integration_test.py -t checkpoint_restart_failure,checkpoint_restart --timeout 10000 (this might take >10 minutes).
Steven's fixes for checkpoint/restart, addressing issue #750.
After multiple restarts of a task that exceeds
num_tries
, the workflow should be put in aFailed
state. I'm not sure if this should be theArchived
state instead, which I think happens with other workflows?The checkpoint-restart tests can be run locally, after starting beeflow, with
./ci/integration_test.py -t checkpoint_restart_failure,checkpoint_restart --timeout 10000
(this might take >10 minutes).