lanl / BEE

Other
13 stars 3 forks source link

Fix Checkpoint/Restart #776

Closed jtronge closed 4 months ago

jtronge commented 4 months ago

Steven's fixes for checkpoint/restart, addressing issue #750.

After multiple restarts of a task that exceeds num_tries, the workflow should be put in a Failed state. I'm not sure if this should be the Archived state instead, which I think happens with other workflows?

The checkpoint-restart tests can be run locally, after starting beeflow, with ./ci/integration_test.py -t checkpoint_restart_failure,checkpoint_restart --timeout 10000 (this might take >10 minutes).