lanl / BEE

Task Manager Resiliency: Check the status of any tasks that were running via Slurm. Resubmit any failed tasks. #675

Closed aquan9 closed 4 months ago

aquan9 commented 1 year ago

Pieces broken up from #614

pagrubel commented 1 year ago

Should this be optional or automatic? Should there be a configuration option that caps the number of resubmission attempts?

jtronge commented 8 months ago

I wonder whether checkpoint-restart would affect this. We might want to resubmit a task only if the num_tries in beeflow:CheckpointRequirement allows it. I'm not sure whether the number of times a task has already been restarted is stored in a database; if it isn't, we would lose that count on a task manager failure.
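
To make the concern concrete, here is a rough sketch of a bounded resubmission pass whose retry counter is kept in a database so it survives a task manager restart. This is not BEE's actual code: RetryStore, resubmit_failed, the max_tries parameter (standing in for num_tries), and the resubmit callback are all hypothetical names, and the set of Slurm states treated as failures is an assumption.

```python
import sqlite3

# Slurm job states treated as failures here (an assumption; adjust as needed).
FAILED_STATES = {"FAILED", "NODE_FAIL", "TIMEOUT", "BOOT_FAIL", "OUT_OF_MEMORY"}


class RetryStore:
    """Hypothetical store that persists per-task resubmission counts."""

    def __init__(self, path=":memory:"):
        # Pass a file path instead of ":memory:" to survive a process restart.
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS retries "
            "(task_id TEXT PRIMARY KEY, count INTEGER NOT NULL)"
        )
        self.conn.commit()

    def count(self, task_id):
        """Return how many times the task has been resubmitted so far."""
        row = self.conn.execute(
            "SELECT count FROM retries WHERE task_id = ?", (task_id,)
        ).fetchone()
        return row[0] if row else 0

    def increment(self, task_id):
        """Record one more resubmission and commit it immediately."""
        new_count = self.count(task_id) + 1
        self.conn.execute(
            "INSERT OR REPLACE INTO retries (task_id, count) VALUES (?, ?)",
            (task_id, new_count),
        )
        self.conn.commit()


def resubmit_failed(tasks, store, max_tries, resubmit):
    """Resubmit failed tasks that still have attempts left.

    tasks:     mapping of task id -> last known Slurm state string
    store:     RetryStore holding the persisted counters
    max_tries: cap on resubmissions per task (stand-in for num_tries)
    resubmit:  callback that actually resubmits a task (hypothetical)
    Returns the list of task ids that were resubmitted.
    """
    resubmitted = []
    for task_id, state in tasks.items():
        if state not in FAILED_STATES:
            continue  # still running or finished cleanly
        if store.count(task_id) >= max_tries:
            continue  # out of attempts; leave the task failed
        store.increment(task_id)  # persist the count before resubmitting
        resubmit(task_id)
        resubmitted.append(task_id)
    return resubmitted
```

Recording the count before calling resubmit means a crash between the two steps costs the task one attempt rather than granting it an extra one; whether that trade-off is right is one of the things worth discussing.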

pagrubel commented 8 months ago

I think this issue needs some discussion and clarification, maybe during a meeting.

jtronge commented 4 months ago

Resolved with PR #827