lanl / BEE

Other
13 stars 3 forks source link

Checkpoint/Restart is not working. #750

Closed pagrubel closed 4 months ago

pagrubel commented 7 months ago

Using the workflow in beeflow/data/cwl/bee_workflows/clamr-wf-checkpoint that worked previously. Querying the status of the workflow gives:

Running
clamr--TIMEOUT
ffmpeg--WAITING

The TIMEOUT state should spawn restart tasks.

pagrubel commented 7 months ago

These are the entries at the end of the task manager log:


[2023-12-04 17:24:09,374] beeflow.task_manager:update_jobs(): state: TIMEOUT
[2023-12-04 17:24:09,375] beeflow.task_manager:update_jobs(): TIMELIMIT/TIMEOUT task_checkpoint: {'file_path': 'checkpoint_output', 'num_tries': '3', 'container_path': 'checkpoint_output', 'enabled': 'True', 'restart_parameters': '-R', 'file_regex': 'backup[0-9]*.crx'}
[2023-12-04 17:24:09,544] beeflow.task_manager:update_task_state(): WFM not responding when sending task update.```
aquan9 commented 5 months ago

Steven fixed the retries type issue. It sounds like we need to go to wf_update.py. The workflows are getting stuck in the timeout state when they run out of retries, but they should move to the "fail" state. Currently the branch he has to fix the issue is working to do the restart, but it just gets stuck on this last state update.

pagrubel commented 4 months ago

Addressed in #776