Closed pagrubel closed 4 months ago
These are the entries at the end of the task manager log:
[2023-12-04 17:24:09,374] beeflow.task_manager:update_jobs(): state: TIMEOUT
[2023-12-04 17:24:09,375] beeflow.task_manager:update_jobs(): TIMELIMIT/TIMEOUT task_checkpoint: {'file_path': 'checkpoint_output', 'num_tries': '3', 'container_path': 'checkpoint_output', 'enabled': 'True', 'restart_parameters': '-R', 'file_regex': 'backup[0-9]*.crx'}
[2023-12-04 17:24:09,544] beeflow.task_manager:update_task_state(): WFM not responding when sending task update.```
Steven fixed the retries type issue. It sounds like we need to go to wf_update.py. The workflows are getting stuck in the timeout state when they run out of retries, but they should move to the "fail" state. Currently the branch he has to fix the issue is working to do the restart, but it just gets stuck on this last state update.
Addressed in #776
Using the workflow in beeflow/data/cwl/bee_workflows/clamr-wf-checkpoint that worked previously. Querying the status of the workflow gives:
The TIMEOUT state should spawn restart tasks.