REANA Workflow Controller
http://reana-workflow-controller.readthedocs.io/

workflows status update issue: DB says 'running' even though K8S pod failed #430

Closed: tiborsimko closed this issue 2 years ago

tiborsimko commented 2 years ago

Observed several workflow status update failures on a REANA 0.8.0 cluster. The workflow runtime pod run-batch-... died, but the workflow is still reported as being "running" in the DB.

Here are the logs from the job status consumer for one such example:

$ kubectl logs reana-workflow-controller-7cf76bc4d9-gvxfs job-status-consumer | grep -C 10 wwwwww
2022-01-11 18:47:15,763 | root | MainThread | INFO |  [x] Received workflow_uuid: wwwwww status: RunStatus.running
2022-01-11 18:48:02,674 | root | MainThread | INFO |  [x] Received workflow_uuid: wwwwww status: RunStatus.running
2022-01-11 18:48:49,676 | root | MainThread | INFO |  [x] Received workflow_uuid: wwwwww status: RunStatus.running
2022-01-11 18:49:05,179 | root | MainThread | INFO |  [x] Received workflow_uuid: wwwwww status: RunStatus.running
2022-01-11 18:52:26,842 | root | MainThread | INFO |  [x] Received workflow_uuid: wwwwww status: RunStatus.running
2022-01-11 18:52:26,884 | root | MainThread | INFO |  [x] Received workflow_uuid: wwwwww status: RunStatus.failed
du: cannot access '/var/reana/users/uuuuuu/workflows/wwwwww/workflow_ph/pythia_16/mg_processes/signal/Events/run_01/unweighted_events.lhe': No such file or directory
2022-01-11 18:52:34,644 | root | MainThread | ERROR | Unexpected error while processing workflow: Command '['du', '-s', '-b', '/var/reana/users/uuuuuu/workflows/wwwwww']' returned non-zero exit status 1.
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/reana_workflow_controller/consumer.py", line 104, in on_message
    _update_workflow_status(workflow, next_status, logs)
  File "/usr/local/lib/python3.8/site-packages/reana_workflow_controller/consumer.py", line 142, in _update_workflow_status
    Workflow.update_workflow_status(Session, workflow.id_, status, logs, None)
  File "/usr/local/lib/python3.8/site-packages/reana_db/models.py", line 664, in update_workflow_status
    raise e
  File "/usr/local/lib/python3.8/site-packages/reana_db/models.py", line 659, in update_workflow_status
    workflow.status = status
  File "/usr/local/lib/python3.8/site-packages/sqlalchemy/orm/attributes.py", line 279, in __set__
--
  File "/usr/local/lib/python3.8/site-packages/reana_db/utils.py", line 273, in get_disk_usage_or_zero
    disk_bytes = get_disk_usage(workspace_path, summarize=True)
  File "/usr/local/lib/python3.8/site-packages/reana_commons/utils.py", line 305, in get_disk_usage
    disk_usage_info = get_disk_usage_info_paths(directory, command, name_filter)
  File "/usr/local/lib/python3.8/site-packages/reana_commons/utils.py", line 270, in get_disk_usage_info_paths
    disk_usage_info = subprocess.check_output(command).decode().split()
  File "/usr/local/lib/python3.8/subprocess.py", line 415, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/usr/local/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['du', '-s', '-b', '/var/reana/users/uuuuuu/workflows/wwwwww']' returned non-zero exit status 1.

Note the exception raised by the du command. As a result, the status of workflow wwwwww wasn't updated in the DB, and it is still reported as "running" there.

Since we consider the DB to be the single source of truth for workflow statuses, it would be good to catch all such possible problems, not only the du-style exception but any other exception that may occur, so that the workflow status in the DB is properly updated and matches the K8S pod status.
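
For illustration, a more defensive get_disk_usage_or_zero (the helper visible in the traceback above) could swallow the du failure instead of propagating it. This is only a minimal sketch: the return-value handling of get_disk_usage is simplified here, and it is not the actual reana-db code.

import logging
import subprocess

from reana_commons.utils import get_disk_usage  # module path taken from the traceback above


def get_disk_usage_or_zero(workspace_path):
    """Return the workspace disk usage, or zero if it cannot be measured."""
    try:
        # Simplified: the real helper post-processes the du output further.
        return get_disk_usage(workspace_path, summarize=True)
    except (subprocess.CalledProcessError, OSError) as error:
        # A vanished file (as in the logs above) should not abort the whole
        # workflow status update; fall back to zero and log the problem.
        logging.error("Could not measure disk usage of %s: %s", workspace_path, error)
        return 0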

VMois commented 2 years ago

Possibly related issue: reanahub/reana-commons#303

VMois commented 2 years ago

The fastest solution for this particular issue is to catch exceptions in the workflow_status_change_listener function in reana_db/models.py. This looks like the only hook on the Workflow model.

Ensuring that no such exceptions occur and that the DB status always matches the K8S pod status will require some more work.

I will prepare a PR to fix workflow_status_change_listener, and I will also check the job-status-consumer for other places where we might end up with an inconsistent state.
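
For illustration, a minimal sketch of what catching exceptions in the listener could look like. The SQLAlchemy "set" event registration and the update_workflow_quota helper are assumptions made for the example, not the actual reana-db code.

import logging

from sqlalchemy import event

from reana_db.models import Workflow


def update_workflow_quota(workflow):
    """Hypothetical stand-in for the quota/disk-usage bookkeeping that can fail."""


@event.listens_for(Workflow.status, "set")
def workflow_status_change_listener(workflow, new_status, old_status, initiator):
    """React to workflow status changes without breaking the status update itself."""
    try:
        update_workflow_quota(workflow)
    except Exception as error:
        # Log and continue: a failure here must not prevent the new status
        # from being committed to the DB.
        logging.error(
            "Post-status-change hook failed for workflow %s: %s", workflow.id_, error
        )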

VMois commented 2 years ago

Instead of catching exceptions in the event listener, another possibility is to improve consumer.py.

If the event listener fails and aborts the DB commit, we can catch this in consumer.py and re-queue the message. This assumes that the error will be gone on the next try. In addition, we will need a mechanism to track how many times a message has been retried and act when a limit is reached (possibly related to #423).
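
A rough sketch of how such a retry limit could look as a method on the consumer class, assuming a kombu-style message object with ack/requeue. The x-retry-count header, the MAX_RETRIES value, and the _update_workflow_status helper are hypothetical, not existing REANA behaviour.

import logging

MAX_RETRIES = 3  # hypothetical limit; a real value would likely be configurable


def on_message(self, body, message):
    """Handle a status message, re-queueing it a bounded number of times on error."""
    retries = message.headers.get("x-retry-count", 0)  # hypothetical header convention
    try:
        self._update_workflow_status(body)  # hypothetical helper wrapping the DB update
        message.ack()
    except Exception as error:
        if retries < MAX_RETRIES:
            logging.warning(
                "Retrying message (attempt %d/%d): %s", retries + 1, MAX_RETRIES, error
            )
            # Note: a plain requeue() does not change headers, so actually
            # incrementing the counter would require re-publishing the message.
            message.requeue()
        else:
            logging.error("Giving up after %d retries: %s", retries, error)
            message.ack()  # drop the message so it does not loop forever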

I think the simpler solution is better for now. However, in case of an error, the disk quota will only be updated on the next workflow run or by the nightly updater, which might not be optimal.