Possibly related issue - reanahub/reana-commons#303
The fastest solution for this particular issue is to catch exceptions in the `workflow_status_change_listener` function in `reana_db/models.py`. This looks like the only hook on the `Workflow` model.

Ensuring that we never hit such an exception and that the DB status always matches the Kubernetes pod status will require some more work.
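A minimal sketch of that fastest fix, assuming the hook is a SQLAlchemy attribute event on `Workflow.status` and that the quota bookkeeping is the part that can fail; the actual listener signature in `reana_db/models.py` may differ, and `_update_disk_quota` is a hypothetical stand-in:

```python
import logging

from sqlalchemy import event

from reana_db.models import Workflow


def _update_disk_quota(workflow):
    """Hypothetical stand-in for the real quota bookkeeping (e.g. `du`-based)."""
    ...


@event.listens_for(Workflow.status, "set")
def workflow_status_change_listener(workflow, new_status, old_status, initiator):
    """Run side effects of a status change without aborting the DB commit."""
    try:
        _update_disk_quota(workflow)
    except Exception:
        logging.exception(
            "Post-status-change hook failed for workflow %s (%s -> %s); "
            "keeping the status transition so the DB matches the pod status.",
            workflow.id_,
            old_status,
            new_status,
        )
```

The point is only that the side effect is wrapped in `try/except`, so a failure there is logged instead of aborting the commit that records the new workflow status.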
I will prepare a PR to fix `workflow_status_change_listener`, and I will also check `job-status-consumer` and other places where we might end up with an inconsistent state.
Instead of catching exceptions in the event listener, another possibility is to improve `consumer.py`. If the event listener fails and aborts the DB commit, we can catch this in `consumer.py` and re-queue the message. This assumes that the error will be gone on the next try. In addition, we would need a mechanism to track how many times a message has been retried and do something when the limit is reached (possibly related to #423).
I think the easier solution is better for now. However, in case of error, the disk quota will only be updated on the next workflow run or by the nightly updater, which might not be optimal.
Observed several workflow status update failures on a REANA 0.8.0 cluster. The workflow runtime pod `run-batch-...` died, but the workflow is still reported as "running" in the DB. Here are logs from the job status consumer for one such example:
Note the exception due to the `du` command. As a result, the status of workflow `workflowwwwwww` wasn't updated in the DB, and it is still reported as "running" there. Since we consider the DB to be the single source of truth for workflow status, it would be good to catch all such possible problems, not only for the `du`-style exception but for any other exception that may occur, so that the workflow status in the DB is properly updated and corresponds to the K8s pod status.