Improve global error handling & logging

jpye-finch commented 1 year ago

Design principles

We need to handle errors in the system more gracefully. If there is any error whilst processing an application so that the data is never lost. The following approach aims to

Ensure data integrity in all situations
Provide clear signalling of process steps end-to-end
Allow the client to retry sending the same payload until data integrity and persistence can be verified
Allow the client to keep hold of the data until verified

[!NOTE]
A proof of concept of this has been implemented in this PR. The changes only apply for when a completely new record is created so the approach needs to be extended to other operations.

Requirements

Client should never purge draft unless it can verify the record was fully written. To do this, it needs to verify the record was received either by

the backend returning an unique id for the record and the client polling with this ID to see the status
the client deciding on a unique id (e.g. draft id for record creation) that is similarly used for polling

Status should never be OK and the client should never remove the local record before metadata is persisted in MongoDB, attachments are stored in Minio and search indexing happens in Elasticsearch. If any of these fails, the record stays in the queue indefinitely and alerts are sent to system admin.

If one of these steps fails, the backend needs to work so the client can safely retry without duplicate entries. In other words, the backend operations need to be idempotent. The outcome in the database should be the same even if you first unsuccessfully submit a record 10 times, then submit it successfully once and then try submitting it 5 more times. The output should be one record written in the database.

[!NOTE]
Proof of concept implements a message queue and retrying for create[Event]Registration operations.

Depends on

https://github.com/opencrvs/opencrvs-core/issues/6279

Sub-issues

[!WARNING] All of these need to be made so that data can never be lost. Corrections, request, making a correction and rejection Validation, marking as registered, certification, issuance, Rejecting a record, archive, reinstate We can skip these if we build new backend handlers in #7052

euanmillar commented 10 months ago

This ticket is to log the full graphql payload with any downstream backend error, then alert the admin so that when the bug is resolved, perhaps the data can be recovered in some way

rikukissa commented 6 months ago

I came across the issue again today while working in a branch with broken backend code. Record can indeed disappear if something goes wrong. Overall this issue should be considered as critical issue.

rikukissa commented 6 months ago

Zangetsu101 commented 3 months ago

A user might have submitted it at the same time with another user.

Would that even be possible? Because "Assignment" should prevent that from happening

Now if agree that we are preventing multiple users from performing some action on the same record due to assignments being in place, then hte problem boils down to handling multiple requests from the same user ( e.g. due to bad network, the first request timed out then sending the same req again).

What we could do is send a unique id with the request, generated at the client, which we can then use to identify the duplicate requests and respond accordingly. We are already doing something similar for the create*Registration mutations where we are treating the draftId as that unique identifier. We could implement something similar for the corrections

Zangetsu101 commented 3 months ago

As for making the mutations idempotent, we first need to have transactions across services. One of common patterns for this is "Saga" where for each step we need to implement a rollback which gets called if one of the later steps fail.

Another option could be to maintain some record against the transaction id (which we are kinda getting from client) about which steps have completed. Then we can retry only the failed steps and skip the completed ones. The request succeeds once all the steps are completed

rikukissa commented 3 months ago

We can agree that correction can be performed only by one user at one time, at least for time being. I do like the saga pattern idea. Wouldn't really have to be more complex than what you are proposing so implementing the "compensation" mechanism for the write operations in the flows mentioned on the ticket.

euanmillar commented 1 day ago

@rikukissa I wonder if, as we are doing this in #6386 if the content can be combined? Perhaps then this one could be closed ?

rikukissa commented 1 day ago

I moved this under #7052 for now as a lot of these improvements are made there

opencrvs / opencrvs-core