jpye-finch opened 1 year ago
This ticket is to log the full GraphQL payload along with any downstream backend error, then alert the admin, so that when the bug is resolved the data can perhaps be recovered in some way.
I came across the issue again today while working in a branch with broken backend code. A record can indeed disappear if something goes wrong. Overall, this should be considered a critical issue.
Related: https://github.com/opencrvs/opencrvs-core/issues/6466
A user might have submitted it at the same time as another user.
Would that even be possible? Because "Assignment" should prevent that from happening.
Now, if we agree that assignments prevent multiple users from performing an action on the same record, then the problem boils down to handling multiple requests from the same user (e.g. due to a bad network, the first request times out and the same request is sent again).
What we could do is send a unique id with the request, generated at the client, which we can then use to identify duplicate requests and respond accordingly. We are already doing something similar for the `create*Registration` mutations, where we treat the `draftId` as that unique identifier. We could implement something similar for the corrections, along the lines of the sketch below.
As for making the mutations idempotent, we first need to have transactions across services. One of the common patterns for this is the "Saga" pattern, where for each step we implement a rollback which gets called if one of the later steps fails.
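A rough sketch of that idea; the step names are illustrative only and do not correspond 1:1 to the actual OpenCRVS pipeline:

```ts
// Each saga step pairs a forward action with a compensation that undoes
// it. If step N fails, compensations for steps N-1..0 run in reverse.
interface SagaStep {
  name: string
  act: () => Promise<void>
  compensate: () => Promise<void>
}

async function runSaga(steps: SagaStep[]): Promise<void> {
  const done: SagaStep[] = []
  for (const step of steps) {
    try {
      await step.act()
      done.push(step)
    } catch (err) {
      // Roll back everything that already succeeded, newest first
      for (const completed of [...done].reverse()) {
        await completed.compensate()
      }
      throw err
    }
  }
}

// Illustrative usage with stubbed steps:
const step = (name: string): SagaStep => ({
  name,
  act: async () => console.log(`${name}: done`),
  compensate: async () => console.log(`${name}: rolled back`)
})

runSaga([step('persist-metadata'), step('store-attachments'), step('index-record')])
  .catch(() => console.error('saga failed and was rolled back'))
```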
Another option could be to maintain a record against the transaction id (which we are already sort of getting from the client) of which steps have completed. Then we can retry only the failed steps and skip the completed ones. The request succeeds once all the steps are completed.
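A sketch of that step-ledger alternative, again with an in-memory stand-in for what would really be a persisted collection:

```ts
// On retry, steps recorded as completed for this transaction id are
// skipped; only the remaining steps run. The ledger would live in
// MongoDB in practice so it survives restarts.
type Step = { name: string; act: () => Promise<void> }

const ledger = new Map<string, Set<string>>()

async function runResumable(transactionId: string, steps: Step[]): Promise<void> {
  const completed = ledger.get(transactionId) ?? new Set<string>()
  ledger.set(transactionId, completed)

  for (const step of steps) {
    if (completed.has(step.name)) continue // done on a previous attempt
    await step.act() // each step must individually be safe to re-run
    completed.add(step.name)
  }
  // The request succeeds only once every step has completed
}
```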
We can agree that a correction can be performed by only one user at a time, at least for the time being. I do like the saga pattern idea. It wouldn't really have to be any more complex than what you are proposing: implementing the "compensation" mechanism for the write operations in the flows mentioned in the ticket.
@rikukissa I wonder, as we are doing this in #6386, if the content can be combined? Perhaps then this one could be closed?
I moved this under #7052 for now, as a lot of these improvements are being made there.
Design principles
We need to handle errors in the system more gracefully, so that if there is any error whilst processing an application, the data is never lost. The following approach aims to ensure this.
Requirements
The client should never purge a draft unless it can verify the record was fully written. To do this, it needs to verify the record was received.
The status should never be OK, and the client should never remove the local record, before the metadata is persisted in MongoDB, the attachments are stored in Minio, and search indexing has happened in Elasticsearch. If any of these fails, the record stays in the queue indefinitely and alerts are sent to the system admin.
If one of these steps fails, the backend needs to work so that the client can safely retry without creating duplicate entries. In other words, the backend operations need to be idempotent: the outcome in the database should be the same even if you first unsuccessfully submit a record 10 times, then submit it successfully once, and then try submitting it 5 more times. The output should be exactly one record written to the database.
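A minimal sketch of a write satisfying that property, assuming a MongoDB collection with a unique index on a client-generated `draftId` (the collection and field names here are illustrative, not the actual OpenCRVS schema):

```ts
import { MongoClient } from 'mongodb'

// No matter how many times the same submission arrives, exactly one
// document ends up in the collection: the first successful attempt
// inserts it, and every retry matches the same document instead.
async function upsertRecord(
  client: MongoClient,
  draftId: string,
  record: Record<string, unknown>
): Promise<void> {
  const records = client.db('application').collection('records')
  // Normally created once at startup: the unique index makes concurrent
  // duplicates collide instead of both inserting.
  await records.createIndex({ draftId: 1 }, { unique: true })
  await records.updateOne(
    { draftId },
    { $set: { ...record, draftId } },
    { upsert: true }
  )
}
```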