opencrvs / opencrvs-core

A global solution to civil registration
https://www.opencrvs.org
Other
85 stars 67 forks source link

Improve global error handling & logging #6208

Open jpye-finch opened 10 months ago

jpye-finch commented 10 months ago

Design principles

We need to handle errors in the system more gracefully. If there is any error whilst processing an application so that the data is never lost. The following approach aims to

Requirements

Client should never purge draft unless it can verify the record was fully written. To do this, it needs to verify the record was received by the the backend returning an unique id for the record.

This should never happen before metadata is persisted in MongoDB, attachments are stored in Minio and search indexing happens in Elasticsearch. If any of these fails, then the return value must be 500.

If one of these steps fails, the backend needs to work so the client can safely retry without duplicate entries. In other words, the backend needs to clean up after itself in the case of an error. The outcome in the database should be the same even if you first unsuccessfully submit a record 10 times, then submit it successfully once and then try submitting it 5 more times. The output should be one record written in the database.

The endpoints that this apply to are at this phase the ones writing mission critical data with a lot of user input that might not be available anymore after a potential failure. Specifically these queries:

requestRegistrationCorrection(id: ID!, details: CorrectionInput!): ID!
createBirthRegistrationCorrection(
  id: ID!
  details: BirthRegistrationInput!
): ID!
createDeathRegistrationCorrection(
  id: ID!
  details: DeathRegistrationInput!
): ID!
createMarriageRegistrationCorrection(
  id: ID!
  details: MarriageRegistrationInput!
): ID!
createBirthRegistration(details: BirthRegistrationInput!): CreatedIds!
createDeathRegistration(details: DeathRegistrationInput!): CreatedIds!
createMarriageRegistration(details: MarriageRegistrationInput!): CreatedIds!

requestRegistrationCorrection, createBirthRegistrationCorrection, createDeathRegistrationCorrection, createMarriageRegistrationCorrection

createBirthRegistration

Next iteration – ideas for the future

Tech tasks

euanmillar commented 8 months ago

This ticket is to log the full graphql payload with any downstream backend error, then alert the admin so that when the bug is resolved, perhaps the data can be recovered in some way

rikukissa commented 4 months ago

I came across the issue again today while working in a branch with broken backend code. Record can indeed disappear if something goes wrong. Overall this issue should be considered as critical issue.

Related: https://github.com/opencrvs/opencrvs-core/issues/6466

rikukissa commented 4 months ago

Related: https://github.com/opencrvs/opencrvs-core/issues/6386

Zangetsu101 commented 3 weeks ago

A user might have submitted it at the same time with another user.

Would that even be possible? Because "Assignment" should prevent that from happening

Now if agree that we are preventing multiple users from performing some action on the same record due to assignments being in place, then hte problem boils down to handling multiple requests from the same user ( e.g. due to bad network, the first request timed out then sending the same req again).

What we could do is send a unique id with the request, generated at the client, which we can then use to identify the duplicate requests and respond accordingly. We are already doing something similar for the create*Registration mutations where we are treating the draftId as that unique identifier. We could implement something similar for the corrections

Zangetsu101 commented 3 weeks ago

As for making the mutations idempotent, we first need to have transactions across services. One of common patterns for this is "Saga" where for each step we need to implement a rollback which gets called if one of the later steps fail.

Another option could be to maintain some record against the transaction id (which we are kinda getting from client) about which steps have completed. Then we can retry only the failed steps and skip the completed ones. The request succeeds once all the steps are completed

rikukissa commented 3 weeks ago

We can agree that correction can be performed only by one user at one time, at least for time being. I do like the saga pattern idea. Wouldn't really have to be more complex than what you are proposing so implementing the "compensation" mechanism for the write operations in the flows mentioned on the ticket.