Diagnosis. During the recent submission table migrations, a constraint requiring a user ID could not be satisfied for some submission rows. The migration therefore did not succeed, but because it ran in its own process, it hung without reporting an error, leaving those rows unmigrated. On the next data ingest, the A/B database switch forcibly made the incomplete data durable. The primary cause of the data loss was therefore a partial migration that failed silently, without sending any alert.
Resolution. The resolution is to ensure that mutations to the database can never be made durable without complete guarantee checks. This can be done by (1) putting the data changes into a single transaction that either succeeds completely with all changes or fails with no changes, (2) alerting us to any failure by stopping server startup when a migration fails instead of silently letting the server start, and (3) as a failsafe, automatically backing up any data authored by the data or submission portal (e.g. the submissions table) before any data change.
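As an illustration of point (1), here is a minimal sketch of an all-or-nothing migration. It uses Python's built-in sqlite3 module with an in-memory database purely as a stand-in for Postgres so the example is self-contained. The table layout, the `apply_migration` function, and the sample values are assumptions made for illustration, not the actual schema or migration code.

```python
import sqlite3


def apply_migration(conn, owner_by_id):
    """Assign a user ID to each submission inside one transaction.

    Hypothetical helper: `with conn` commits only if every statement
    succeeds and rolls back all of them if any statement raises.
    """
    with conn:
        for submission_id, owner in owner_by_id.items():
            conn.execute(
                "UPDATE submissions SET user_id = ? WHERE id = ?",
                (owner, submission_id),
            )


def demo():
    # In-memory stand-in for the real database (assumed schema).
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE submissions (id INTEGER PRIMARY KEY, user_id TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO submissions VALUES (?, ?)", [(1, "legacy"), (2, "legacy")]
    )
    conn.commit()

    try:
        # Row 2 violates the NOT NULL constraint, so row 1 must be undone too.
        apply_migration(conn, {1: "alice", 2: None})
        failed = False
    except sqlite3.IntegrityError:
        failed = True  # the error propagates instead of being swallowed

    rows = conn.execute("SELECT id, user_id FROM submissions ORDER BY id").fetchall()
    return failed, rows
```

Calling `demo()` in this sketch reports that the migration failed and that both rows still hold their original values, i.e. no partially migrated state was left behind.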
The following specific tasks will be carried out to accomplish the resolution plan:
[x] Put ingest and migration in Postgres transactions. This will ensure consistency of the data, because each transaction either applies all of its changes or none of them.
[x] Move the migration step from a separate process into the entrypoint script, which first runs the backup, then the migration, then starts the server
[ ] Implement a backup script which dumps database content into a separate volume
[x] We will demonstrate that this resolution is in place with a dry run in which we attempt to push a known-bad migration to dev and verify that it fails loudly without putting the server in a data-incomplete state. At that point we can push the resolution to prod knowing it should function as expected, and we can be confident in letting users continue to enter data.
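The entrypoint ordering described in the tasks above (backup, then migration, then server start, stopping on any failure) could be sketched roughly as follows. The function name `run_startup` and its three callable parameters are hypothetical placeholders for the real backup script, migration command, and server launcher, which are not shown in this document.

```python
def run_startup(backup, migrate, start_server):
    """Run backup, then migration, and start the server only if both succeed.

    All three arguments are hypothetical zero-argument callables standing in
    for the real steps of the entrypoint script.
    """
    for name, step in (("backup", backup), ("migrate", migrate)):
        try:
            step()
        except Exception as exc:
            # Fail loudly: raising SystemExit with a message prints it to
            # stderr and exits non-zero, so the server never starts on top
            # of a failed backup or migration.
            raise SystemExit(f"startup aborted: {name} failed: {exc}") from exc
    start_server()
```

A dry run with a known-bad migration then amounts to passing a `migrate` callable that raises and checking that the process exits before `start_server` is ever called.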
There are a few related items that are not a direct part of the resolution but will help in data and state stability:
[ ] Also save data when losing focus on any study metadata field
[ ] Add "validated" column
[ ] Ensure submit button is never enabled so nothing will be marked "completed" and thought ready to move to mongo
[ ] Notify the user in the submission UI if a save fails
[ ] Update all submissions to the "in progress" state so none are marked "completed". Currently, only old submissions are marked completed, as a result of old application logic.