Open dianabarsan opened 1 year ago
Looking at Global Database Failures in feedback documents for LG-UG, PiH, Muso, and D-Tree, there have been 447 failures in 2023. Taking https://github.com/medic/cht-core/issues/8155 and https://github.com/medic/cht-core/issues/8149 together, and making the assumptions below, we can conclude that ~92% of Global Database Failures are caused when devices restore from sleep.
Global DB Classification | Count | Percentage of Global DB Failures |
---|---|---|
#8155 | 327 | 73% |
#8149 | 85 | 19% |
Other | 35 | 7.8% |
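
As a quick sanity check on the table above, the percentages and the ~92% figure can be recomputed from the raw counts. This is only a sketch of the arithmetic. The counts come from the table, and the helper name `share` is made up for illustration.

```python
# Counts copied from the table above (Global Database Failures, 2023).
counts = {"#8155": 327, "#8149": 85, "Other": 35}

total = sum(counts.values())


def share(n: int, total: int) -> float:
    """Return n as a percentage of total, rounded to one decimal place."""
    return round(100 * n / total, 1)


# Per-category share of all Global Database Failures.
shares = {label: share(n, total) for label, n in counts.items()}

# Combined share of the two issues attributed to devices restoring from sleep.
sleep_related = share(counts["#8155"] + counts["#8149"], total)

for label, pct in shares.items():
    print(f"{label}: {pct}%")
print(f"#8155 + #8149 combined: {sleep_related}%")
```

The three counts sum to the 447 failures quoted above, and the two sleep-related categories together account for roughly 92% of them.
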
We know that this particular database failure can be recovered by restarting the app. That said, users across multiple countries and projects (both LG-UG and Siaya) have independently responded to this by clearing their device data and re-synchronizing despite our attempts to help them recover.
This causes data loss and significant disturbance. Based on confirmed user behaviors and reports from project staff, one should assume the true rate is notably higher than measured. Forum Thread
Additional analysis of global database failures normalized by "CHT Effort" yields other insights which can be useful for understanding and mitigating this issue at scale:
- `Database has a global failure` and `indexed_db_went_bad Reason:Timeout` together
- `Database has a global failure` and `Failed to execute 'transaction' on 'IDBDatabase'` without `indexed_db_went_bad Reason:Timeout`
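
The two log signatures above could be applied to feedback documents with a small classifier along these lines. This is a minimal sketch, not the script used for the analysis. It assumes each feedback document's message and log have already been flattened into a single string, and the function name and category labels are hypothetical.

```python
# Substrings taken from the signatures listed above.
GLOBAL_FAILURE = "Database has a global failure"
TIMEOUT = "indexed_db_went_bad Reason:Timeout"
TRANSACTION = "Failed to execute 'transaction' on 'IDBDatabase'"


def classify(log_text: str) -> str:
    """Bucket one feedback document by the signatures found in its text.

    The returned labels are illustrative names, not CHT terminology.
    """
    if GLOBAL_FAILURE not in log_text:
        return "not_global_failure"
    if TIMEOUT in log_text:
        # Global failure together with the timeout marker.
        return "timeout"
    if TRANSACTION in log_text:
        # Global failure with transaction errors but no timeout marker.
        return "transaction_without_timeout"
    return "other"


example = (
    "Failed to execute 'transaction' on 'IDBDatabase': ...\n"
    "indexed_db_went_bad Reason:Timeout\n"
    "Database has a global failure"
)
print(classify(example))
```

The timeout check comes first because a document containing the timeout marker usually also contains transaction errors, so the order of the checks decides which bucket it lands in.
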
Describe the bug This is similar to https://github.com/medic/cht-core/issues/8149, except that the error is different. It is also triggered differently: it happens when an attempt to write times out, whereas the other error happens on a read.
Please insert statistics @kennsippell
To Reproduce Steps to reproduce the behavior:
- `chrome://inspect` in order to track network requests on the device.
- `_bulk_get` request finish.

Logs
The logs will most likely be riddled with `Failed to execute 'transaction' on 'IDBDatabase'` errors. Scroll through the errors until you see the `indexed_db_went_bad` error.

Example feedback doc created during event: https://gist.github.com/dianabarsan/8d4ca724f5833df8c602d89d045468a6
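
Rather than scrolling by hand, the `indexed_db_went_bad` entry can be located programmatically. The sketch below assumes the log has been exported as a list of plain-text lines, which is an assumption about the input rather than the feedback doc's actual schema. The function name is hypothetical.

```python
from typing import Optional, Sequence, Tuple

TIMEOUT_MARKER = "indexed_db_went_bad"
TRANSACTION_MARKER = "Failed to execute 'transaction' on 'IDBDatabase'"


def find_timeout(lines: Sequence[str]) -> Optional[Tuple[int, int]]:
    """Locate the first timeout entry in a log.

    Returns (index of the first line containing the timeout marker,
    number of transaction errors seen before it), or None if the
    marker never appears.
    """
    transaction_errors = 0
    for index, line in enumerate(lines):
        if TIMEOUT_MARKER in line:
            return index, transaction_errors
        if TRANSACTION_MARKER in line:
            transaction_errors += 1
    return None


sample_log = [
    "Failed to execute 'transaction' on 'IDBDatabase': ...",
    "Failed to execute 'transaction' on 'IDBDatabase': ...",
    "indexed_db_went_bad Reason:Timeout",
    "Failed to execute 'transaction' on 'IDBDatabase': ...",
]
result = find_timeout(sample_log)
```

If the log is stored newest-first, reverse the list before calling the function so that the index refers to chronological order.
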
Screenshots
Environment
Additional context This was a lot trickier to replicate than https://github.com/medic/cht-core/issues/8149. I think I only saw these errors on the 5th or 6th try, after I had reduced CPU performance on the phone by 4x. Just as with the other issue, reloading the app and keeping the phone screen on fixes the issue.