medic / cht-core

The CHT Core Framework makes it faster to build responsive, offline-first digital health apps that equip health workers to provide better care in their communities. It is a central resource of the Community Health Toolkit.
https://communityhealthtoolkit.org
GNU Affero General Public License v3.0

PouchDB `indexed_db_went_bad` due to device inactivity #8155

Open dianabarsan opened 1 year ago

dianabarsan commented 1 year ago

Describe the bug

This is similar to https://github.com/medic/cht-core/issues/8149, except that the error is different. It is also triggered differently: it happens when an attempt to write times out, whereas the other error happens on a read.

Please insert statistics @kennsippell

To Reproduce

Steps to reproduce the behavior:

  1. Replicate a user with a large number of documents (according to data, above 25000 docs)
  2. Create a script that takes 100 documents that the user should see, deletes all doc attachments and adds lots of keys to the docs. I added ~5000 uuid keys and values. The purpose is to make the bulkDocs payload large enough that there is enough of a time span to put the phone to sleep.
  3. Connect phone to desktop and use chrome://inspect in order to track network requests on the device.
  4. Run the script that updates/creates 100 ultra large docs.
  5. Sync. Watch the network tab and put the phone to sleep immediately after you see the outgoing _bulk_get request finish.
  6. Wait for a few minutes.
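
The document transformation in step 2 could look roughly like the sketch below. It is not the script that was actually used: the function names `inflate_doc` and `inflate_docs` are hypothetical, and fetching the docs and writing them back to the server (for example through CouchDB's `_bulk_docs` endpoint) is deliberately left out. It only shows how a doc is stripped of its attachments and padded with ~5000 uuid keys and values.

```python
import copy
import uuid


def inflate_doc(doc, extra_keys=5000):
    """Return a copy of `doc` without attachments and with many extra uuid fields.

    Hypothetical helper: the name and the default of 5000 keys are assumptions
    based on the description in step 2.
    """
    inflated = copy.deepcopy(doc)
    # CouchDB/PouchDB store attachments under the `_attachments` field.
    inflated.pop("_attachments", None)
    for _ in range(extra_keys):
        # uuid4 strings start with a hex digit, never "_", so they cannot
        # collide with reserved CouchDB fields such as `_id` or `_rev`.
        inflated[str(uuid.uuid4())] = str(uuid.uuid4())
    return inflated


def inflate_docs(docs, limit=100, extra_keys=5000):
    """Inflate at most `limit` docs, leaving the input list untouched."""
    return [inflate_doc(doc, extra_keys) for doc in docs[:limit]]
```

Keeping `_id` and `_rev` intact means the inflated docs can be written back as updates of the existing docs, which is what makes the next sync large.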

Logs

The logs will most likely be riddled with `Failed to execute 'transaction' on 'IDBDatabase'` errors. Scroll through them until you see the `indexed_db_went_bad` error.

Example feedback doc created during event: https://gist.github.com/dianabarsan/8d4ca724f5833df8c602d89d045468a6

Environment

Additional context

This was a lot trickier to replicate than https://github.com/medic/cht-core/issues/8149. I think I only saw these errors on the 5th or 6th try, after I had reduced CPU performance on the phone by 4x. Just as with the other issue, reloading the app and keeping the phone screen on fixes the issue.

kennsippell commented 1 year ago

Prevalence

Looking at Global Database Failures in feedback documents for LG-UG, PiH, Muso, and D-Tree, there have been 447 failures in 2023. If we consider https://github.com/medic/cht-core/issues/8155 and https://github.com/medic/cht-core/issues/8149 together and make the assumptions below, we can conclude that ~92% of Global Database Failures are caused when devices restore from sleep.

| Global DB Classification | Count | Percentage of Global DB Failures |
| --- | --- | --- |
| #8155 | 327 | 73% |
| #8149 | 85 | 19% |
| Other | 35 | 7.8% |
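
As a quick check of the arithmetic, the percentages and the ~92% figure can be recomputed from the counts above. This sketch only reproduces the division; how each failure was classified is taken as given, and the variable and function names are illustrative.

```python
# Counts of Global Database Failures in 2023, as listed in the table.
counts = {"#8155": 327, "#8149": 85, "Other": 35}


def percentage(part, total):
    """Share of `part` in `total`, expressed as a percentage."""
    return 100 * part / total


total = sum(counts.values())

# Failures attributed to devices restoring from sleep (#8155 and #8149 together).
sleep_related = counts["#8155"] + counts["#8149"]

for label, count in counts.items():
    print(f"{label}: {count} failures, {percentage(count, total):.1f}%")
print(f"Sleep-related: {sleep_related} of {total}, {percentage(sleep_related, total):.1f}%")
```

The three counts add up to the 447 failures quoted above, and the two sleep-related classes together account for 412 of them.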

Impact

We know that this particular database failure can be recovered by restarting the app. That said, users across multiple countries and projects (both LG-UG and Siaya) have independently responded to this by clearing their device data and re-synchronizing despite our attempts to help them recover.

This causes data loss and significant disruption. Based on confirmed user behavior and reports from project staff, one should assume the true failure rate is notably higher than measured. Forum Thread

Mitigation

Additional analysis of global database failures normalized by "CHT Effort" yields other insights which can be useful for understanding and mitigating this issue at scale:

  1. Significant variation project-by-project. For example, LG-UG has a 3.5x higher rate of failure than Muso.
  2. Significant variation by device. For example, TECNO B-Class devices within LG-UG have a 3.3x higher rate of failure than other devices.
  3. Users with more documents on their device experience an increasing rate of failure. The trend is roughly exponential.
  4. Users on a legacy XWalk version of cht-android experience a 3.9x higher rate of failure than users running cht-android 1.0 or comparable webview versions with a recent Chrome.

Assumptions