medic / cht-core

The CHT Core Framework makes it faster to build responsive, offline-first digital health apps that equip health workers to provide better care in their communities. It is a central resource of the Community Health Toolkit.
https://communityhealthtoolkit.org
GNU Affero General Public License v3.0

Starting an upgrade that involves view indexing can become stuck after indexing is finished #9617

Closed dianabarsan closed 6 days ago

dianabarsan commented 2 weeks ago

Describe the bug View indexing involves a lot of recursive chained promises that only resolve and update the state of the upgrade once indexing of the views for the new version has finished. This process can become blocked in some cases. This seems to be an edge case and is not reliably reproducible. It also seems to affect CouchDB 2 (CHT v4.2.2) and not later versions of the CHT.
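
To illustrate the failure mode described above, here is a minimal sketch. It is not the actual cht-core code, and not the fix that was applied. It shows a recursive promise chain that keeps querying until indexing is done: if a single request never settles, the whole chain never resolves. Wrapping each request in a timeout is one way to let the chain retry or fail instead. `withTimeout`, `waitForIndexing` and `queryView` are hypothetical names.

```javascript
// Hypothetical sketch: none of these names come from cht-core.

// Rejects if `promise` has not settled within `ms` milliseconds.
// The original request is not cancelled; it is only no longer waited for.
const withTimeout = (promise, ms) => {
  let timer;
  const timeout = new Promise((resolve, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
};

// Recursively queries until `queryView` reports that indexing is done.
// `queryView` is an assumed function returning a promise of { done: boolean }.
// A failed or timed-out request is retried, up to `retries` times in total.
const waitForIndexing = async (queryView, { timeoutMs = 1000, retries = 5 } = {}) => {
  let result;
  try {
    result = await withTimeout(queryView(), timeoutMs);
  } catch (err) {
    if (retries <= 0) {
      throw err;
    }
    return waitForIndexing(queryView, { timeoutMs, retries: retries - 1 });
  }

  if (result.done) {
    return true;
  }
  return waitForIndexing(queryView, { timeoutMs, retries });
};
```

Without the `withTimeout` wrapper, a `queryView()` call that hangs would leave `waitForIndexing` pending forever, which matches the stalled state described in this issue.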

To Reproduce Steps to reproduce the behavior:

  1. Install 4.2.2 on a sizeable database. Due to space concerns, I used a database of 800,000 docs and changed the docker compose files to limit CPU on the CouchDB containers:
    deploy:
      resources:
        limits:
          cpus: '0.1'
  2. Stage upgrade to latest.
  3. See that your upgrade process stalls after view indexes are built.
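
One way to confirm in step 3 that the view indexes really are built is to look for remaining `indexer` tasks in CouchDB's `/_active_tasks` response. The following is a minimal sketch, assuming the response has already been fetched and parsed as JSON. The helper names and the sample values are made up for illustration.

```javascript
// Hypothetical helpers: they work on an already parsed `/_active_tasks` response.

// Returns one summary entry per running view indexing task.
const summarizeIndexers = (activeTasks) => {
  return activeTasks
    .filter(task => task.type === 'indexer')
    .map(task => ({
      database: task.database,
      designDocument: task.design_document,
      progress: task.progress, // percentage, as reported by CouchDB
    }));
};

// True when no indexer task is currently running.
const noIndexersRunning = (activeTasks) => summarizeIndexers(activeTasks).length === 0;

// Sample payload with made-up values.
const sampleTasks = [
  { type: 'indexer', database: 'shards/00000000-ffffffff/example.1', design_document: '_design/example', progress: 42 },
  { type: 'database_compaction', database: 'shards/00000000-ffffffff/example.1', progress: 10 },
];

console.log(summarizeIndexers(sampleTasks));
console.log(noIndexersRunning(sampleTasks));
```

An empty list only means that no indexer is running at that moment. Combined with an upgrade page that does not move forward, it suggests the stall described here.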

Expected behavior Upgrades should happen smoothly.

Logs It looks like haproxy throws errors for some view indexing requests, but CouchDB never actually crashes.

Environment

Additional context Unfortunately, the workaround is manual and very technical and involves:

dianabarsan commented 1 week ago

I've replicated this locally and found a fix. The cause is still unclear and might need some serious networking debugging to find which service causes view indexing query requests to hang. This was my initial guess as to what was happening.

I do have a simpler workaround that doesn't involve manually editing docs in the database.

  1. When API gets stuck after view indexing, simply restart API.
  2. The admin upgrade page will say that the upgrade was interrupted. Click retry upgrade.
  3. Depending on the state of the database, you might see view indexing again. Depending on how many docs still need to be indexed, indexing might get stuck again. Go back to step 1 if that happens.
  4. Eventually, when indexing jobs are short enough not to trigger a request hang, you will get the button to complete the upgrade.
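
The steps above can be sketched as a loop. This is a toy model only, not a script to run against a real deployment: `attemptUpgrade` and `restartApi` are hypothetical callbacks standing in for the manual actions (retrying from the admin upgrade page, restarting API), and the numbers in the usage example are invented.

```javascript
// Toy model of the manual workaround. Nothing here talks to a real CHT instance.
const retryUntilIndexed = ({ attemptUpgrade, restartApi, maxAttempts = 10 }) => {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    // Start or retry the upgrade; 'stuck' means API hung after view indexing.
    const outcome = attemptUpgrade();
    if (outcome === 'ready') {
      return attempt; // Step 4: the upgrade can now be completed.
    }
    restartApi(); // Step 1: restart API, then retry on the next iteration.
  }
  throw new Error(`still stuck after ${maxAttempts} attempts`);
};

// Usage with a fake upgrade whose remaining indexing work shrinks on every retry.
let docsLeftToIndex = 800000;
const attemptsNeeded = retryUntilIndexed({
  attemptUpgrade: () => {
    const stuck = docsLeftToIndex > 100000; // made-up threshold for a request hang
    docsLeftToIndex = Math.floor(docsLeftToIndex / 4); // already indexed work is kept
    return stuck ? 'stuck' : 'ready';
  },
  restartApi: () => {},
});
console.log(attemptsNeeded);
```

The loop ends because each retry has less left to index, which is the assumption behind step 4.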

dianabarsan commented 1 week ago

I haven't replicated this in CouchDB 3, but for good measure I will apply the change to the latest CHT and test how indexing goes afterwards. Adding to 4.15.