medic / cht-core

The CHT Core Framework makes it faster to build responsive, offline-first digital health apps that equip health workers to provide better care in their communities. It is a central resource of the Community Health Toolkit.
https://communityhealthtoolkit.org
GNU Affero General Public License v3.0

Upgrade to CouchDb 3.4.x #9303

Open dianabarsan opened 3 months ago

dianabarsan commented 3 months ago

What feature do you want to improve?
CouchDB 3.4.0 will be released soon. It includes some changes that could significantly improve things for the CHT, such as:

Full release notes: https://docs.couchdb.org/en/latest/whatsnew/3.4.html

Describe the improvement you'd like
Upgrade the CHT to use CouchDB 3.4.0.

garethbowen commented 1 month ago

Added to 4.14 to at least investigate it.

We need to test for performance improvements and regressions, particularly when querying the changes feed with over 1,000 doc IDs.
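The changes-feed scenario above can be exercised with CouchDB's standard `_doc_ids` filter (a POST to `/<db>/_changes?filter=_doc_ids`). A minimal sketch of building such a request, with illustrative id values and a placeholder path:

```javascript
// Sketch only: builds the request shape for querying the changes feed with an
// explicit list of doc ids (CouchDB's standard `_doc_ids` filter). The
// database name and id format here are placeholders, not actual cht-core values.
const buildChangesRequest = (docIds) => ({
  path: '/medic/_changes?filter=_doc_ids&include_docs=true',
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ doc_ids: docIds }),
});

// Over 1,000 ids, as in the scenario to benchmark.
const ids = Array.from({ length: 1500 }, (_, i) => `doc-${i}`);
const req = buildChangesRequest(ids);
console.log(JSON.parse(req.body).doc_ids.length); // 1500
```

Timing the same request against 3.3.3 and 3.4.x should surface any regression.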

Use this issue as the MVP upgrade - we can look at turning on the optional new features later.

mrjones-plip commented 4 weeks ago

Related: CHT Core is looking to bifurcate online and offline search, which will leverage features in the latest version of Couch. We're currently on Couch 3.3.3, which doesn't have the latest search features; the new Couch version 3.4.2 has a stable version of Nouveau search. Upgrading Core ahead of the bifurcation would be great!

sugat009 commented 4 weeks ago

I've done a round of tests on upgrading CouchDB to version 3.4.2 on CHT.

  1. With the docker-helper 4.x setup, I updated the CouchDB container to version 3.4.2. According to the logs, all the services seem to be working. I also created a few users, contacts, etc.
  2. I ran unit tests, integration tests, and wdio tests locally, and in CI by creating a PR. One of the tests is failing at the moment. (CI Run) The test checks that the /<database>/_explain endpoint is restricted for offline users. The response from the database changed slightly between versions 3.3.3 and 3.4.2 (responses pasted below); the `fields` key is what is causing the test to fail.

     a. /<database>/_explain endpoint response v3.3.3
```json
{
  "dbname": "medic",
  "index": {
    "ddoc": null,
    "name": "_all_docs",
    "type": "special",
    "def": {
      "fields": [
        {
          "_id": "asc"
        }
      ]
    }
  },
  "partitioned": "undefined",
  "selector": {
    "type": {
      "$eq": "person"
    }
  },
  "opts": {
    "use_index": [],
    "bookmark": "nil",
    "limit": 25,
    "skip": 0,
    "sort": {},
    "fields": "all_fields",
    "partition": "",
    "r": [
      49
    ],
    "conflicts": false,
    "stale": false,
    "update": true,
    "stable": false,
    "execution_stats": false
  },
  "limit": 25,
  "skip": 0,
  "fields": "all_fields",
  "mrargs": {
    "include_docs": true,
    "view_type": "map",
    "reduce": false,
    "partition": null,
    "start_key": null,
    "end_key": "<MAX>",
    "direction": "fwd",
    "stable": false,
    "update": true,
    "conflicts": "undefined"
  }
}
```

    b. /<database>/_explain endpoint response v3.4.2

```json
{
  "dbname": "medic",
  "index": {
    "ddoc": null,
    "name": "_all_docs",
    "type": "special",
    "def": {
      "fields": [
        {
          "_id": "asc"
        }
      ]
    }
  },
  "partitioned": false,
  "selector": {
    "type": {
      "$eq": "person"
    }
  },
  "opts": {
    "use_index": [],
    "bookmark": "nil",
    "limit": 25,
    "skip": 0,
    "sort": {},
    "fields": [],
    "partition": "",
    "r": 1,
    "conflicts": false,
    "stale": false,
    "update": true,
    "stable": false,
    "execution_stats": false
  },
  "limit": 25,
  "skip": 0,
  "fields": [],
  "index_candidates": [],
  "selector_hints": [
    {
      "type": "json",
      "indexable_fields": [
        "type"
      ],
      "unindexable_fields": []
    }
  ],
  "mrargs": {
    "include_docs": true,
    "view_type": "map",
    "reduce": false,
    "partition": null,
    "start_key": null,
    "end_key": "<MAX>",
    "direction": "fwd",
    "stable": false,
    "update": true,
    "conflicts": "undefined"
  },
  "covering": false
}
```
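One option for the failing assertion (a hypothetical helper, not the existing test code) would be to normalize the `fields` value, since 3.3.3 reports all-fields queries as the string `"all_fields"` while 3.4.2 reports an empty array:

```javascript
// Hypothetical helper, not part of cht-core: returns true if the _explain
// response indicates all fields are requested, tolerating both the CouchDB
// 3.3.x shape ("all_fields" string) and the 3.4.x shape ([] empty array).
const requestsAllFields = (explainResponse) => {
  const { fields } = explainResponse;
  return fields === 'all_fields' || (Array.isArray(fields) && fields.length === 0);
};

console.log(requestsAllFields({ fields: 'all_fields' })); // true  (3.3.3 shape)
console.log(requestsAllFields({ fields: [] }));           // true  (3.4.2 shape)
console.log(requestsAllFields({ fields: ['name'] }));     // false (explicit field list)
```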

TODO: performance tests

Hareet commented 3 weeks ago

@sugat009 Thinking of production scenarios: is it worth testing an upgrade from couchdb 2.3.1 with existing data (cht-core 3.x) to couchdb 3.4.2? Do you feel that's already covered? Thanks!

mrjones-plip commented 3 weeks ago

Seconding Hareet's suggestion to test pre-couch 3.x upgrades. Since Core 4.4 added Couch 3.x, maybe try Core 4.2 -> Core branch @ ~master with couch 3.4.x?

sugat009 commented 3 weeks ago

@Hareet @mrjones-plip yes, we should try that if it's one of the production cases.

sugat009 commented 3 weeks ago

Did an upgrade test from a CHT 4.13 instance with CouchDB 3.3.3 to CouchDB 3.4.2, with 250K docs in the medic database. There was no document loss in the upgrade process. To check, I stored the hash of every document before the upgrade and compared it against the hash after the upgrade. I only checked the medic database, as it's the largest one, and the outcome probably holds for the other databases as well.

Next up: clustered upgrade test.

lorerod commented 3 weeks ago

Moved to 4.15.0 so as not to hold up the release.

sugat009 commented 3 weeks ago

The upgrade test from a CHT instance with version 4.13 and clustered CouchDB version 3.3.3 to 3.4.2 was successful without any document loss. The test procedure is the same as above for a single-node CouchDB.

Next: Performance tests

sugat009 commented 1 week ago

Performance tests for purging and replication have been done. The test scenario is as follows:

  1. Deploy an instance on EKS with CouchDB v3.3.3.
  2. Add ~5M docs using test-data-generator, with each CHW having ~15K docs (mostly reports).
  3. Log in through a client device (a different browser or a phone).
  4. Purge ~10K of those docs (reports).
  5. Log in through the same device as in step 3 and sync.
  6. Upgrade CouchDB to v3.4.2.
  7. Delete the purge databases.
  8. Repeat steps 3-5 for v3.4.2.
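For context, purging in CouchDB ultimately goes through the standard `POST /<database>/_purge` endpoint, whose body maps each doc id to the revision ids to purge. A minimal sketch of building that body (helper name is illustrative, not cht-core code):

```javascript
// Build the request body for CouchDB's POST /<database>/_purge endpoint from a
// list of { id, rev } pairs. The helper name is illustrative; how revisions
// are collected and batched is up to the caller (Sentinel, in the CHT's case).
const buildPurgeBody = (docs) => {
  const body = {};
  for (const { id, rev } of docs) {
    (body[id] = body[id] || []).push(rev);
  }
  return body;
};

const body = buildPurgeBody([
  { id: 'report-1', rev: '3-abc' },
  { id: 'report-1', rev: '2-old' },
  { id: 'report-2', rev: '1-def' },
]);
console.log(body); // { 'report-1': [ '3-abc', '2-old' ], 'report-2': [ '1-def' ] }
```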

The timing metrics are as follows:

  1. v3.3.3
    1. Replication before purging
      1. Polling data: 6.83s
      2. Actual replication: 50.76s
    2. Purging
      1. Time taken: 60.46 minutes
    3. Sync after purging
      1. Time taken: 32s
  2. v3.4.2
    1. Replication before purging
      1. Polling data: 7.03s
      2. Actual replication: 55.94s
    2. Purging
      1. Time taken: 1148.24 minutes (~19.18 hours)
    3. Sync after purging
      1. Polling data: ~5.01 minutes
      2. Actual replication: 37s
The metrics obtained for v3.3.3 vs v3.4.2 show a major difference in purging time. Should we run another test to confirm the validity of these timing measurements? In the meantime, I'm checking the server logs for anything unusual. CC: @jkuester @m5r @mrjones-plip

dianabarsan commented 1 week ago

I'm seriously worried about two metrics here:

I think we should at least re-run the tests and check if we get comparable times. And if yes, it's possible we might need to re-evaluate what happens for both these actions.

sugat009 commented 1 week ago

After checking the Sentinel and Couch logs, my guess is that the major bottleneck is the batch size of the purge documents: it was observed to decrease from 1000 down to a minimum of 15. From there on the processing is normal but slow. I've deleted the purge DBs and rerun the purge to check whether this was a one-time thing.
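To illustrate why a batch-size floor of 15 dominates the runtime, here is a toy model (not the actual Sentinel logic) of halving-on-failure batch sizing: starting at 1000, a run of failures reaches the floor within a handful of steps, after which every request moves only ~15 docs.

```javascript
// Toy model only: halve the batch size on failure (down to a floor), double it
// on success (up to a ceiling). The floor/ceiling values match what was
// observed in the logs (15 and 1000), but the policy itself is an assumption.
const nextBatchSize = (current, succeeded, { floor = 15, ceiling = 1000 } = {}) => {
  if (succeeded) {
    return Math.min(current * 2, ceiling);
  }
  return Math.max(Math.floor(current / 2), floor);
};

let size = 1000;
const sizes = [];
for (let i = 0; i < 7; i++) {
  size = nextBatchSize(size, false); // simulate consecutive failures
  sizes.push(size);
}
console.log(sizes); // 500, 250, 125, 62, 31, 15, 15
```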

dianabarsan commented 1 week ago

Yeah, that was my suspicion: we need to rework purging to hit different endpoints that are more efficient now.

dianabarsan commented 1 week ago

I've created an issue for this: https://github.com/medic/cht-core/issues/9642

mrjones-plip commented 1 week ago

@dianabarsan - OK to make #9642 a sub-issue of this ticket? I think we don't want to release the couch v3.4.2 upgrade without using new endpoints and sub-issues are a nice new feature of GH that we can leverage to show the dependencies!

no biggie

dianabarsan commented 1 week ago

Added it as a sub-issue.

sugat009 commented 1 week ago

The second run of purging in deployment with CouchDB version 3.4.2 has been completed. The metrics and logs are similar.

  1. Purging
    1. Time taken: 1179.79 minutes = 19.66 hours
  2. Sync after purging
    1. Polling data: 5.3 minutes
    2. Actual replication: 24s

latin-panda commented 4 days ago

Moving it to 4.16 so as not to block 4.15.

sugat009 commented 3 days ago

Update: Changes from #9651 made the purge take only 47.108 minutes in CouchDB version 3.4.2 compared to ~19 hours from before.

dianabarsan commented 2 days ago

Yes, the good news is that we don't need to make any code changes to get quick purging; we just need to adjust the changes optimization counter, so minimal effort is required here.