dianabarsan opened 3 months ago
Added to 4.14 to at least investigate it.
We need to test for performance improvements and regressions, particularly when querying the changes feed with over 1,000 doc IDs.
Use this issue as the MVP upgrade - we can look at turning on the optional new features later.
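For the changes-feed testing mentioned above, a harness along these lines could compare timings between 3.3.3 and 3.4.2 (a sketch only; the URL, credentials, and function name are placeholders, not CHT code — `POST /{db}/_changes?filter=_doc_ids` is the standard CouchDB API):

```ts
// Sketch: time a _changes request filtered by a large list of doc IDs.
const COUCH_URL = 'http://localhost:5984'; // placeholder
const AUTH = 'Basic ' + Buffer.from('admin:password').toString('base64'); // placeholder

async function timeChangesFeed(db: string, docIds: string[]): Promise<number> {
  const start = Date.now();
  const res = await fetch(`${COUCH_URL}/${db}/_changes?filter=_doc_ids`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json', Authorization: AUTH },
    body: JSON.stringify({ doc_ids: docIds }),
  });
  await res.json(); // drain the body so the full response is included in the timing
  return Date.now() - start;
}

// e.g. run with >1,000 IDs against both Couch versions and compare:
// console.log(await timeChangesFeed('medic', ids));
```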
Related: CHT Core is looking to bifurcate online and offline search, which will leverage features in the latest version of Couch. We're currently on Couch 3.3.3, which doesn't have the latest search features; the new Couch 3.4.2 has a stable version of Nouveau search. Upgrading Core ahead of the bifurcation would be great!
I've done a round of tests on upgrading CouchDB to version 3.4.2 on CHT. The /<database>/_explain endpoint is restricted to offline users, and its response has changed a bit from version 3.3.3 to 3.4.2 (responses pasted below). The "fields" key is the one causing the test to fail at the moment.
a. /<database>/_explain endpoint response, v3.3.3:
```json
{
  "dbname": "medic",
  "index": {
    "ddoc": null,
    "name": "_all_docs",
    "type": "special",
    "def": {
      "fields": [
        { "_id": "asc" }
      ]
    }
  },
  "partitioned": "undefined",
  "selector": {
    "type": { "$eq": "person" }
  },
  "opts": {
    "use_index": [],
    "bookmark": "nil",
    "limit": 25,
    "skip": 0,
    "sort": {},
    "fields": "all_fields",
    "partition": "",
    "r": [ 49 ],
    "conflicts": false,
    "stale": false,
    "update": true,
    "stable": false,
    "execution_stats": false
  },
  "limit": 25,
  "skip": 0,
  "fields": "all_fields",
  "mrargs": {
    "include_docs": true,
    "view_type": "map",
    "reduce": false,
    "partition": null,
    "start_key": null,
    "end_key": "<MAX>",
    "direction": "fwd",
    "stable": false,
    "update": true,
    "conflicts": "undefined"
  }
}
```
b. /<database>/_explain endpoint response, v3.4.2:
```json
{
  "dbname": "medic",
  "index": {
    "ddoc": null,
    "name": "_all_docs",
    "type": "special",
    "def": {
      "fields": [
        { "_id": "asc" }
      ]
    }
  },
  "partitioned": false,
  "selector": {
    "type": { "$eq": "person" }
  },
  "opts": {
    "use_index": [],
    "bookmark": "nil",
    "limit": 25,
    "skip": 0,
    "sort": {},
    "fields": [],
    "partition": "",
    "r": 1,
    "conflicts": false,
    "stale": false,
    "update": true,
    "stable": false,
    "execution_stats": false
  },
  "limit": 25,
  "skip": 0,
  "fields": [],
  "index_candidates": [],
  "selector_hints": [
    {
      "type": "json",
      "indexable_fields": [ "type" ],
      "unindexable_fields": []
    }
  ],
  "mrargs": {
    "include_docs": true,
    "view_type": "map",
    "reduce": false,
    "partition": null,
    "start_key": null,
    "end_key": "<MAX>",
    "direction": "fwd",
    "stable": false,
    "update": true,
    "conflicts": "undefined"
  },
  "covering": false
}
```
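For reference, both responses above can be reproduced with a plain Mango `_explain` request like the following (a sketch; host and credentials are placeholders). Note how the top-level "fields" key is "all_fields" on 3.3.3 but [] on 3.4.2:

```ts
// POST /{db}/_explain with the same selector used in the failing test.
const AUTH = 'Basic ' + Buffer.from('admin:password').toString('base64'); // placeholder

const res = await fetch('http://localhost:5984/medic/_explain', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json', Authorization: AUTH },
  body: JSON.stringify({ selector: { type: { $eq: 'person' } }, limit: 25 }),
});
console.log(await res.json());
```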
TODO: performance tests
@sugat009 Thinking of production scenarios: is it worth testing an upgrade from couchdb 2.3.1 with existing data (cht-core 3.x) to couchdb 3.4.2? Do you feel that's already covered? Thanks!
Seconding Hareet's suggestion to test pre-couch 3.x upgrades. Since Core 4.4 added Couch 3.x, maybe try Core 4.2 -> Core branch @ ~master with couch 3.4.x?
@Hareet @mrjones-plip yes, we should try that if it's one of the production cases.
Did an upgrade test from an instance on CHT version 4.13 with CouchDB 3.3.3 to CouchDB 3.4.2, with 250K docs in the medic database. There was no document loss in the upgrade process. The way I checked was to store the hash of every document before the upgrade and compare each post-upgrade hash against the stored one. I only checked the medic database, as it's the largest one, and the outcome probably holds for the other databases as well.
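Roughly, the check looked like this (a sketch of the approach, not the actual script; it assumes CouchDB serialises each doc consistently across versions, and a single `_all_docs` call is fine for a test instance of this size):

```ts
// Hash every doc in a database via _all_docs?include_docs=true, so the maps
// captured before and after the upgrade can be compared id by id.
import { createHash } from 'node:crypto';

const AUTH = 'Basic ' + Buffer.from('admin:password').toString('base64'); // placeholder

async function hashAllDocs(couchUrl: string, db: string): Promise<Map<string, string>> {
  const res = await fetch(`${couchUrl}/${db}/_all_docs?include_docs=true`, {
    headers: { Authorization: AUTH },
  });
  const { rows } = await res.json();
  const hashes = new Map<string, string>();
  for (const row of rows) {
    hashes.set(row.id, createHash('sha256').update(JSON.stringify(row.doc)).digest('hex'));
  }
  return hashes;
}

// const before = await hashAllDocs(url, 'medic'); // on 3.3.3, pre-upgrade
// const after  = await hashAllDocs(url, 'medic'); // on 3.4.2, post-upgrade
// then assert both maps have the same size and identical hashes per id.
```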
Next: clustered upgrade test
Moved to 4.15.0 so as not to hold up the release.
The upgrade test from a CHT instance with version 4.13 and clustered CouchDB version 3.3.3 to 3.4.2 was successful without any document loss. The test procedure is the same as above for a single-node CouchDB.
Next: Performance tests
Performance tests for purging and replication have been done. The test scenario is as follows:
The timing metrics are as follows.
The metrics obtained for v3.3.3 vs v3.4.2 show a major difference in purging time. Should we run another test to confirm the validity of these timing measurements? In the meantime, I'm checking server logs for anything unusual. CC: @jkuester @m5r @mrjones-plip
I'm seriously worried about two metrics here:
I think we should at least re-run the tests and check if we get comparable times. And if yes, it's possible we might need to re-evaluate what happens for both these actions.
After checking the Sentinel and Couch logs, I'm guessing the major bottleneck is the batch size of the purge documents. The batch size was observed to decrease from 1000 down to a minimum of 15; from there on, the processing is normal but slow. I've deleted the purge DBs and rerun the purge to check whether this was a one-time thing.
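For context, the pattern at play is roughly the following (a simplified sketch of adaptive batch sizing, not the actual Sentinel code; the function names are hypothetical and the floor value is taken from the logs above):

```ts
// Start at 1000 IDs per purge request and halve the batch whenever a request
// fails or times out, down to a floor — matching the 1000 -> 15 drop seen in
// the logs. Once the batch is that small, progress continues but slowly.
async function purgeInBatches(
  ids: string[],
  purgeBatch: (batch: string[]) => Promise<void> // hypothetical helper that calls _purge
): Promise<void> {
  let batchSize = 1000;
  const MIN_BATCH = 15;
  let i = 0;
  while (i < ids.length) {
    const batch = ids.slice(i, i + batchSize);
    try {
      await purgeBatch(batch);
      i += batch.length;
    } catch (err) {
      if (batchSize <= MIN_BATCH) throw err; // give up once we hit the floor
      batchSize = Math.max(MIN_BATCH, Math.floor(batchSize / 2));
    }
  }
}
```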
yea, that was my suspicion, that we need to rework purging to hit different endpoints that are more efficient now.
I've created an issue for this: https://github.com/medic/cht-core/issues/9642
@dianabarsan - OK to make #9642 a sub-issue of this ticket? I think we don't want to release the couch v3.4.2 upgrade without using new endpoints and sub-issues are a nice new feature of GH that we can leverage to show the dependencies!
no biggie
Added it as a sub-issue
The second run of purging in deployment with CouchDB version 3.4.2 has been completed. The metrics and logs are similar.
Moving it to 4.16 to not block 4.15
Update: Changes from #9651 made the purge take only 47.108 minutes on CouchDB 3.4.2, compared to ~19 hours before.
Yes, the good news is that we don't need to make any code changes to get quick purging; we just need to adjust the changes optimization counter, so minimal effort is required here.
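If I'm reading this right, that's presumably CouchDB's `changes_doc_ids_optimization_threshold` setting (my assumption — it controls how many doc IDs a `_doc_ids`-filtered changes request can carry before CouchDB falls off the optimised code path), e.g.:

```ini
; local.ini — assumption: this is the "changes optimization counter" above.
; Default is 100; raising it keeps the optimised _doc_ids path active for
; larger ID lists, e.g. the >1,000-ID changes requests mentioned earlier.
[couchdb]
changes_doc_ids_optimization_threshold = 1000
```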
What feature do you want to improve?
CouchDB 3.4.0 will be released soon. It includes some changes that could improve things for the CHT significantly, such as:
Full release notes: https://docs.couchdb.org/en/latest/whatsnew/3.4.html
Describe the improvement you'd like
Upgrade CHT to use CouchDB 3.4.0