medic / cht-conf

A command-line interface for configuring Community Health Toolkit applications
https://communityhealthtoolkit.org
GNU Affero General Public License v3.0

Predict config changes for purging #563

Open m5r opened 1 year ago

m5r commented 1 year ago

Describe the issue

Ensure community members can own sustainable CHT deployments without Medic directly involved

App developers can easily visualize and quantify the impact of a change to config for purging

Additional context: Related allies OKR

jkuester commented 1 year ago

Behavior Overview

(@m5r please correct this if it is wrong!)

A new cht-conf action, dry-run-purge-config, has been added. When you execute this action, it calls the new API endpoint with your current purge config and prints the results. The results will indicate:

m5r commented 9 months ago

As noted in the initial cht-core PR, we tried to solve this by running the purging code without the database mutations (a dry run). We ran into the same limits as actual purging: slow queries made a dry run take hours to complete. Here is a copy of our test results:

I got some disappointing news about our purging dry run solution 😞

This morning I started a dry run of a purge on a clone of Muso-Mali, on a beefy machine with similar specs: Xeon E5-2686 v4 @ 2.30GHz, 256 GB of RAM, ~650 GB of data stored on a 1.5 TB disk. I'm using a fork of CHT 3.13.0 with the purging dry run API living on the temporary branch 3.13.0-FR-dry-run-purging.

It's the beginning of the night over here and the dry run is still going. It took nearly 5 hours to simulate purging contacts, processing ~10k records with each batched request. Our assumption was that queries were cheap and that mutating the data was the expensive part that makes purging so slow. It turns out the queries are expensive too: we're seeing roughly the same performance as actual purging despite using CouchDB views.

CPU usage averages 35% with spikes to 80%. Any loss of connection between cht-conf and the API during the dry run results in wasted CPU usage: cht-conf can't reconnect to the API to wait for the results, while the API keeps running the dry run.

With all this, it's safe to say we cannot move forward with this solution and we should go back to the design step for this feature.
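
For readers new to the thread, the dry run idea discussed above can be illustrated with a minimal sketch. It assumes a purge function shaped like the `fn` of a CHT purge config (it receives a user context, a contact, its reports and messages, and returns the ids to purge). The `dryRunPurge` helper, the summary fields, and the in-memory sample data are hypothetical and are not the API from the 3.13.0-FR-dry-run-purging branch. In a real deployment each batch would come from CouchDB view queries, which is the step the test above found to be slow.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// Hypothetical purge config: purge reports older than one year.
const purgeConfig = {
  fn: (userCtx, contact, reports, messages) => {
    const cutoff = Date.now() - 365 * DAY_MS;
    return reports
      .filter(report => report.reported_date < cutoff)
      .map(report => report._id);
  },
};

// Hypothetical helper: evaluates the purge function for every contact in
// every batch and tallies the outcome. Nothing is deleted or written.
function dryRunPurge(batches, purgeFn, userCtx) {
  const summary = { contactsProcessed: 0, wouldPurge: 0 };
  for (const batch of batches) {
    for (const { contact, reports, messages } of batch) {
      const ids = purgeFn(userCtx, contact, reports, messages) || [];
      summary.contactsProcessed += 1;
      summary.wouldPurge += ids.length;
    }
  }
  return summary;
}

// Illustrative in-memory data standing in for one batched request.
const batches = [[
  {
    contact: { _id: 'c1' },
    reports: [
      { _id: 'r1', reported_date: 0 },          // very old, would be purged
      { _id: 'r2', reported_date: Date.now() }, // recent, would be kept
    ],
    messages: [],
  },
  { contact: { _id: 'c2' }, reports: [], messages: [] },
]];

const summary = dryRunPurge(batches, purgeConfig.fn, { roles: ['chw'] });
console.log(summary);
```

The sketch only shows where the mutations are skipped. It does not address the cost of fetching each contact's reports and messages, which the results above identify as the bottleneck.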