medic / cht-sync

Data synchronization between CouchDB and PostgreSQL for the purpose of analytics.
GNU General Public License v3.0
3 stars 5 forks source link

add tasks back to MoH KE #157

Open witash opened 1 month ago

witash commented 1 month ago

Describe the issue MoH KE provisioned 1TB storage for the postgres database, in a postgres managed by kubernetes The persistent volume for this cannot be easily resized. 1TB was not enough to store the entire postgres database, and the initial dbt run to populate the database was failing. To get the dashboards working as quickly as possible, all documents with type 'task' were deleted. Since tasks accounted for 2/3 of the storage usage (before material views), this freed up enough space to allow the initial dbt run to complete However, tasks are still useful for analytics and debugging, and deleting them is not a permanent solution.

Describe the improvement you'd like

  1. When all the materialized views are finished, calculate exact storage requirements by adding a buffer of 1/2 (for temp tables and later additions) to the storage usage when everything is finished
  2. Create a new postgres instance with at least this much storage; either it could be a new node in the cluster, or, if the external postgres has enough free storage, switch to using that
  3. Create a new cht-sync instance using that postgres. It will take some time (about one week) to do the initial run, so leave the old one running.
  4. when it is finished, switch dashboards to the new postgres and stop and delete the old cht-sync

Describe alternatives you've considered There's other ways to get the old tasks, I think this will be the simplest overall. We could try and save time by pgdump/restore the current database, but it then gets complicated with how merge the old tasks

  1. To save time for the inital dbt run, pgdump and restore to the new postgres, excluding the source table. This may take a long time but cht-sync can keep updating while that is running.
  2. in the new database, truncate couch2pg_progress
  3. Run a new instance of cht-sync, without dbt, poitning to the new database. This will sync
  4. when its finished (1-3 days), stop it and restore couch2pg_progress from the saved table
  5. Switch the normal cht-sync to the new postgres. This will cause dashboards to be out of date again; it's starting over from where it was at step 4. That should be only a few days (at most week) of data so it should catch up quickly. couch2pg will try to sync some tasks that are already in the db; it will update saved_timestamp and treat them like updates, but it will not create duplicates.
dianabarsan commented 1 month ago

I think going with the simple solution is the better way, and involves less human intervention, less steps and less potential ways where this can go wrong. Thanks for writing this up