medic / cht-sync

Data synchronization between CouchDB and PostgreSQL for the purpose of analytics.
GNU General Public License v3.0

Run dbt models only on the relevant databases #176

Open 1yuv opened 3 weeks ago

1yuv commented 3 weeks ago

Describe the issue It looks like dbt models currently run on all synced CouchDB databases. While there is an issue with cht-sync syncing multiple databases (#165), if you specify any database other than medic, it will still sync. CouchDB uses separate databases for different types of data, and running the default dbt models on all of them is not helpful and can actually cause a performance problem.

Describe the improvement you'd like We should separate the CouchDB databases from the models we want to run on them. For example, the dbt models you want to run on medic will be totally different from those for the medic-users-meta or medic-sentinel databases.

This is also critical from a performance point of view. There is no point in running dbt models where they are not going to make any changes. The medic-users-meta and medic-sentinel databases can be as big as medic, and running medic's models on those databases is not helpful.

Related: https://github.com/medic/cht-pipeline/issues/168, since as of now we only have models defined for the medic database.

dianabarsan commented 3 weeks ago

This is a serious shift from how cht-sync works now and how couch2pg worked before, and would likely require a significant overhaul of how data is managed.

I don't think this is a straightforward technical issue, because one solution requires having multiple data source tables in Postgres, one for each CouchDB database, or at least keeping medic separate from the others. This would mean that any query that relies on all data being present in a single source would break. Keeping duplicate data is problematic for disk space, so maintaining both a common table and several separate tables is probably very costly.
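To make the trade-off concrete, here is a minimal sketch of the "one table per database" option. The table names, column names, and the view are all hypothetical and do not reflect the actual cht-sync schema; the `UNION ALL` view illustrates why queries written against a single shared table would need to change.

```sql
-- Hypothetical sketch: one source table per CouchDB database instead of a
-- single shared table. Names are illustrative, not the real cht-sync schema.
CREATE TABLE couchdb_medic            (doc jsonb, saved_timestamp timestamptz);
CREATE TABLE couchdb_medic_sentinel   (doc jsonb, saved_timestamp timestamptz);
CREATE TABLE couchdb_medic_users_meta (doc jsonb, saved_timestamp timestamptz);

-- Any existing query that assumes a single table would then need a view
-- (or rewrite) that unions the per-database tables back together:
CREATE VIEW couchdb_all AS
  SELECT 'medic' AS source_db, doc, saved_timestamp FROM couchdb_medic
  UNION ALL
  SELECT 'medic-sentinel', doc, saved_timestamp FROM couchdb_medic_sentinel
  UNION ALL
  SELECT 'medic-users-meta', doc, saved_timestamp FROM couchdb_medic_users_meta;
```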

Another solution could be to store which database each document was copied from, and update the base models to exclude non-medic docs. This would still mean that base models run over all documents, but subsequent models would not. We would need to quantify how big a performance improvement this yields.
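This option could look something like the following dbt base model. It is a sketch under the assumption that the sync process writes a `source_db` column recording the originating CouchDB database; the source name and column names are illustrative, not the actual cht-pipeline model.

```sql
-- Hypothetical base model: assumes the sync writes a source_db column
-- identifying which CouchDB database each document came from.
{{ config(materialized='incremental') }}

SELECT
    doc->>'_id' AS uuid,
    doc,
    saved_timestamp
FROM {{ source('couchdb', 'couchdb') }}
WHERE source_db = 'medic'  -- downstream models never see non-medic docs
{% if is_incremental() %}
  AND saved_timestamp > (SELECT max(saved_timestamp) FROM {{ this }})
{% endif %}
```

With this filter in the base model, only the base layer scans all rows; everything downstream operates on medic documents alone.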

witash commented 3 weeks ago

> Another solution could be to store which database each document was copied from

This seems like the simplest solution, and I don't think having extra rows in the source table that are then ignored in all downstream models will be a significant performance issue.

1yuv commented 3 weeks ago

> because one solution requires having multiple data source tables in Postgres

Is it not possible for cht-sync to store data in two separate tables? Since cht-sync uses cht-couch2pg under the hood to sync data, could it not insert the medic database's data into one table and medic-users-meta's (or any other database's) data into another table?

> I think keeping duplicate data is problematic for disk space, so having both - one common table and several separate tables - is probably very costly.

I am not suggesting storing duplicate data. The main concern of this issue is to run only the dbt models that are required or useful for a given data set. It does not make any sense to run models like contact or patient on any database other than medic.
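One way to express "only the relevant models per database" with existing dbt features is tagging. This is a hedged sketch, not the current cht-pipeline layout: the project name, model folders, and tag names below are all hypothetical.

```yaml
# dbt_project.yml (hypothetical): tag models by the CouchDB database
# whose data they are meant to transform.
models:
  cht_pipeline:
    contacts:
      +tags: ['medic']
    patients:
      +tags: ['medic']
    feedback:
      +tags: ['medic-users-meta']
```

A scheduler could then run `dbt run --select tag:medic` against the medic data and `dbt run --select tag:medic-users-meta` separately, so each database's sync only triggers the models that apply to it.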

dianabarsan commented 3 weeks ago

> Is it not possible for cht-sync to store data in two separate tables? Since cht-sync uses cht-couch2pg under the hood to sync data, could it not insert the medic database's data into one table and medic-users-meta's (or any other database's) data into another table?

If anyone wrote models or queries directly against the main database, those would be broken. So it's a potentially breaking change.

dianabarsan commented 3 weeks ago

Not to mention that migrating to the new "structure" will require a re-sync and model rebuild for all projects that have already deployed cht-sync.

1yuv commented 7 hours ago

We recently added medic-users-meta to be synced on one of our deployments, without adding any new models or tests. The dbt run time increased from 4 hours to 20 hours, since all databases are still inserted into one table and all models are built from that table. I will post an update after the next run.

```
19:10:33  Finished running 5 materialized view models, 1 project hook in 0 hours 10 minutes and 4.17 seconds (604.17s).
21:47:57  Finished running 43 incremental models, 35 materialized view models, 1 table model, 1 project hook in 2 hours 37 minutes and 15.21 seconds (9435.21s).
23:44:43  Finished running 43 incremental models, 35 materialized view models, 1 table model, 1 project hook in 1 hours 55 minutes and 32.44 seconds (6932.44s).
01:39:00  Finished running 43 incremental models, 35 materialized view models, 1 table model, 1 project hook in 1 hours 53 minutes and 4.41 seconds (6784.41s).
03:42:37  Finished running 43 incremental models, 35 materialized view models, 1 table model, 1 project hook in 2 hours 2 minutes and 25.67 seconds (7345.67s).
05:51:18  Finished running 43 incremental models, 35 materialized view models, 1 table model, 1 project hook in 2 hours 7 minutes and 28.30 seconds (7648.30s).
08:27:46  Finished running 43 incremental models, 35 materialized view models, 1 table model, 1 project hook in 2 hours 35 minutes and 12.13 seconds (9312.13s).
11:54:04  Finished running 43 incremental models, 35 materialized view models, 1 table model, 1 project hook in 3 hours 25 minutes and 0.88 seconds (12300.88s).
15:25:43  Finished running 43 incremental models, 35 materialized view models, 1 table model, 1 project hook in 3 hours 30 minutes and 19.57 seconds (12619.57s).
18:20:25  Finished running 43 incremental models, 35 materialized view models, 1 table model, 1 project hook in 2 hours 53 minutes and 28.41 seconds (10408.42s).
20:47:46  Finished running 43 incremental models, 35 materialized view models, 1 table model, 1 project hook in 2 hours 26 minutes and 6.52 seconds (8766.52s).
22:39:16  Finished running 43 incremental models, 35 materialized view models, 1 table model, 1 project hook in 1 hours 50 minutes and 15.65 seconds (6615.65s).
00:32:14  Finished running 43 incremental models, 35 materialized view models, 1 table model, 1 project hook in 1 hours 51 minutes and 48.41 seconds (6708.41s).
02:28:44  Finished running 43 incremental models, 35 materialized view models, 1 table model, 1 project hook in 1 hours 55 minutes and 13.71 seconds (6913.71s).
05:14:03  Finished running 43 incremental models, 35 materialized view models, 1 table model, 1 project hook in 2 hours 44 minutes and 4.30 seconds (9844.30s).
09:40:17  Finished running 43 incremental models, 35 materialized view models, 1 table model, 1 project hook in 4 hours 24 minutes and 59.23 seconds (15899.23s).
15:06:27  Finished running 43 incremental models, 35 materialized view models, 1 table model, 1 project hook in 5 hours 24 minutes and 53.49 seconds (19493.49s).
19:20:36  Finished running 43 incremental models, 35 materialized view models, 1 table model, 1 project hook in 4 hours 12 minutes and 53.30 seconds (15173.30s).
15:22:33  Finished running 43 incremental models, 35 materialized view models, 1 table model, 1 project hook in 20 hours 0 minutes and 43.03 seconds (72043.03s).
```