Open 1yuv opened 3 weeks ago
This is a serious shift from the way cht-sync works now and the way couch2pg worked before, and would likely require a significant overhaul of how data is managed.
I don't think this is a straight-forward technical issue, because one solution requires having multiple data source tables in Postgres, one for each medic database, or at least keep medic separate from the others. This would mean that any query that relies on all data being present in a single source would break. I think keeping duplicate data is problematic for disk space, so having both - one common table and several separate tables - is probably very costly.
Another solution could be to store which database which document was copied from, and update the base models to exclude non-medic docs. This would still mean that base models would run over all documents, but subsequent modules would not. We would need to quantify how big of a performance improvement this yields.
Another solution could be to store which database which document was copied from
Seems like the simplest solution, and I don't think it will be a significant performance issue to have more rows in the source table which are then ignored in all downstream models.
because one solution requires having multiple data source tables in Postgres
Is it not possible on cht-sync to store data in two separate tables? As under the hood cht-sync also uses cht-couch2pg to sync data, is it not possible to insert medic
database's data to one table and medic-users-meta
or any other database's data to another table?
I think keeping duplicate data is problematic for disk space, so having both - one common table and several separate tables - is probably very costly.
I am not suggesting storing data duplicately. The main concern of this issue is to run dbt models that are required to run or that are useful to run on that data set. It does not make any sense running models like contact or patient on database other than medic
.
Is it not possible on cht-sync to store data in two separate tables? As under the hood cht-sync also uses cht-couch2pg to sync data, is it not possible to insert medic database's data to one table and medic-users-meta or any other database's data to another table?
If anyone wrote models or queries directly against the main database, those would be broken. So it's a potentially breaking change.
Not to mention that migrating to the new "structure" will require a re-sync and model rebuild for all projects that have already deployed cht-sync.
We added medic-users-meta
to be synced recently on one of our deployment without adding any new models or tests. The dbt run time increased from 4 hours to 20 hours. Since all databases are still inserted into one table and all models are made from that table. I will post update after next run.
19:10:33 Finished running 5 materialized view models, 1 project hook in 0 hours 10 minutes and 4.17 seconds (604.17s).
21:47:57 Finished running 43 incremental models, 35 materialized view models, 1 table model, 1 project hook in 2 hours 37 minutes and 15.21 seconds (9435.21s).
23:44:43 Finished running 43 incremental models, 35 materialized view models, 1 table model, 1 project hook in 1 hours 55 minutes and 32.44 seconds (6932.44s).
01:39:00 Finished running 43 incremental models, 35 materialized view models, 1 table model, 1 project hook in 1 hours 53 minutes and 4.41 seconds (6784.41s).
03:42:37 Finished running 43 incremental models, 35 materialized view models, 1 table model, 1 project hook in 2 hours 2 minutes and 25.67 seconds (7345.67s).
05:51:18 Finished running 43 incremental models, 35 materialized view models, 1 table model, 1 project hook in 2 hours 7 minutes and 28.30 seconds (7648.30s).
08:27:46 Finished running 43 incremental models, 35 materialized view models, 1 table model, 1 project hook in 2 hours 35 minutes and 12.13 seconds (9312.13s).
11:54:04 Finished running 43 incremental models, 35 materialized view models, 1 table model, 1 project hook in 3 hours 25 minutes and 0.88 seconds (12300.88s).
15:25:43 Finished running 43 incremental models, 35 materialized view models, 1 table model, 1 project hook in 3 hours 30 minutes and 19.57 seconds (12619.57s).
18:20:25 Finished running 43 incremental models, 35 materialized view models, 1 table model, 1 project hook in 2 hours 53 minutes and 28.41 seconds (10408.42s).
20:47:46 Finished running 43 incremental models, 35 materialized view models, 1 table model, 1 project hook in 2 hours 26 minutes and 6.52 seconds (8766.52s).
22:39:16 Finished running 43 incremental models, 35 materialized view models, 1 table model, 1 project hook in 1 hours 50 minutes and 15.65 seconds (6615.65s).
00:32:14 Finished running 43 incremental models, 35 materialized view models, 1 table model, 1 project hook in 1 hours 51 minutes and 48.41 seconds (6708.41s).
02:28:44 Finished running 43 incremental models, 35 materialized view models, 1 table model, 1 project hook in 1 hours 55 minutes and 13.71 seconds (6913.71s).
05:14:03 Finished running 43 incremental models, 35 materialized view models, 1 table model, 1 project hook in 2 hours 44 minutes and 4.30 seconds (9844.30s).
09:40:17 Finished running 43 incremental models, 35 materialized view models, 1 table model, 1 project hook in 4 hours 24 minutes and 59.23 seconds (15899.23s).
15:06:27 Finished running 43 incremental models, 35 materialized view models, 1 table model, 1 project hook in 5 hours 24 minutes and 53.49 seconds (19493.49s).
19:20:36 Finished running 43 incremental models, 35 materialized view models, 1 table model, 1 project hook in 4 hours 12 minutes and 53.30 seconds (15173.30s).
15:22:33 Finished running 43 incremental models, 35 materialized view models, 1 table model, 1 project hook in 20 hours 0 minutes and 43.03 seconds (72043.03s).
Describe the issue It looks like currently dbt models are run on all the couchdb databases synced. While there's an issue in cht-sync syncing multiple databases (#165 ), if you specify any database other than
medic
, it would still sync. Couchdb has separated databases for different types of data storing, and running default dbt models on all databases is not helpful and this can actualy cause a performance problem.Describe the improvement you'd like We should separate the couchdb database and models we want to run on them. For example, the dbt models you want to run on
medic
will be totally different frommedic-users-meta
ormedic-sentinel
databases.This is critical also from the perfromance point of view. There's no point in running the dbt models where it's not going to make any changes.
medic-users-meta
andmedic-sentinel
dbs could be as big asmedic
and runningmedic
's models on those databases is not helpful.Related: https://github.com/medic/cht-pipeline/issues/168 is related to this as it looks as of now we only have models defined for
medic
database.