njuguna-n opened 2 months ago
There are two possible ways to go about this:
1. Have the tables created and populated by couch2pg during the first sync while dbt is not running. Once the sync has fewer than a specified number of docs left to sync, the flag can be removed and dbt can run on the existing and incoming data. This would be the fastest option, but the schema for the tables would live in two places.
2. Have the dbt container check the flag and create the tables in a non-incremental way. To keep this simple and efficient, the sync would have to be nearly complete, with only a small number of docs left to sync, so that the tables are created only once; then, once the flag is off, dbt can update the tables incrementally.
My preference is for option 1 since the main tables are being populated incrementally. I will start by exploring if there is a way to get the schema from the CHT Pipeline base table package so that we have only one source of truth for the schemas.
I did some tests on using batched dbt runs with the following results:
- With a single Couch2pg instance replicating about 700k docs at a rate of 1000 docs per batch, and the dbt batch size set to 10000 records, dbt took on average 3121 seconds to catch up and process all records. Without batching the average time was 2263 seconds.
- With 7 Couch2pg instances each replicating ~700k docs in batches of 1000, regular dbt runs were much faster than batched dbt runs. The difference varied depending on the batch size I set, but regular dbt runs were still faster since there was no limit on the number of rows processed in each run.
- I also tried with 20 Couch2pg instances and added a delay of about 5 minutes between dbt runs to simulate models taking long to complete, but even in that scenario regular dbt runs were often faster.
My conclusion from this is that in 99% of cases just using incremental tables with dbt runs as they are will be sufficient and performant. In rare cases like MoH Kenya, where we have a lot of instances with millions of docs and constrained database resources, batched dbt runs would definitely help, but we would be trading off speed for resource conservation. In that case we can run dbt in batches by setting a flag as an environment variable. I have updated this PR to reflect this approach.
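To illustrate the batching idea (this is only a rough sketch, not the code in the linked PRs; the source name, columns, and the dbt_batch_size variable are placeholders), an incremental model could cap the number of rows processed per run with a dbt variable:

-- Rough sketch of a row-limited incremental model (illustrative only;
-- the source/column names and the dbt_batch_size variable are placeholders,
-- not the actual CHT Pipeline model or the approach used in the linked PRs).
{{ config(materialized='incremental', unique_key='_id') }}

SELECT _id, saved_timestamp, doc
FROM {{ source('couchdb', 'medic') }}
{% if is_incremental() %}
WHERE saved_timestamp > (SELECT coalesce(max(saved_timestamp), '1900-01-01') FROM {{ this }})
{% endif %}
ORDER BY saved_timestamp
{% if var('dbt_batch_size', none) is not none %}
LIMIT {{ var('dbt_batch_size') }}
{% endif %}

Each run would then process at most dbt_batch_size rows, e.g. when invoked with dbt run --vars '{dbt_batch_size: 10000}', and repeated runs would catch up in bounded chunks.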
@njuguna-n dbt 1.9 has a new feature for microbatching incremental models. Could this new feature solve the problem this ticket describes?
Additionally, you can see more info in this Coalesce presentation, starting from minute 25.
@andrablaj Yes, this feature can solve the problem because the incremental model is loaded in batches. The only drawback I see with this incremental strategy is that the batch size is currently only configurable as a day, so depending on the Couch2pg sync rate and the available Postgres resources, the number of documents to be processed per batch might still cause an issue.
I will test this config today and assess whether the CHT Sync and CHT Pipeline PRs for this issue are still required.
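For context, a microbatch model configuration would look roughly like the sketch below; the event_time column, begin date, and source/column names are placeholders rather than the actual CHT Pipeline model:

-- Rough sketch of a dbt 1.9 microbatch incremental model (illustrative only;
-- event_time column, begin date, and source/column names are placeholders).
{{ config(
    materialized='incremental',
    incremental_strategy='microbatch',
    event_time='saved_timestamp',
    batch_size='day',
    begin='2024-01-01',
    lookback=1
) }}

SELECT _id, saved_timestamp, doc
FROM {{ source('couchdb', 'medic') }}

With this strategy dbt splits the run into one batch per day based on the event_time column and, provided the upstream source also has an event_time configured, automatically filters each batch to that day's rows.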
Testing microbatching with a beta release of dbt 1.9 and this branch of CHT Pipeline I get the error below.
2024-10-14 18:36:57 15:36:57 Running with dbt=1.9.0-b1
2024-10-14 18:36:57 15:36:57 Registered adapter: postgres=1.9.0-b1
2024-10-14 18:36:57 15:36:57 Found 9 models, 27 data tests, 1 operation, 1 source, 693 macros, 8 unit tests
2024-10-14 18:36:57 15:36:57 The selection criterion 'state:modified' does not match any enabled nodes
2024-10-14 18:36:57 15:36:57 Nothing to do. Try checking your model configs and model specification args
2024-10-14 18:36:59 15:36:59 Running with dbt=1.9.0-b1
2024-10-14 18:37:00 15:37:00 Registered adapter: postgres=1.9.0-b1
2024-10-14 18:37:00 15:37:00 Found 9 models, 27 data tests, 1 operation, 1 source, 693 macros, 8 unit tests
2024-10-14 18:37:00 15:37:00
2024-10-14 18:37:00 15:37:00 Concurrency: 1 threads (target='default')
2024-10-14 18:37:00 15:37:00
2024-10-14 18:37:00 15:37:00 1 of 9 START sql incremental model v1.dbt_results .............................. [RUN]
2024-10-14 18:37:00 15:37:00 1 of 9 OK created sql incremental model v1.dbt_results ......................... [INSERT 0 0 in 0.11s]
2024-10-14 18:37:00 15:37:00 2 of 9 START sql microbatch model v1.document_metadata ......................... [RUN]
2024-10-14 18:37:00 15:37:00 1 of 1 START batch 2024-10-14 of v1.document_metadata .......................... [RUN]
2024-10-14 18:37:00 15:37:00 1 of 1 ERROR creating batch 2024-10-14 of v1.document_metadata ................. [ERROR in 0.00s]
2024-10-14 18:37:00 15:37:00 Encountered an error while running operation: Compilation Error
2024-10-14 18:37:00 'dbt.artifacts.resources.v1.components.ColumnInfo object' has no attribute 'get'
2024-10-14 18:37:00
2024-10-14 18:37:00 > in macro get_empty_schema_sql (macros/adapters/columns.sql)
2024-10-14 18:37:00 > called by macro assert_columns_equivalent (macros/relations/column/columns_spec_ddl.sql)
2024-10-14 18:37:00 > called by macro default__get_assert_columns_equivalent (macros/relations/column/columns_spec_ddl.sql)
2024-10-14 18:37:00 > called by macro get_assert_columns_equivalent (macros/relations/column/columns_spec_ddl.sql)
2024-10-14 18:37:00 > called by macro postgres__create_table_as (macros/adapters.sql)
2024-10-14 18:37:00 > called by macro create_table_as (macros/relations/table/create.sql)
2024-10-14 18:37:00 > called by macro default__get_create_table_as_sql (macros/relations/table/create.sql)
2024-10-14 18:37:00 > called by macro get_create_table_as_sql (macros/relations/table/create.sql)
2024-10-14 18:37:00 > called by macro materialization_incremental_default (macros/materializations/models/incremental/incremental.sql)
2024-10-14 18:37:00 > called by <Unknown>
@njuguna-n per this discussion, the error above might happen because the dbt-core version is ahead of the postgres adapter version (Registered adapter: postgres=1.9.0-b1 might not exist yet). Have you tried forcing a lower version of the postgres adapter?
Actually, there is a dbt-postgres pre-release 1.9.0-b1. So nevermind.
I am using 1.9.0b1 for both dbt-core and dbt-postgres.
I have not been able to get the document_metadata model to build successfully using the new microbatch incremental strategy with the pre-release versions of dbt-core and dbt-postgres mentioned above. This is still worth pursuing and testing again once the generally available 1.9 release is out.
Batching is currently an enhancement that will not be required in most CHT deployments. Additionally, the current solution involves code changes in both the CHT Sync and CHT Pipeline repos, with a hard-coded table name that makes future base model updates slightly more fragile. With incremental microbatching we would limit the required changes to only the models defined in the CHT Pipeline repo. No CHT or CHT Sync deployment is actively waiting for this improvement right now, so I suggest we hold off on merging the two linked PRs and close them once we confirm incremental microbatching works as expected.
I tried the dbt-core 1.9.0-b1 version with the test-microbatch-model branch and a local CouchDB (to which I generated dummy data with the test-data-generator), and I didn't get errors:
2024-10-18 17:50:32 17:50:32 Running with dbt=1.9.0-b1
2024-10-18 17:50:32 17:50:32 Registered adapter: postgres=1.9.0-b1
2024-10-18 17:50:32 17:50:32 Found 9 models, 27 data tests, 1 operation, 1 source, 693 macros, 8 unit tests
2024-10-18 17:50:32 17:50:32 The selection criterion 'state:modified' does not match any enabled nodes
2024-10-18 17:50:32 17:50:32 Nothing to do. Try checking your model configs and model specification args
2024-10-18 17:50:33 17:50:33 Running with dbt=1.9.0-b1
2024-10-18 17:50:33 17:50:33 Registered adapter: postgres=1.9.0-b1
2024-10-18 17:50:33 17:50:33 Found 9 models, 27 data tests, 1 operation, 1 source, 693 macros, 8 unit tests
2024-10-18 17:50:33 17:50:33
2024-10-18 17:50:33 17:50:33 Concurrency: 1 threads (target='default')
2024-10-18 17:50:33 17:50:33
2024-10-18 17:50:33 17:50:33 1 of 9 START sql incremental model v1.dbt_results .............................. [RUN]
2024-10-18 17:50:33 17:50:33 1 of 9 OK created sql incremental model v1.dbt_results ......................... [INSERT 0 0 in 0.12s]
2024-10-18 17:50:33 17:50:33 2 of 9 START sql microbatch model v1.document_metadata ......................... [RUN]
2024-10-18 17:50:33 17:50:33 2 of 9 OK created sql microbatch model v1.document_metadata .................... [MERGE 49 in 0.10s]
2024-10-18 17:50:33 17:50:33 3 of 9 START sql incremental model v1.contact .................................. [RUN]
2024-10-18 17:50:33 17:50:33 3 of 9 OK created sql incremental model v1.contact ............................. [INSERT 0 4 in 0.07s]
2024-10-18 17:50:33 17:50:33 4 of 9 START sql incremental model v1.data_record .............................. [RUN]
2024-10-18 17:50:34 17:50:34 4 of 9 OK created sql incremental model v1.data_record ......................... [INSERT 0 0 in 0.07s]
2024-10-18 17:50:34 17:50:34 5 of 9 START sql incremental model v1.user ..................................... [RUN]
2024-10-18 17:50:34 17:50:34 5 of 9 OK created sql incremental model v1.user ................................ [INSERT 0 1 in 0.06s]
2024-10-18 17:50:34 17:50:34 6 of 9 START sql incremental model v1.contact_type ............................. [RUN]
2024-10-18 17:50:34 17:50:34 6 of 9 OK created sql incremental model v1.contact_type ........................ [INSERT 0 4 in 0.05s]
2024-10-18 17:50:34 17:50:34 7 of 9 START sql incremental model v1.person ................................... [RUN]
2024-10-18 17:50:34 17:50:34 7 of 9 OK created sql incremental model v1.person .............................. [INSERT 0 2 in 0.09s]
2024-10-18 17:50:34 17:50:34 8 of 9 START sql incremental model v1.place .................................... [RUN]
2024-10-18 17:50:34 17:50:34 8 of 9 OK created sql incremental model v1.place ............................... [INSERT 0 2 in 0.06s]
2024-10-18 17:50:34 17:50:34 9 of 9 START sql incremental model v1.patient .................................. [RUN]
2024-10-18 17:50:34 17:50:34 9 of 9 OK created sql incremental model v1.patient ............................. [INSERT 0 2 in 0.06s]
2024-10-18 17:50:34 17:50:34
2024-10-18 17:50:34 17:50:34 1 of 1 START hook: cht_pipeline_base.on-run-end.0 .............................. [RUN]
2024-10-18 17:50:34 17:50:34 1 of 1 OK hook: cht_pipeline_base.on-run-end.0 ................................. [OK in 0.03s]
2024-10-18 17:50:34 17:50:34
2024-10-18 17:50:34 17:50:34 Finished running 9 incremental models, 1 project hook in 0 hours 0 minutes and 0.78 seconds (0.78s).
2024-10-18 17:50:34 17:50:34
2024-10-18 17:50:34 17:50:34 Completed successfully
2024-10-18 17:50:34 17:50:34
2024-10-18 17:50:34 17:50:34 Done. PASS=10 WARN=0 ERROR=0 SKIP=0 TOTAL=10
The only change I made in this repo was updating the Dockerfile to use the latest version of dbt-core via:
RUN pip install --upgrade cffi \
&& pip install cryptography~=3.4 \
&& pip install dbt-core==1.9.0b1 dbt-postgres==1.9.0b1
Is there any extra configuration that you had locally, @njuguna-n? What was your setup?
I have just tried it again with this branch and it worked just like in your test. I am not sure what was different about my setup, but I will try to recreate it once I am back on Monday (28th October). One thing I have noted is that the microbatch models are handled differently: the failed run executes a batch for a specific date (START batch 2024-10-14 of v1.document_metadata), but the successful one does not log the date (START sql microbatch model v1.document_metadata).
It would be helpful to understand your setup:
@andrablaj I have tried to reproduce the error I got with no luck. I have changed the saved_timestamp field in the medic table to have different timestamps on different days and months, and also tried various Python versions for the dbt container. I have also tried from a fresh database and with the tables already created, and it all works. We still need to wait for the official 1.9 release, but I am happy to go with this approach as opposed to the custom batching logic.
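For reference, spreading the timestamps for this test could be done with ad-hoc SQL along these lines (a rough sketch only; the table name and the 90-day window are assumptions, not the exact statements I ran):

-- Rough sketch: spread saved_timestamp values across the past ~90 days so that
-- the microbatch strategy produces multiple daily batches. The table name and
-- the 90-day window are assumptions.
UPDATE medic
SET saved_timestamp = saved_timestamp - (random() * interval '90 days');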
What feature do you want to improve?
When syncing data for the first time, couch2pg does a good job of copying over a lot of data quickly, but that means dbt struggles to keep up and can end up trying to load millions of rows incrementally, which has led to issues such as this one.

Describe the improvement you'd like
When the sync is running for the first time, a flag should be set so that either dbt or couch2pg can create the main base tables, i.e. document_metadata, contact, and data_record. These tables hold the majority of the data and take the most time during dbt runs, so pre-populating them would reduce the amount of time required for a sync to complete.

Describe alternatives you've considered
We have manually updated the document_metadata table and deleted some documents, but these were temporary measures that should not be done for all production deployments.