medic / cht-sync

Data synchronization between CouchDB and PostgreSQL for the purpose of analytics.
GNU General Public License v3.0

feat(#78): full refresh on changed objects, only incremental runs continuously #94

Closed witash closed 1 month ago

witash commented 2 months ago

@njuguna-n and @lorerod what do you think about this:

This change addresses the issue of views being dropped during the continuous update, and the related issue of updating changed incremental tables (https://github.com/medic/cht-pipeline/issues/78). dbt has a selector for changed models only. The complication is that it requires a manifest file to know what has changed. dataemon is already using the database to save metadata about the package, so this seems like a natural place to also store the manifest, although it is a bit messy since it's a big JSON document.

So, whenever the dbt container starts, it loads the last manifest from the db into the old_manifest directory. Then it generates the new manifest and saves that back to the db. Then it runs dbt using the "state:modified" selector and the old manifest to do a full refresh of anything that has changed since the last run. It also runs using the "config.materialized:view" selector to make sure any views that have not changed, but still need to be created, are up to date. During the loop where it runs continuously, it excludes views and only updates the incremental tables.
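
As a rough illustration of the run sequence described above (not the code in this PR), the dbt invocations could be assembled roughly like this. The function names are hypothetical, and loading and saving the manifest in the database is left out. The selectors and the old_manifest directory come from the description, while `--state`, `--full-refresh` and `--exclude` are standard dbt CLI options.

```python
from typing import List

# Directory the previous manifest is loaded into (name taken from the description).
OLD_STATE_DIR = "old_manifest"


def startup_commands(state_dir: str = OLD_STATE_DIR) -> List[List[str]]:
    """dbt commands to run once, when the container starts (hypothetical helper)."""
    return [
        # Full refresh of every model that differs from the previous manifest.
        ["dbt", "run", "--select", "state:modified",
         "--state", state_dir, "--full-refresh"],
        # Create or update views, including the ones that have not changed.
        ["dbt", "run", "--select", "config.materialized:view"],
    ]


def loop_command() -> List[str]:
    """dbt command for each pass of the continuous loop (hypothetical helper)."""
    # Views are excluded, so only the non-view (incremental) models are updated.
    return ["dbt", "run", "--exclude", "config.materialized:view"]
```

Each command list could then be passed to something like `subprocess.run(cmd, check=True)`. The first run, when no previous manifest exists yet, would presumably need separate handling.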

This will simplify the design of models. Because the views are no longer dropped, it removes the requirement for dashboards to only read from materialized tables, and means the duration of the dbt run to update incremental tables only affects the incremental tables themselves.

If we can use views effectively, we can shift the strategy for what should be an incremental table and what should be a view to what dbt recommends: start with views, and only convert things to incremental tables when performance becomes a problem or when downstream models require indexes. Having everything be incremental tables could also work, but it means that every model requires additional logic, can become out of date, and requires a lengthy full refresh when it's changed.

njuguna-n commented 2 months ago

@witash this is a great idea. I will finish my review on Monday if that's okay.