google / fhir-data-pipes

A collection of tools for extracting FHIR resources and analytics services on top of that data.
https://google.github.io/fhir-data-pipes/
Apache License 2.0

Feature Request: Requires Delta Data to be retained in incremental Snapshots #1013

Open Charantl opened 6 months ago

Charantl commented 6 months ago

The incremental pipeline reads delta data from the source and merges it with the full-load data in the filesystem.

On each incremental pipeline run, all of the data in the current DWH is scanned and merged with the delta. This merge overwrites version histories with the latest values. Reading all of the existing data on every run also adds overhead, and that read could become expensive as the data grows (especially in cloud storage).

Also, an option to mitigate the ever-growing files through a data compaction job would be beneficial.
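
To make the concern concrete, here is a minimal sketch of the two behaviours. It is not the pipeline's actual code: the record shape and the function names `merge_latest` and `append_delta` are hypothetical, and in-memory lists stand in for Parquet files.

```python
from typing import Dict, List, Tuple

# Hypothetical record shape: (resource_id, version, payload).
Record = Tuple[str, int, str]


def merge_latest(existing: List[Record], delta: List[Record]) -> List[Record]:
    """Merge-style update: scan everything, keep only the newest version per id."""
    latest: Dict[str, Record] = {}
    for rec in existing + delta:  # every existing record is read on every run
        rid, version, _ = rec
        if rid not in latest or version > latest[rid][1]:
            latest[rid] = rec
    return sorted(latest.values())


def append_delta(existing: List[Record], delta: List[Record]) -> List[Record]:
    """Append-style update: existing records are not rewritten, so history is kept."""
    return existing + delta


snapshot = [("Patient/1", 1, "a"), ("Patient/2", 1, "b")]
delta = [("Patient/1", 2, "a2")]

merged = merge_latest(snapshot, delta)    # version 1 of Patient/1 is dropped
appended = append_delta(snapshot, delta)  # both versions of Patient/1 are retained
```

With the merge, the earlier version of `Patient/1` no longer appears in the result, and the cost of each run grows with the size of the existing data. With the append, the delta is retained as-is, at the price of files that keep growing, which is where a compaction job would come in.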

bashir2 commented 4 months ago

Following the offline conversations, here are a few points about this for posterity:

To summarize: the work to be done here is to add an option to disable the merge for some resource types and to avoid copying their Parquet files into each DWH snapshot (re-using a single copy instead).
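
One possible reading of this summary, as a rough sketch only: the function `plan_snapshot`, its parameters, and all file paths are hypothetical and are not part of the project. It assumes a snapshot can be described as a list of Parquet file paths per resource type.

```python
from typing import Dict, List, Set


def plan_snapshot(
    previous: Dict[str, List[str]],
    delta_files: Dict[str, str],
    snapshot_dir: str,
    no_merge_types: Set[str],
) -> Dict[str, List[str]]:
    """Return the Parquet files that make up the next snapshot, per resource type."""
    plan: Dict[str, List[str]] = {}
    for rtype in sorted(set(previous) | set(delta_files)):
        old = previous.get(rtype, [])
        new = delta_files.get(rtype)
        if rtype in no_merge_types:
            # Merge disabled: reference the existing files in place (no copy)
            # and add the delta file next to them.
            plan[rtype] = old + ([new] if new else [])
        else:
            # Merge enabled: a single rewritten file inside the new snapshot.
            plan[rtype] = [f"{snapshot_dir}/{rtype}/merged.parquet"]
    return plan


previous = {
    "Observation": ["shared/Observation/part-0.parquet"],
    "Patient": ["dwh_1/Patient/merged.parquet"],
}
delta_files = {
    "Observation": "shared/Observation/part-1.parquet",
    "Patient": "delta/Patient.parquet",
}
plan = plan_snapshot(previous, delta_files, "dwh_2", no_merge_types={"Observation"})
```

In this sketch `Observation` keeps pointing at the shared files and only gains the new delta file, while `Patient` is still merged and rewritten into the new snapshot directory.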