Open Charantl opened 6 months ago
Following the offline conversations, here are a few points on this for posterity:
To summarize: the work to be done here is to add an option to disable the merge for some resource types, and to stop copying their Parquet files into each DWH snapshot (re-using a single copy instead).
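A minimal sketch of what the summary above could look like as a per-resource-type plan. Everything here is hypothetical and for illustration only: `SnapshotPlan`, `plan_snapshot`, the `no_merge_types` option, and the directory layout are assumptions, not existing pipeline names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SnapshotPlan:
    resource_type: str
    merge: bool            # whether the incremental run merges this type
    parquet_location: str  # where the snapshot reads this type's Parquet files


def plan_snapshot(resource_types, no_merge_types, snapshot_dir, shared_dir):
    """Decide, per resource type, whether to merge-and-copy or re-use one copy.

    Types listed in `no_merge_types` (a hypothetical config option) skip the
    merge and point at a single shared location instead of a per-snapshot copy.
    """
    plans = []
    for resource_type in sorted(resource_types):
        if resource_type in no_merge_types:
            # No merge, no copy: every snapshot re-uses the shared files.
            plans.append(
                SnapshotPlan(resource_type, False, f"{shared_dir}/{resource_type}")
            )
        else:
            # Current behaviour: merged output is written into the new snapshot.
            plans.append(
                SnapshotPlan(resource_type, True, f"{snapshot_dir}/{resource_type}")
            )
    return plans
```

For example, `plan_snapshot({"Patient", "Observation"}, {"Observation"}, "dwh/snapshot_2", "dwh/shared")` would keep merging `Patient` into the new snapshot while `Observation` points at the shared directory.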
The incremental pipeline reads delta data from the source and merges it with the full-load data already on the filesystem.
On each incremental pipeline run, all data in the current DWH is scanned and merged. This merge overwrites version history with the latest values. Reading all of the existing data on every run is also an overhead, and that read could become expensive as the data grows (especially in cloud storage).
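To make the cost and the history loss concrete, here is a simplified sketch of the merge behaviour described above. It uses in-memory records rather than Parquet files, and the function name and the `id`/`version` fields are assumptions for illustration, not the pipeline's actual code.

```python
def merge_incremental(existing, delta):
    """Merge delta records into the existing data, keeping one row per id."""
    merged = {}
    # Full scan: every existing record is read on every run,
    # regardless of how small the delta is.
    for record in existing:
        merged[record["id"]] = record
    # The latest version replaces the older one, so earlier versions are lost.
    for record in delta:
        current = merged.get(record["id"])
        if current is None or record["version"] >= current["version"]:
            merged[record["id"]] = record
    return list(merged.values())


existing = [
    {"id": "p1", "version": 1, "name": "A"},
    {"id": "p2", "version": 1, "name": "B"},
]
delta = [{"id": "p1", "version": 2, "name": "A-updated"}]
result = merge_incremental(existing, delta)
```

After this run, `result` holds one record per id, and version 1 of `p1` is no longer present. The work done is proportional to the size of `existing`, not to the size of `delta`.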
Also, an option to mitigate the ever-growing files through a data compaction job would be beneficial.