smart-on-fhir / cumulus-etl

Extract FHIR data, Transform with NLP and DEID tools, and then Load FHIR data into a SQL Database for analysis
https://docs.smarthealthit.org/cumulus/etl
Apache License 2.0

Support run-to-run deltas #75

Open mikix opened 1 year ago

mikix commented 1 year ago

Currently, each time we run, we read all data and rewrite all output data. This increases run time and puts unnecessary stress on bulk export servers.

Presumably, we could use the `_since` parameter for bulk exports to capture everything since the last run.
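As a rough sketch of what that could look like, here's a minimal helper for building a bulk export kick-off URL with an optional `_since` (the function name and base URL are hypothetical, and a real kick-off also needs the `Accept` and `Prefer: respond-async` headers):

```python
from typing import Optional
from urllib.parse import urlencode

def kickoff_url(base_url: str, since: Optional[str] = None) -> str:
    """Build a system-level bulk export kick-off URL, optionally scoped with _since."""
    params = {"_outputFormat": "application/fhir+ndjson"}
    if since:
        # _since is a FHIR instant, e.g. "2023-01-01T00:00:00Z"
        params["_since"] = since
    return f"{base_url}/$export?{urlencode(params)}"

# First run: full export. Later runs: only resources updated since last time.
full = kickoff_url("https://example.com/fhir")
delta = kickoff_url("https://example.com/fhir", since="2023-01-01T00:00:00Z")
```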

Considerations when fixing this:

mikix commented 1 year ago

OK here's my stab at a comparison between the big-three ACID data lake formats: Delta Lake, Hudi, and Iceberg.

Features

Our needs are fairly modest. We want to occasionally atomically upsert (update/insert plus we'll also need delete support) into a data lake. We likely write less often than we read, and our reads are complicated.
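To pin down what "atomically upsert plus delete" means for us, here are the merge semantics sketched in plain Python, keyed by a primary key (illustrative only; the whole point of a lake format is that it does this atomically and at scale):

```python
def apply_delta(table: dict, upserts: list, deletes: set) -> dict:
    """Sketch of the merge we need: upsert rows by id, then drop deleted ids.

    Purely illustrative -- the real operation must be a single atomic
    transaction inside the lake format, not an in-memory dict rewrite.
    """
    merged = dict(table)
    for row in upserts:
        merged[row["id"]] = row   # insert new rows, overwrite existing ones
    for key in deletes:
        merged.pop(key, None)     # delete if present
    return merged

table = {"a": {"id": "a", "val": 1}, "b": {"id": "b", "val": 2}}
new = apply_delta(table, upserts=[{"id": "b", "val": 3}, {"id": "c", "val": 4}], deletes={"a"})
# "b" was updated, "c" inserted, "a" deleted
```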

All three ACID lakes have extremely similar features and are converging on each other. See this comparison article with some history in it -- but honestly, just google around for acid/incremental/upsert data lake or the product names and you'll find many comparison articles, all describing fairly minor differences (to my naive eyes).

They all support copy-on-write (slow writes, but fast reads); Hudi additionally offers merge-on-read. Copy-on-write is fine for us: we want our Athena reads to be performant, since they happen a lot.

They are all open-source projects, growing quickly, managed by either the Apache or Linux Foundation. Delta Lake does, however, tend to lag behind the proprietary Delta Engine commercial offering from Databricks.

The most interesting feature for us that I saw was Iceberg's automatic partitioning. That means better performance without having to manually determine partitions.

Python

Cumulus ETL is Python based. All these ACID lakes are Spark (Java) based. There is PySpark, but it just downloads some jar files behind the scenes.

Hudi doesn't look like it has any native Python bindings coming down the pipeline at all. But both Delta Lake and Iceberg have up-and-coming (though not feature-complete) native Python implementations (meaning no Java).

Delta Lake: 1st-party Rust implementation with a Python wrapper. Currently missing merges, which is a critical feature for us. It also has a Spark-based wrapper that adds a tiny amount of syntactic sugar over bare PySpark.

Iceberg: 1st-party pure Python implementation. Currently missing merge support too.

So I'm much more interested in either Delta Lake or Iceberg, all else being equal, because we could eventually drop the behind-the-scenes Java piece of the pipeline, which, while not an active problem, is certainly odd and makes things hard to debug when problems do come up.

Cloud Support

We'd like to avoid being too dependent on one cloud platform (so no managed solutions), while also being mindful that future expansion to platforms beyond our current target of AWS shouldn't be too difficult.

AWS: It seems to support all three (as well as a managed Lake Formation). It has much more documentation for Iceberg, which makes me think Iceberg is better supported there, but Delta Lake and Hudi get mentions too.

Azure: They seem to be completely a Delta Lake shop, with their big data lake offering even being called Azure Databricks. The most documentation I could find about Iceberg there was how to convert it into Delta Lake.

Conclusion

Disclaimer: I am not an expert in this domain and may have missed important facts. But assuming they all have comparable features, I'm inclined to disregard Hudi for its lack of a native Python solution, and to favor Delta Lake to make eventually supporting Azure easier.

Thoughts from folks more knowledgeable?

(Also, nothing says we couldn't grow support for multiple data lakes in the future. I'm just focused on what to build first and primarily support.)

mikix commented 1 year ago

Regarding ACID lakes: we did end up going with Delta Lake, and that part of this story is resolved. The remaining bits of this issue are about the `_since` parameter and growing fancier logic to only do what we need to do, automatically (rather than offloading that to the user).