manuzhang / read-it-now

Don't read it later; read it now

Delta Lake for data lake #35

Open manuzhang opened 5 years ago

manuzhang commented 5 years ago

At Spark+AI Summit 2019, Databricks announced their open-source Delta Lake project, which brings ACID transactions, schema management, and other nice features to your existing data lake.

One key feature is "Upsert" (hence the name, "delta"), which enables incremental updates to your data. Check out the blog post and slides for use cases and a deep dive.

Note that, at the time of writing, "Upsert" (the MERGE command) is not yet available in the open-source repo (Delta Lake 0.2.0).
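To get a concrete picture of what the upsert looks like, here is a minimal sketch using the `DeltaTable.merge` Scala API from later open-source releases (MERGE was not in 0.2.0); the table path `/data/events`, the key column `eventId`, and the input file are placeholder names for illustration, not part of the original post.

```scala
import io.delta.tables.DeltaTable
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("delta-upsert-sketch")
  .getOrCreate()

// Hypothetical batch of new and changed rows to fold into the table.
val updates = spark.read.json("/data/incoming/events.json")

// Upsert: update matching rows, insert the rest, in one ACID transaction.
DeltaTable.forPath(spark, "/data/events").as("target")
  .merge(updates.as("source"), "target.eventId = source.eventId")
  .whenMatched.updateAll()
  .whenNotMatched.insertAll()
  .execute()
```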

The Data Engineering Podcast interviewed Michael Armbrust, the lead architect of Delta Lake (and also of Spark SQL and Structured Streaming), about why and how Delta Lake was built, its design patterns, and other interesting stories behind the scenes.

Some interesting Q&As (edited for brevity):

Q: How did you get involved in the area of data management? A: I picked randomly between SQL Server and the C# runtime when starting an internship at Microsoft.

Q: What was the motivation for creating Delta Lake? A: It started as a conversation with an Apple engineer at Spark Summit, about solving his problem of archiving all monitoring data (PBs of data) at Apple and using it for detection and response.

Q: What are the benefits of a data lake over a data warehouse? A: I think the biggest benefits of a data lake over a data warehouse are the cost, the scale, and the effort it takes to ingest new data. It's actually great to start with these raw data sources, and then later figure out how to clean them and make them useful for insight.

Q: How do you manage schema evolution when working with large volumes of data? A: I like to think back to some rules that I learned when I was at Google, and how they evolved schemas, because they really think about these things as contracts between people. Once a column has existed, it's a bad idea to rename it and reuse that name.
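As an aside, Delta Lake's own take on this is additive schema evolution: appends that introduce new columns are rejected by default (schema enforcement), and you opt in to merging the new column rather than renaming or retyping existing ones. A minimal sketch, assuming a hypothetical table at `/data/events`:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("delta-schema-evolution-sketch")
  .getOrCreate()
import spark.implicits._

// A batch that carries a "region" column not yet present in the table schema.
val withNewColumn = Seq(("e1", "click", "us-east"))
  .toDF("eventId", "eventType", "region")

// Without mergeSchema the append fails; with it, the new column is added to
// the table schema while existing columns stay untouched.
withNewColumn.write
  .format("delta")
  .mode("append")
  .option("mergeSchema", "true")
  .save("/data/events")
```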

Q: The Lambda architecture was popular in the early days of Hadoop but seems to have fallen out of favor. How does this unified interface resolve the shortcomings and complexities of that approach? A: People started with the Lambda architecture because there wasn't a way to do streaming and batch on a single set of data while maintaining exactly-once semantics. (I really enjoy the following streaming vs. batch discussion.)

Streaming vs. batch

**Streaming is not about low latency. It's about incrementalization. It's about splitting the job up. But it's also about not having a schedule, not having to worry about dependencies amongst jobs, making sure that they run at the right time, and worrying about what happens if a job finishes late. It's about figuring out what data is new since you last processed. And it's also about failure management: what happens when your job crashes in the middle and you need to restart it? Streaming takes care of all of those things automatically. It allows you to focus only on the data flow problem, on your business logic, and not on the system's problem of how to actually run this thing efficiently when data is continually arriving. So that's why I think streaming is very powerful.**

**But in spite of all of this, batch jobs still happen. You may have retention requirements, where the business mandates that all data over two years old is deleted. You might have GDPR requirements, where some user has served you with a DSR and you have 30 days to eliminate them from all of your data lake. Or you might have change data capture coming from an existing operational store. So you have, you know, your point-of-sale data that is coming in every day, super important to your business, but you want to merge that into an analytical store, where you can run the kind of long-term longitudinal analysis that you would never want to run on your actual operational store for performance reasons.**

And so bringing these together into a single interface, not requiring you to manage two completely separate pipelines, I think drastically simplifies the work of data engineers.
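To make the "single interface" point concrete, here is a minimal sketch, again assuming a hypothetical Delta table at `/data/events`: a Structured Streaming job appends to the table continuously, while ordinary batch code queries and applies retention deletes to the very same table. The schema, paths, and the two-year retention condition are illustrative assumptions, and the `delete` API comes from later open-source releases.

```scala
import io.delta.tables.DeltaTable
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder()
  .appName("delta-unified-sketch")
  .getOrCreate()

val eventSchema = new StructType()
  .add("eventId", StringType)
  .add("eventType", StringType)
  .add("ts", TimestampType)

// Streaming side: continuously append arriving JSON files to the Delta table.
spark.readStream
  .schema(eventSchema)
  .json("/data/incoming")
  .writeStream
  .format("delta")
  .option("checkpointLocation", "/data/events/_checkpoint")
  .outputMode("append")
  .start("/data/events")

// Batch side: the same table serves ad-hoc queries...
spark.read.format("delta").load("/data/events")
  .groupBy("eventType").count()
  .show()

// ...and retention/GDPR-style deletes, all against one ACID table.
DeltaTable.forPath(spark, "/data/events")
  .delete("ts < current_timestamp() - INTERVAL 2 YEARS")
```

The streaming writer and the batch reader never coordinate directly; the Delta transaction log is what keeps both views consistent.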

Q: What are some of the problems and opportunities that you could have addressed but consciously decided not to work on? A: We learned it was a bad idea to implement our own processing in early versions. As benchmarks showed, Spark joins with whole-stage codegen performed much better.

manuzhang commented 5 years ago

Starburst Data announced Presto compatibility with Databricks Delta Lake, so Presto can work seamlessly with Delta Lake tables.