pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

ENH: Delta Lake file format support #40573

Closed · Sutyke closed this 2 years ago

Sutyke commented 3 years ago

I'd love it if Pandas could support Databricks' Delta Lake file format (https://github.com/delta-io/delta). It's a type of versioned parquet file format that supports updates/inserts/deletions.

In the past, Delta Lake ran only on Spark, but there are now connectors that do not require Spark:

https://pypi.org/project/deltalake/
https://github.com/delta-io/connectors
https://github.com/delta-io/delta-rs/tree/main/python

Liam3851 commented 3 years ago

Someone else can chime in, but I imagine the first step here would be for Databricks to write a connector to Apache Arrow. Pandas uses Arrow for Parquet support, so if Arrow were to support Delta, this would be a simpler addition. You may wish to suggest this to them.

That said, this might not be possible, at least for Delta's primary use case of ACID transaction support. The current connectors appear to assume something that uses or can use a Hive metastore (e.g. Spark, Hive, Presto, Athena) to preserve the atomicity guarantees of the Delta Lake format: Delta Lake rewrites the Hive metastore files atomically to preserve its ACID guarantees. To my knowledge, Arrow itself is not Hive-metastore aware and so would not be a candidate for this; pandas certainly is not Hive-metastore aware. Pandas or Arrow could in theory read the underlying files as with Parquet, but since the point of Delta is to support ACID transactions, the metastore is crucial in a way that isn't true of a standard Parquet table. So if you need that support, you're back to using something that can use a Hive metastore to read the files, i.e. Spark.

houqp commented 3 years ago

@Sutyke doesn't https://github.com/delta-io/delta-rs/tree/main/python#usage (i.e. https://pypi.org/project/deltalake/) already solve this use case? See the examples in that readme on how to load a Delta table into a pandas DataFrame. Delta-rs natively reads Delta tables without the need for any external dependencies or frameworks like the JVM, Hive, or Spark. It also preserves ACID guarantees during reads/writes. The core of that deltalake PyPI package is a full cross-platform Delta Lake implementation in pure Rust.
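
A minimal sketch of that read path, assuming the deltalake package is installed and a Delta table already exists at the (hypothetical) path shown:

```python
# Read a Delta table into a pandas DataFrame with the deltalake package
# (the delta-rs Python bindings). Assumes `pip install deltalake` and an
# existing Delta table at the hypothetical path below.
from deltalake import DeltaTable

dt = DeltaTable("/path/to/delta_table")
df = dt.to_pandas()

# Time travel: load an earlier version of the same table.
df_v0 = DeltaTable("/path/to/delta_table", version=0).to_pandas()
```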

https://github.com/delta-io/connectors, on the other hand, requires the JVM and Hive. It is meant for accessing Delta tables outside of Spark from other JVM-based big data query engines.

houqp commented 3 years ago

Perhaps we could add https://pypi.org/project/deltalake/ as an optional extra dependency to pandas itself to make deltalake support work out of the box for pandas users?
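
As a rough illustration only (pandas has no read_deltalake function), a wrapper built on that optional dependency could follow the pattern pandas already uses for other optional IO backends:

```python
# Hypothetical sketch: this function does not exist in pandas. It shows how
# deltalake could be wrapped as an optional dependency, mirroring the
# import_optional_dependency pattern pandas uses for other IO backends.
from pandas.compat._optional import import_optional_dependency


def read_deltalake(path, version=None, columns=None):
    """Load a Delta Lake table into a DataFrame (illustrative only)."""
    deltalake = import_optional_dependency("deltalake")
    table = deltalake.DeltaTable(path, version=version)
    return table.to_pandas(columns=columns)
```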

Sutyke commented 3 years ago

@houqp adding it as an extra dependency in some way would be a great idea. @Liam3851 Currently, for example, pandas.read_parquet has an engine parameter. Since Delta Lake tables are actually Parquet files, what about adding an extra deltalake engine?

If engine is set to deltalake, it would be used to read Parquet files in Delta Lake format: engine : {‘auto’, ‘pyarrow’, ‘fastparquet’, new value: ‘deltalake’}, default ‘auto’
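
For illustration, the proposed call would look something like this (purely hypothetical; pandas does not accept this engine value):

```python
import pandas as pd

# Hypothetical usage of the proposed engine value; 'deltalake' is not an
# accepted engine for read_parquet in pandas.
df = pd.read_parquet("/path/to/delta_table", engine="deltalake")
```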

mroeschke commented 2 years ago

Thanks for the suggestion, but based on the response from core devs in https://github.com/pandas-dev/pandas/pull/49692, there isn't much appetite to maintain a read_deltalake function, especially since we already link to the deltalake package, which provides a to_pandas function, in the ecosystem docs: https://pandas.pydata.org/docs/ecosystem.html#deltalake. Closing.
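
For anyone landing here, the route the ecosystem docs point to is the deltalake package itself; a minimal round trip, assuming a recent release that provides write_deltalake, looks roughly like this:

```python
# Round trip via the deltalake package rather than pandas itself.
# Assumes a recent deltalake release that includes write_deltalake;
# the table path is hypothetical.
import pandas as pd
from deltalake import DeltaTable, write_deltalake

df = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})

# Write the DataFrame out as a Delta table.
write_deltalake("/tmp/example_delta_table", df)

# Read it back into pandas.
round_tripped = DeltaTable("/tmp/example_delta_table").to_pandas()
print(round_tripped)
```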