Someone else can chime in but I imagine the first step here would be for Databricks to write a connector to Apache Arrow. Pandas uses Arrow for Parquet support. If Arrow were to support Delta then this would be a simpler add-in. You may wish to suggest this to them.
That said, this might not be possible, at least to support Delta for its primary use case of ACID transaction support. It appears the current connectors at least assume something that uses or can use a Hive metastore (e.g. Spark, Hive, Presto, Athena) to preserve the atomicity guarantees of the Delta Lake format. Delta Lake rewrites the Hive metastore files atomically to preserve its ACID guarantees. To my knowledge, Arrow itself is not Hive-metastore aware and so would not be a candidate for this; pandas certainly is not Hive-metastore aware. Pandas or Arrow could in theory read the underlying files as with Parquet, but since the point of Delta is to support ACID transactions, the metastore is crucial in a way that isn't true of a standard Parquet table. So if you need that support, you're back to using something that can use a Hive metastore to read the files, i.e. Spark.
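To make that last point concrete, here is a minimal sketch (the `./my_delta_table` path is only a placeholder): pandas can read the raw Parquet data files inside a Delta table directory, but it never consults the `_delta_log` transaction log, so it may pick up stale or removed files and gives none of Delta's ACID/snapshot guarantees.

```python
import pandas as pd

# Reads the Parquet data files under the table directory. With the default
# pyarrow engine, entries prefixed with "_" (such as _delta_log) are typically
# skipped, but the transaction log is never interpreted, so deleted or
# rewritten files may still be included in the result.
df = pd.read_parquet("./my_delta_table")
```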
@Sutyke doesn't https://github.com/delta-io/delta-rs/tree/main/python#usage (i.e. https://pypi.org/project/deltalake/) already solve this use case? See the examples in that readme on how to load a Delta table into a pandas DataFrame. Delta-rs natively reads Delta tables without the need for any other external dependencies or frameworks like the JVM, Hive or Spark. It also preserves ACID guarantees during reads/writes. The core of that deltalake PyPI package is a full cross-platform Delta Lake implementation in pure Rust.
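For reference, a minimal sketch of what that readme describes, assuming the deltalake package is installed and a Delta table lives at the placeholder path `./my_delta_table`:

```python
from deltalake import DeltaTable

# Open the table by reading its _delta_log; no Spark, JVM, or Hive required.
dt = DeltaTable("./my_delta_table")

# Materialize the current snapshot of the table as a pandas DataFrame.
df = dt.to_pandas()
print(df.head())
```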
https://github.com/delta-io/connectors on the other hand requires the JVM and Hive. It's for accessing Delta tables outside of Spark from other JVM-based big data query engines.
Perhaps we could add https://pypi.org/project/deltalake/ as an optional extra dependency to pandas itself to make deltalake support work out of the box for pandas users?
@houqp this would be a great idea, some way to add it as an extra dependency.
@Liam3851 Currently, for example, pandas.read_parquet has an engine parameter. As Delta Lake tables are actually Parquet files, what about just adding an extra engine, deltalake? If engine is set to deltalake, it is used to read Parquet files in Delta Lake format:

engine : {'auto', 'pyarrow', 'fastparquet', new option: 'deltalake'}, default 'auto'
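For clarity, this is what the proposed call might look like; it is purely hypothetical, since pandas does not ship a deltalake engine, and the path is a placeholder:

```python
import pandas as pd

# Hypothetical: engine="deltalake" is the proposal above, not an existing option.
df = pd.read_parquet("./my_delta_table", engine="deltalake")
```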
Thanks for the suggestion, but based on the response from core devs in https://github.com/pandas-dev/pandas/pull/49692, there isn't much appetite to maintain a read_deltalake function (especially since we link to this in the ecosystem docs, where the deltalake package provides a to_pandas function: https://pandas.pydata.org/docs/ecosystem.html#deltalake). Closing.
I'd love it if pandas could support Databricks' Delta Lake format (https://github.com/delta-io/delta). It's a versioned, Parquet-based file format that supports updates, inserts and deletions.
In the past Delta Lake ran only on Spark, but now there are connectors where Spark is not required:
https://pypi.org/project/deltalake/
https://github.com/delta-io/connectors
https://github.com/delta-io/delta-rs/tree/main/python