treeverse / lakeFS

lakeFS - Data version control for your data lake | Git for data
https://docs.lakefs.io
Apache License 2.0

Support lakeFS URIs in Delta-rs + Polars #7268

Open ozkatz opened 6 months ago

ozkatz commented 6 months ago

Loading a Delta table using delta-rs's Python bindings from a lakefs:// URI currently fails:

>>> from deltalake import DeltaTable
>>> from lakefs_spec import LakeFSFileSystem
>>> 
>>> DeltaTable('lakefs://repository/branch/table/')
Traceback (most recent call last):
  File "<input>", line 1, in <module>
    DeltaTable('lakefs://repository/branch/table/')
  File ".../lib/python3.11/site-packages/deltalake/table.py", line 319, in __init__
    self._table = RawDeltaTable(
                  ^^^^^^^^^^^^^^
_internal.TableNotFoundError: No snapshot or version 0 found, perhaps /Users/ozkatz/lakefs:/repository/branch/table/ is an empty dir?

It appears that Delta-rs doesn't recognize the lakefs:// URI scheme the way it recognizes s3, adls and gcs, and instead falls back to treating the URI as a local directory path.
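To illustrate how a path like the one in the error message could arise, here is a minimal Python sketch of scheme-based dispatch with a local-path fallback. This is not delta-rs's actual code: the function name `resolve_table_location` and the `KNOWN_SCHEMES` set are hypothetical and exist only for illustration.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical set of schemes a reader might recognize (illustration only).
KNOWN_SCHEMES = {"s3", "gs", "az", "abfs", "file"}


def resolve_table_location(uri: str, cwd: str) -> str:
    """Return the URI unchanged if its scheme is known, else treat it as a
    path relative to `cwd` (the fallback suspected in this issue)."""
    scheme = urlparse(uri).scheme
    if scheme in KNOWN_SCHEMES:
        return uri
    # Unknown scheme: join onto the working directory and normalize.
    # Normalization collapses the "//" after "lakefs:" into a single "/".
    return posixpath.normpath(posixpath.join(cwd, uri))


print(resolve_table_location("lakefs://repository/branch/table/", "/Users/ozkatz"))
print(resolve_table_location("s3://bucket/table/", "/Users/ozkatz"))
```

The first call yields a local path of the same shape as the one reported in the traceback (the working directory followed by `lakefs:/repository/...`), while the second is passed through untouched.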

Additionally, Polars' polars.read_delta() depends on Delta-rs, so polars + Delta is also broken:

>>> import polars as pl
>>>
>>> pl.read_csv('lakefs://repository/branch/file.csv')  # works!
>>> pl.read_delta('lakefs://repository/branch/table/')  # errors!

The only workaround at the moment is to use the S3 gateway, which means data has to go through the lakeFS server (which in some cases is not possible for security reasons).
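For reference, the S3 gateway workaround can be sketched roughly as follows. The helper names are hypothetical, the endpoint and credentials are placeholders, and the `storage_options` key names are my assumption of what delta-rs's S3 backend accepts, so check them against the delta-rs documentation. The sketch relies on the gateway exposing the repository as the bucket and the branch as the first key segment.

```python
from urllib.parse import urlparse


def lakefs_to_s3_gateway_uri(uri: str) -> str:
    """Map lakefs://<repo>/<branch>/<path> to s3://<repo>/<branch>/<path>."""
    parsed = urlparse(uri)
    if parsed.scheme != "lakefs":
        raise ValueError(f"not a lakefs:// URI: {uri!r}")
    return f"s3://{parsed.netloc}{parsed.path}"


def gateway_storage_options(endpoint: str, key_id: str, secret: str) -> dict:
    """Build storage options pointing the S3 client at the lakeFS server.

    Key names are assumptions; verify them against the delta-rs docs.
    """
    return {
        "AWS_ENDPOINT_URL": endpoint,
        "AWS_ACCESS_KEY_ID": key_id,
        "AWS_SECRET_ACCESS_KEY": secret,
        "AWS_REGION": "us-east-1",  # placeholder region
    }


table_uri = lakefs_to_s3_gateway_uri("lakefs://repository/branch/table/")
options = gateway_storage_options("https://lakefs.example.com", "<key-id>", "<secret>")
print(table_uri)

# With deltalake / polars installed and a reachable lakeFS server:
#   from deltalake import DeltaTable
#   DeltaTable(table_uri, storage_options=options)
#   pl.read_delta(table_uri, storage_options=options)
```

Every read still flows through the lakeFS server here, which is exactly the limitation described above.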

ion-elgreco commented 5 months ago

Hi @ozkatz, I am one of the maintainers of delta-rs. I am also looking at using lakeFS on Azure with Polars and delta-rs :), so I am wondering how you are currently able to use lakeFS with Polars and Azure at all, given that, as you say, we don't support the lakefs:// URI.

Also, supporting the lakefs:// URI likely requires adding it upstream in the object_store crate, which we use in delta-rs.

ozkatz commented 5 months ago

Hi @ion-elgreco, Thanks for the added context!

I'm no expert in Rust, but from the looks of it, implementing the ObjectStore trait should be enough and doesn't necessarily require upstreaming that implementation into the object_store crate. Is that correct?

ion-elgreco commented 5 months ago

That's true. Perhaps you could create a lakefs_store crate, which we could then use as a dependency when adding a deltalake-lakefs crate.