lancedb / lance

Modern columnar data format for ML and LLMs implemented in Rust. Convert from Parquet in 2 lines of code for 100x faster random access, vector index, and data versioning. Compatible with Pandas, DuckDB, Polars, PyArrow, and PyTorch, with more integrations coming.
https://lancedb.github.io/lance/
Apache License 2.0

Create `LanceTableFactory` implementing DataFusion's `TableProviderFactory` trait #3157

Open matthewmturner opened 4 days ago

matthewmturner commented 4 days ago

I would like to add Lance as a supported file type in dft, similar to how we currently support deltalake and are working on Hudi / Iceberg support. All of these formats are accessed via DataFusion's `TableProviderFactory`. I see that `TableProvider` is already implemented, so I am hoping that can be extended.

westonpace commented 4 days ago

I've made a quick PR to expose the table provider.

TableProviderFactory is a little interesting. I might need some help making sure I understand the various inputs.

Is the intention to open up an existing table from a location? Or is the intention to create a brand new empty table? Or both?

Does this sound correct?

matthewmturner commented 4 days ago

@westonpace appreciate your quick and thoughtful response.

The intent here is to be able to write DDL like the following so that I can start reading the Lance format (I believe the `TableProviderFactory` may also enable writing to the format, but I think that would only be the case if writing is implemented by the `TableProvider`; don't quote me on this, though):

CREATE EXTERNAL TABLE my_table STORED AS LANCE LOCATION '/path/to/lance';
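
To make the flow concrete, here is a rough, std-only sketch of how I understand a statement like the one above gets routed to a factory. Everything in it is a simplified stand-in that I made up for illustration: the `Session`, `CreateExternalTable`, `TableProvider`, and `TableProviderFactory` types only mimic the shape of the DataFusion ones (the real trait is async and takes session state as well), and `LanceTable` / `LanceTableFactory` are hypothetical names, not existing Lance APIs.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Simplified stand-in for a parsed `CREATE EXTERNAL TABLE` statement.
#[derive(Debug, Clone)]
pub struct CreateExternalTable {
    pub name: String,      // e.g. "my_table"
    pub file_type: String, // the `STORED AS` value, e.g. "LANCE"
    pub location: String,  // the `LOCATION` value
}

/// Simplified stand-in for a table provider (the real trait does much more).
pub trait TableProvider {
    fn location(&self) -> &str;
}

/// Simplified, synchronous stand-in for a table provider factory.
pub trait TableProviderFactory {
    fn create(&self, cmd: &CreateExternalTable) -> Result<Arc<dyn TableProvider>, String>;
}

/// Hypothetical provider that only remembers where the dataset lives.
pub struct LanceTable {
    location: String,
}

impl TableProvider for LanceTable {
    fn location(&self) -> &str {
        &self.location
    }
}

/// Hypothetical v1 factory: it only "opens" an existing location.
pub struct LanceTableFactory;

impl TableProviderFactory for LanceTableFactory {
    fn create(&self, cmd: &CreateExternalTable) -> Result<Arc<dyn TableProvider>, String> {
        if cmd.location.is_empty() {
            return Err("LOCATION must not be empty".to_string());
        }
        // A real implementation would open the Lance dataset here.
        Ok(Arc::new(LanceTable {
            location: cmd.location.clone(),
        }))
    }
}

/// Toy session: maps `STORED AS` file types to factories, and table names to providers.
pub struct Session {
    factories: HashMap<String, Arc<dyn TableProviderFactory>>,
    tables: HashMap<String, Arc<dyn TableProvider>>,
}

impl Session {
    pub fn new() -> Self {
        Session {
            factories: HashMap::new(),
            tables: HashMap::new(),
        }
    }

    /// Register a factory under a case-insensitive file type such as "LANCE".
    pub fn register_factory(&mut self, file_type: &str, factory: Arc<dyn TableProviderFactory>) {
        self.factories.insert(file_type.to_uppercase(), factory);
    }

    /// Look up the factory for the statement's file type and register the table it builds.
    pub fn create_external_table(&mut self, cmd: CreateExternalTable) -> Result<(), String> {
        let key = cmd.file_type.to_uppercase();
        let factory = self
            .factories
            .get(&key)
            .ok_or_else(|| format!("no table factory registered for file type {key}"))?;
        let provider = factory.create(&cmd)?;
        self.tables.insert(cmd.name.clone(), provider);
        Ok(())
    }

    /// Fetch a previously created table by name.
    pub fn table(&self, name: &str) -> Option<Arc<dyn TableProvider>> {
        self.tables.get(name).cloned()
    }
}
```

In this toy model, registering `LanceTableFactory` under `"LANCE"` is what lets the `STORED AS LANCE` clause resolve; an unregistered file type comes back as an error instead of a table.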

Here is an example of how we use the DeltaTableFactory for this purpose.

Unfortunately, I'm not familiar enough with Lance semantics to answer the specifics of how that maps to `TableProviderFactory` (but I'm hoping to start learning more about it, hence this issue ;) ). Here is some documentation on how it works, though, which can hopefully help.

To the extent it's feasible on your side, I would think a v1 that exposes only the simplest functionality would be reasonable.