lancedb / lance

Modern columnar data format for ML and LLMs implemented in Rust. Convert from Parquet in 2 lines of code for 100x faster random access, vector index, and data versioning. Compatible with Pandas, DuckDB, Polars, and PyArrow, with more integrations coming.
https://lancedb.github.io/lance/
Apache License 2.0
3.65k stars, 195 forks

[EPIC] Polars support #1507

Open. wjones127 opened this issue 8 months ago

wjones127 commented 8 months ago
universalmind303 commented 8 months ago

Additionally, Polars has a JS library: https://www.npmjs.com/package/nodejs-polars. It would be cool to add that same level of support to the Lance JS bindings.

wjones127 commented 8 months ago

That would be cool indeed. Related issue: https://github.com/lancedb/lancedb/issues/153

yliang412 commented 7 months ago

Hi @wjones127, I would like to try the tasks mentioned in this issue. Could you assign it to me?

Several initial questions:

  1. Would to_polars() include polars as an optional dependency?
  2. Do we want a DataFrame or a LazyFrame as the return type of to_polars?
  3. I will do more research on how Polars handles projection and predicate pushdown in its lazy API, but does this feature require anything to be done on the Polars side?

I'm also looking for suggestions on where to start. Thanks a lot!

wjones127 commented 7 months ago

> Could you assign me to this task?

Sure, done.

> Would to_polars() include polars as an optional dependency?

Yes. We'd like to make sure we don't need to import it until necessary. Related: #1217
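A minimal sketch of the deferred-import idea described above, not the actual Lance implementation. The helper `import_optional` and the free function `to_polars` are hypothetical names introduced for illustration; the only Polars call used is `pl.from_arrow`, and the import happens only when the conversion is requested.

```python
import importlib
from types import ModuleType


def import_optional(name: str, feature: str) -> ModuleType:
    """Import an optional dependency on first use (hypothetical helper).

    Raises an ImportError that names the feature needing the package,
    so users who never call that feature never pay for the import.
    """
    try:
        return importlib.import_module(name)
    except ImportError as exc:
        raise ImportError(
            f"{feature} requires the optional dependency '{name}'. "
            f"Install it with: pip install {name}"
        ) from exc


def to_polars(arrow_table):
    """Convert an Arrow table to an eager Polars DataFrame (illustrative)."""
    # Polars is imported here, not at module import time.
    pl = import_optional("polars", feature="to_polars()")
    return pl.from_arrow(arrow_table)
```

With this shape, importing the module that defines `to_polars` works even when Polars is not installed; the error appears only when the method is actually called.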

> Do we want a DataFrame or a LazyFrame as the return type of to_polars?

Our other APIs are eager right now, so I'd say DataFrame. We could later add a to_polars_lazy() that returns a LazyFrame if we wanted. However, I think getting the pushdown and such correct would take some work, so we should defer that for later.

> I will do more research on how Polars handles projection and predicate pushdown in its lazy API, but does this feature require anything to be done on the Polars side?

We might already get some of this working via the PyArrow Dataset API. Part of that implementation is here: https://github.com/pola-rs/polars/blob/64bd3455f0d837f888f2d967cc545e2444f844a8/py-polars/polars/io/pyarrow_dataset/anonymous_scan.py#L14

astrojuanlu commented 4 months ago

Is this already done? https://blog.lancedb.com/lancedb-polars-2d5eb32a8aa3/