lancedb / lance

Modern columnar data format for ML and LLMs implemented in Rust. Convert from parquet in 2 lines of code for 100x faster random access, vector index, and data versioning. Compatible with Pandas, DuckDB, Polars, PyArrow, with more integrations coming.
https://lancedb.github.io/lance/
Apache License 2.0

Support for Arrow PyCapsule Interface #2630

Open kylebarron opened 2 months ago

kylebarron commented 2 months ago

👋 The Arrow project recently created the Arrow PyCapsule Interface, a new protocol for sharing Arrow data in Python. Among its goals is allowing Arrow data interchange without requiring the use of pyarrow, but I'm also excited about the prospect of an ecosystem that can share data solely through the presence of these dunder methods, where producer and consumer don't need prior knowledge of each other.

I'm trying to promote usage of this protocol throughout the Python Arrow ecosystem.

On the write side, through write_dataset, it looks like coerce_reader does not yet check for __arrow_c_stream__. It would be awesome if coerce_reader could check for __arrow_c_stream__ and just call pyarrow.RecordBatchReader.from_stream. In the longer term, you could potentially remove the pyarrow dependency altogether, though I understand if that's not a priority.
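The dispatch described above could be sketched roughly like this (a hypothetical sketch, not lance's actual `coerce_reader`; the helper name and fallback are made up, and only the `__arrow_c_stream__` dunder and `pyarrow.RecordBatchReader.from_stream` come from the protocol and the pyarrow API):

```python
def supports_arrow_stream(data) -> bool:
    """True if `data` exposes the Arrow PyCapsule stream protocol."""
    return hasattr(data, "__arrow_c_stream__")


def coerce_reader(data):
    # Hypothetical sketch: any producer implementing the protocol can be
    # consumed without lance knowing its concrete type.
    if supports_arrow_stream(data):
        import pyarrow as pa

        # from_stream consumes the stream capsule at the C level,
        # with no per-batch Python loop.
        return pa.RecordBatchReader.from_stream(data)
    raise TypeError(f"cannot coerce {type(data).__name__} to a RecordBatchReader")
```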

On the read side, would you consider changing the return type of to_batches to something like a pyarrow.RecordBatchReader? This would potentially not even be a backwards incompatible change, because the RecordBatchReader still acts as an iterator of RecordBatch, but it also has the benefit of holding the Arrow iterator at the C level, so it can be passed to other compiled code without needing to iterate the Python loop.

Maybe there are some classes that make sense to have __arrow_c_stream__ defined on them directly? Maybe the LanceFragment? It might not make sense if there are still required parameters to materialize an Arrow stream, like a column projection or an expression.

Edit: on top, it would also be awesome to integrate the pycapsule interface with LanceSchema

wjones127 commented 2 months ago

Yes, I think this is something we'd be happy to support.

> On the read side, would you consider changing the return type of to_batches to something like a pyarrow.RecordBatchReader? This would potentially not even be a backwards incompatible change, because the RecordBatchReader still acts as an iterator of RecordBatch,

I agree that should probably be a RBR.

> Maybe there are some classes that make sense to have __arrow_c_stream__ defined on them directly? Maybe the LanceFragment? It might not make sense if there are still required parameters to materialize an Arrow stream, like a column projection or an expression.

Yeah, I don't think that would make sense. LanceFragment doesn't represent in-memory data, just something on disk that can be scanned. I think it should instead just have a to_batches() method, which I believe it does.

> Edit: on top, it would also be awesome to integrate the pycapsule interface with LanceSchema

Yeah that would make a lot of sense.

westonpace commented 2 months ago

> On the read side, would you consider changing the return type of to_batches to something like a pyarrow.RecordBatchReader? This would potentially not even be a backwards incompatible change, because the RecordBatchReader still acts as an iterator of RecordBatch, but it also has the benefit of holding the Arrow iterator at the C level, so it can be passed to other compiled code without needing to iterate the Python loop.

I don't remember if we return a RecordBatchReader here already or not. However, if we don't, I agree we should be returning something that supports __arrow_c_stream__. Other than the inputs / outputs, which I think you have covered (also, merge_insert, and wherever else we accept / consume RBR), I'm not sure we have much else mapping to __arrow_c_stream__.

kylebarron commented 2 months ago

> I don't remember if we return a RecordBatchReader here already or not. However, if we don't, I agree we should be returning something that supports __arrow_c_stream__.

If I'm reading this correctly, to_batches currently returns a Python iterator

https://github.com/lancedb/lance/blob/fa089be2bf0457e6bf8a92c6ef67e43e4c0c3177/python/python/lance/dataset.py#L2345-L2346

kylebarron commented 2 months ago

Since LanceSchema already has pyarrow interop, https://github.com/lancedb/lance/blob/fa089be2bf0457e6bf8a92c6ef67e43e4c0c3177/python/src/schema.rs#L47-L61

it might as well expose/ingest C schemas too. You could easily reuse the pyarrow dunders if you don't want to manage the Rust FFI yourselves.
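The delegation being suggested is small. A hypothetical Python-side sketch (the class layout and `to_pyarrow` name are assumptions modeled on the existing interop linked above; only the `__arrow_c_schema__` dunder comes from the PyCapsule spec):

```python
class LanceSchema:
    """Hypothetical sketch: forward PyCapsule schema export to pyarrow
    instead of implementing the Arrow C schema FFI in Rust."""

    def __init__(self, pa_schema):
        self._pa_schema = pa_schema

    def to_pyarrow(self):
        # Stand-in for the existing pyarrow interop: returns a pyarrow.Schema.
        return self._pa_schema

    def __arrow_c_schema__(self):
        # Delegate: pyarrow.Schema already knows how to build the capsule.
        return self.to_pyarrow().__arrow_c_schema__()
```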