**Open** · kylebarron opened this issue 2 months ago
Yes, I think this is something we'd be happy to support.
On the read side, would you consider changing the return type of to_batches to something like a pyarrow.RecordBatchReader? This would potentially not even be a backwards incompatible change, because the RecordBatchReader still acts as an iterator of RecordBatch,
I agree that should probably be a RBR.
> Maybe there are some classes that make sense to have __arrow_c_stream__ defined on them directly? Maybe the LanceFragment? It might not make sense if there are still required parameters to materialize an Arrow stream, like a column projection or an expression.
Yeah, I don't think that would make sense. LanceFragment doesn't represent in-memory data, just something on disk that can be scanned. I think it should instead just have a to_batches() method, which I believe it does.
> Edit: on top, it would also be awesome to integrate the pycapsule interface with LanceSchema
Yeah that would make a lot of sense.
> On the read side, would you consider changing the return type of to_batches to something like a pyarrow.RecordBatchReader? This would potentially not even be a backwards incompatible change, because the RecordBatchReader still acts as an iterator of RecordBatch, but it also has the benefit of holding the Arrow iterator at the C level, so it can be passed to other compiled code without needing to iterate the Python loop.
I don't remember if we return a RecordBatchReader here already or not. However, if we don't, I agree we should be returning something that supports __arrow_c_stream__. Other than the inputs / outputs, which I think you have covered (also merge_insert, and wherever else we accept / consume a RBR), I'm not sure we have much else mapping to __arrow_c_stream__.
> I don't remember if we return a RecordBatchReader here already or not. However, if we don't, I agree we should be returning something that supports __arrow_c_stream__.
If I'm reading this correctly, to_batches currently returns a plain Python iterator.
Since LanceSchema has pyarrow interop anyway (https://github.com/lancedb/lance/blob/fa089be2bf0457e6bf8a92c6ef67e43e4c0c3177/python/src/schema.rs#L47-L61), it might as well expose/ingest C schemas too. You could easily reuse the pyarrow dunders if you don't want to manage the Rust FFI yourself.
👋 The Arrow project recently created the Arrow PyCapsule Interface, a new protocol for sharing Arrow data in Python. Among its goals is allowing Arrow data interchange without requiring the use of pyarrow, but I'm also excited about the prospect of an ecosystem that can share data only by the presence of dunder methods, where producer and consumer don't have to have prior knowledge of each other.
I'm trying to promote usage of this protocol throughout the Python Arrow ecosystem.
On the write side, through write_dataset, it looks like coerce_reader does not yet check for __arrow_c_stream__. It would be awesome if coerce_reader could check for __arrow_c_stream__ and just call pyarrow.RecordBatchReader.from_stream. In the longer term, you could potentially remove the pyarrow dependency altogether, though I understand if that's not a priority.

On the read side, would you consider changing the return type of to_batches to something like a pyarrow.RecordBatchReader? This would potentially not even be a backwards incompatible change, because the RecordBatchReader still acts as an iterator of RecordBatch, but it also has the benefit of holding the Arrow iterator at the C level, so it can be passed to other compiled code without needing to iterate the Python loop.

Maybe there are some classes that make sense to have __arrow_c_stream__ defined on them directly? Maybe the LanceFragment? It might not make sense if there are still required parameters to materialize an Arrow stream, like a column projection or an expression.

Edit: on top, it would also be awesome to integrate the pycapsule interface with LanceSchema.