lancedb / lance

Modern columnar data format for ML and LLMs implemented in Rust. Convert from Parquet in 2 lines of code for 100x faster random access, vector index, and data versioning. Compatible with Pandas, DuckDB, Polars, PyArrow, and PyTorch with more integrations coming.
https://lancedb.github.io/lance/
Apache License 2.0
3.91k stars 215 forks

Is there any roadmap to add Velox readers/writers? #3017

Open jaystarshot opened 1 week ago

jaystarshot commented 1 week ago

Velox framework for vectorized processing - https://github.com/facebookincubator/velox

westonpace commented 1 week ago

I'm not aware of anyone planning to do this but it seems like an interesting project.

jaystarshot commented 1 week ago

I see. What about Arrow C++? If Arrow C++ is already supported, wrapping to Velox formats shouldn't be that difficult.

wjones127 commented 1 week ago

We integrate with PyArrow (which is based on Arrow C++) via the Arrow C Data Interface, so the same approach could be used with Arrow C++ / Velox.

westonpace commented 1 week ago

Yes, I expect the tricky part would not be converting the data (since Velox and Lance both speak the C Data Interface) but building a C++ Velox plugin and aligning the various scan methods. Unfortunately, the last I heard, Velox had planned on dropping Substrait support, so the plugin may also need custom logic to convert Velox expressions to Substrait expressions if it is to support filter pushdown. That said, since the linked issue is still open, it seems support hasn't been removed yet.

jaystarshot commented 6 days ago

Not sure if filter pushdown into scans is a concern for ML use cases. https://www.youtube.com/watch?v=bISBNVtXZ6M, for example, mentions that Nimble doesn't yet have filter pushdown.

westonpace commented 6 days ago

That's a good point. Filter pushdown is most effective with clustered indices and that hasn't yet been a major use case for us either.
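
To illustrate why clustering matters here, below is a toy sketch of one common form of pushdown: skipping whole fragments based on per-fragment min/max statistics. It is plain Python with made-up helper names (`make_fragments`, `scan_with_pushdown`) and is not how Lance or Velox implement scans; it only shows that the same filter prunes far more when the data is sorted on the filtered column.

```python
import random


def make_fragments(values, size):
    """Split values into fixed-size fragments and record min/max for each."""
    fragments = []
    for start in range(0, len(values), size):
        chunk = values[start:start + size]
        fragments.append({"min": min(chunk), "max": max(chunk), "rows": chunk})
    return fragments


def scan_with_pushdown(fragments, lo, hi):
    """Return rows with lo <= value <= hi and the number of fragments read."""
    fragments_read = 0
    matches = []
    for frag in fragments:
        # The statistics prove that no row in this fragment can match.
        if frag["max"] < lo or frag["min"] > hi:
            continue
        fragments_read += 1
        matches.extend(v for v in frag["rows"] if lo <= v <= hi)
    return matches, fragments_read


values = list(range(1000))

# Clustered layout: rows are sorted on the filtered column.
clustered = make_fragments(values, 100)

# Unclustered layout: the same rows in random order (fixed seed).
shuffled_values = values[:]
random.Random(0).shuffle(shuffled_values)
shuffled = make_fragments(shuffled_values, 100)

rows_c, read_c = scan_with_pushdown(clustered, 250, 260)
rows_s, read_s = scan_with_pushdown(shuffled, 250, 260)

print("clustered fragments read:", read_c)
print("shuffled fragments read:", read_s)
```

Both scans return the same rows, but in the clustered layout the range 250 to 260 falls inside a single fragment, so the other nine are skipped without being read.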