lancedb / lance

Modern columnar data format for ML and LLMs implemented in Rust. Convert from parquet in 2 lines of code for 100x faster random access, vector index, and data versioning. Compatible with Pandas, DuckDB, Polars, Pyarrow, with more integrations coming.
https://lancedb.github.io/lance/
Apache License 2.0

Pushdown limit/offset is doing the wrong thing in the presence of deletes #2894

Closed: westonpace closed this issue 1 month ago

westonpace commented 1 month ago

For example, when we are asked to grab (offset=2K, limit=10K) rows from a fragment that has 5K physical rows and 1K deleted rows, we correctly determine that we can only return 2K rows (4K live rows minus the 2K offset). However, we then grab 2K physical rows (applying the offset in physical row space) and apply deletions on top of this, which returns both the wrong rows and too few of them.

Instead, we need to scan the deletion vector to figure out the true physical offset (2K + X, where X is the number of rows deleted before that point), and we need to grab more rows than asked for so that we have the correct number after deletion (reading up to something like physical row 4K + Y, where Y accounts for the deleted rows we have to skip over).
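
Here is a minimal sketch of that remapping, assuming the deletion vector can be materialized as a sorted slice of deleted physical row offsets; the function name and signature are hypothetical, not Lance's actual API:

```rust
/// Hypothetical helper: translate a logical (post-delete) offset/limit
/// into a physical row range within a single fragment.
///
/// `deleted` is a sorted list of deleted physical row offsets.
/// Returns `(physical_offset, physical_len)`.
fn logical_to_physical(deleted: &[u32], offset: u32, limit: u32) -> (u32, u32) {
    let mut i = 0;

    // Find the physical row holding the `offset`-th live row: every
    // deletion at or before the candidate position pushes it one row
    // to the right (the "2K + X" step above).
    let mut start = offset;
    while i < deleted.len() && deleted[i] <= start {
        start += 1;
        i += 1;
    }

    // Grow the range until it holds `limit` live rows: every deletion
    // inside the range forces one extra physical row (the "+ Y" step).
    let mut end = start + limit; // exclusive
    while i < deleted.len() && deleted[i] < end {
        end += 1;
        i += 1;
    }

    (start, end - start)
}

fn main() {
    // The scenario above, picking the case where all 1K deletions land
    // before physical row 2K: the 2000th live row is physical row 3000
    // and no extra padding is needed at the tail.
    let deleted: Vec<u32> = (0..1000).collect();
    assert_eq!(logical_to_physical(&deleted, 2000, 2000), (3000, 2000));
}
```

Note that both loops need the deletion vector in hand before any data is read, which is what serializes the two reads discussed below.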

This is slightly slower, since we can no longer parallelize the read of the data with the read of the deletion vector (we must have the deletion vector before reading data), but it shouldn't be drastically slower and is probably the best we can do for range reads at the moment. We can do more sophisticated things to speed this up should it ever become a bottleneck.

westonpace commented 1 month ago

Closed by #2895