lancedb / lance

Modern columnar data format for ML and LLMs implemented in Rust. Convert from Parquet in 2 lines of code for 100x faster random access, vector index, and data versioning. Compatible with Pandas, DuckDB, Polars, PyArrow, and PyTorch, with more integrations coming.
https://lancedb.github.io/lance/
Apache License 2.0

perf: slow planning when filter is present #3127

Closed chebbyChefNEQ closed 3 days ago

chebbyChefNEQ commented 1 week ago

repro

In [12]: %timeit ds_s3.scanner(limit=100).explain_plan()
281 μs ± 682 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [13]: %timeit ds_s3.scanner(filter="true", limit=100).explain_plan()
24.4 ms ± 2.58 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [14]: %timeit ds_local.scanner(limit=100).explain_plan()
275 μs ± 623 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [15]: %timeit ds_local.scanner(filter="true", limit=100).explain_plan()
831 μs ± 3.17 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
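For context, a minimal sketch of how a repro like this could be set up. The local path and S3 URI below are placeholders, not values from the issue; only `lance.dataset`, `scanner(filter=..., limit=...)`, and `explain_plan()` are taken from the usage shown above.

import lance

# Hypothetical dataset locations; substitute real paths/URIs.
ds_local = lance.dataset("/tmp/my_dataset.lance")
ds_s3 = lance.dataset("s3://my-bucket/my_dataset.lance")

# Planning only, no execution: explain_plan() returns the query plan as text.
plan_no_filter = ds_s3.scanner(limit=100).explain_plan()
plan_with_filter = ds_s3.scanner(filter="true", limit=100).explain_plan()
print(plan_with_filter)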

Planning a query with a filter is much slower than planning the same query without one, and the gap is especially large when the dataset lives on S3. This suggests we are repeatedly doing I/O during planning and should cache it instead.

westonpace commented 6 days ago

I believe the I/O happens when we look up which columns have eligible scalar indices. I think there's a caching opportunity here that we are probably missing.
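A rough sketch of the caching idea being described, for illustration only; the names and structure below are hypothetical and do not correspond to Lance internals. The point is to memoize the scalar-index lookup so repeated planning calls don't redo the object-store reads.

# Hypothetical illustration of caching the scalar-index column lookup.
class IndexMetadataCache:
    def __init__(self, load_indexed_columns):
        # load_indexed_columns is the expensive call that reads index
        # metadata (e.g. from S3) to find columns with scalar indices.
        self._load = load_indexed_columns
        self._cached = None

    def indexed_columns(self):
        if self._cached is None:        # first planning call pays the I/O cost
            self._cached = self._load()
        return self._cached             # later planning calls reuse the result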

chebbyChefNEQ commented 3 days ago

fixed by https://github.com/lancedb/lance/pull/3131