Modern columnar data format for ML and LLMs implemented in Rust. Convert from Parquet in 2 lines of code for 100x faster random access, vector indexing, and data versioning. Compatible with Pandas, DuckDB, Polars, PyArrow, and PyTorch, with more integrations coming.
In [12]: %timeit ds_s3.scanner(limit=100).explain_plan()
281 μs ± 682 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [13]: %timeit ds_s3.scanner(filter="true", limit=100).explain_plan()
24.4 ms ± 2.58 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [14]: %timeit ds_local.scanner(limit=100).explain_plan()
275 μs ± 623 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [15]: %timeit ds_local.scanner(filter="true", limit=100).explain_plan()
831 μs ± 3.17 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Planning a query with a filter is much slower than planning one without, and the gap is especially obvious over a dataset on S3. This suggests we are repeatedly doing I/O during planning and should cache it instead.
I believe the I/O happens when we look up which columns have eligible scalar indices. There is likely a caching opportunity here that we are missing.
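To illustrate the kind of caching that could help: if the expensive step is reading index metadata to learn which columns have scalar indices, a cache keyed by (dataset URI, version) would amortize that cost across plans while still invalidating naturally when the dataset advances. This is a minimal sketch, not Lance's internals; `load_indexed_columns` and the cache shape are hypothetical:

```python
import functools

# Hypothetical stand-in for the slow I/O: listing which columns have
# eligible scalar indices. In Lance this would read index metadata from
# object storage, which is where the S3 planning cost likely comes from.
IO_CALLS = {"count": 0}

def load_indexed_columns(dataset_uri: str, version: int) -> frozenset:
    IO_CALLS["count"] += 1
    # ... pretend this reads manifest/index metadata over the network ...
    return frozenset({"id"})

# Cache keyed by (uri, version): writing a new dataset version changes the
# key, so stale index information is never served from the cache.
@functools.lru_cache(maxsize=256)
def indexed_columns_cached(dataset_uri: str, version: int) -> frozenset:
    return load_indexed_columns(dataset_uri, version)

# Planning two filtered queries against the same version hits I/O once.
indexed_columns_cached("s3://bucket/ds.lance", 7)
indexed_columns_cached("s3://bucket/ds.lance", 7)
assert IO_CALLS["count"] == 1
```

The version component of the key is the important design choice: it lets the cache live for the process lifetime without any explicit invalidation hook.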
repro