Lance scalar index search loads dataset metadata (which should be cached)

lancedb / lance

Modern columnar data format for ML and LLMs implemented in Rust. Convert from parquet in 2 lines of code for 100x faster random access, vector index, and data versioning. Compatible with Pandas, DuckDB, Polars, Pyarrow, with more integrations coming.

https://lancedb.github.io/lance/

Apache License 2.0

3.37k stars, 175 forks

Lance scalar index search loads dataset metadata (which should be cached) #2313

Closed. westonpace closed this issue 2 weeks ago

westonpace commented 3 weeks ago

Tracking the I/O of a scalar index search, we can see that the search loads the dataset metadata (it appears to do so twice). That data should already be cached. Loading it can be quite costly and defeats the purpose of doing an indexed search in the first place. This gets even worse when a dataset has many fragments, because the manifest is then quite large.
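To illustrate the expected behavior, here is a minimal Python sketch of a metadata cache sitting in front of an expensive manifest read. It is not Lance's implementation: `ManifestCache`, `fake_load_manifest`, `scalar_index_search`, and the `memory://example` URI are all hypothetical names invented for this example. The only point is that repeated lookups during a search should reach storage at most once.

```python
from typing import Callable, Dict, List, Tuple


class ManifestCache:
    """Hypothetical in-memory cache keyed by (dataset URI, version)."""

    def __init__(self, loader: Callable[[str, int], bytes]) -> None:
        self._loader = loader
        self._entries: Dict[Tuple[str, int], bytes] = {}
        self.loads = 0  # how many times storage was actually read

    def get(self, uri: str, version: int) -> bytes:
        key = (uri, version)
        if key not in self._entries:
            # Cache miss: pay for the expensive read exactly once per key.
            self.loads += 1
            self._entries[key] = self._loader(uri, version)
        return self._entries[key]


def fake_load_manifest(uri: str, version: int) -> bytes:
    # Stand-in for an object-store read (assumption: the real manifest
    # is much larger and grows with the number of fragments).
    return f"manifest for {uri}@v{version}".encode()


def scalar_index_search(cache: ManifestCache, uri: str, version: int) -> List[int]:
    # Two metadata lookups per search, mirroring the two loads
    # observed in the I/O trace described above.
    cache.get(uri, version)
    cache.get(uri, version)
    return []  # placeholder result; the index lookup itself is omitted


cache = ManifestCache(fake_load_manifest)
scalar_index_search(cache, "memory://example", 1)
scalar_index_search(cache, "memory://example", 1)
print(cache.loads)
```

With the cache in place, two searches (four metadata lookups in total) trigger a single read. Without it, the same workload would read the manifest four times.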