lancedb / lance

Modern columnar data format for ML and LLMs implemented in Rust. Convert from Parquet in 2 lines of code for 100x faster random access, vector index, and data versioning. Compatible with Pandas, DuckDB, Polars, and PyArrow, with more integrations coming.
https://lancedb.github.io/lance/
Apache License 2.0

Cache open fragments #2422

Open wjones127 opened 1 month ago

wjones127 commented 1 month ago

For every query, we call FileFragment::open(), which may do some IO, performs schema manipulation, and then retrieves entries from the metadata cache. The first two steps will be optimized in #2420 and #2421. But if we are re-opening fragments often enough for the same version of a dataset, it's perhaps worth considering caching these handles.
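To make the idea concrete, here is a minimal sketch of a handle cache keyed by dataset version and fragment id. Everything in it is an assumption for illustration: `OpenFragment`, `OpenFragmentCache`, and `get_or_open` are hypothetical names, not Lance's actual types or API, and the `open` closure stands in for whatever FileFragment::open() does today.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for an opened fragment handle (not Lance's real type).
#[derive(Debug)]
pub struct OpenFragment {
    pub fragment_id: u64,
    pub num_rows: usize,
}

/// Cache key: (dataset_version, fragment_id).
/// A handle is only reused for the same version of the dataset.
type Key = (u64, u64);

/// Hypothetical cache of already-opened fragment handles.
#[derive(Default)]
pub struct OpenFragmentCache {
    handles: Mutex<HashMap<Key, Arc<OpenFragment>>>,
}

impl OpenFragmentCache {
    /// Return the cached handle if present; otherwise run `open` once,
    /// store the result, and return it.
    ///
    /// Simplification: the lock is held while `open` runs. A real
    /// implementation would likely avoid holding a lock across IO.
    pub fn get_or_open<F>(&self, version: u64, fragment_id: u64, open: F) -> Arc<OpenFragment>
    where
        F: FnOnce() -> OpenFragment,
    {
        let mut handles = self.handles.lock().unwrap();
        let entry = handles
            .entry((version, fragment_id))
            .or_insert_with(|| Arc::new(open()));
        Arc::clone(entry)
    }
}
```

Keying on the version means a new dataset version never sees stale handles; old entries would still need an eviction policy, which this sketch leaves out.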

The main wrinkle is that these handles hold pointers to other data that is in the file metadata cache, which may pose issues for eviction policies (double counting size, evicting wrong items, etc.).
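A rough sketch of the double-counting concern, under the assumption that a cached handle shares metadata with the metadata cache through an `Arc`. The names `FileMetadata`, `CachedHandle`, `naive_size`, `shallow_size`, and `is_pinned` are hypothetical and only illustrate the accounting problem, not how Lance measures cache size.

```rust
use std::mem::size_of;
use std::sync::Arc;

/// Hypothetical metadata entry owned by the file metadata cache.
pub struct FileMetadata {
    pub page_table: Vec<u64>,
}

impl FileMetadata {
    /// Bytes charged to the metadata cache for this entry.
    pub fn deep_size(&self) -> usize {
        size_of::<Self>() + self.page_table.capacity() * size_of::<u64>()
    }
}

/// Hypothetical cached fragment handle that points at shared metadata.
pub struct CachedHandle {
    pub metadata: Arc<FileMetadata>,
}

impl CachedHandle {
    /// Naive weight: follows the pointer, so bytes already charged to the
    /// metadata cache are charged a second time.
    pub fn naive_size(&self) -> usize {
        size_of::<Self>() + self.metadata.deep_size()
    }

    /// Shallow weight: counts only the handle itself.
    pub fn shallow_size(&self) -> usize {
        size_of::<Self>()
    }
}

/// True when something besides the caller's reference still holds the
/// metadata, so evicting it from one cache would not free the memory.
pub fn is_pinned(metadata: &Arc<FileMetadata>) -> bool {
    Arc::strong_count(metadata) > 1
}
```

With shallow weights the two caches never count the same bytes twice, but an evicted metadata entry can remain alive for as long as a cached handle references it, which is the "evicting wrong items" side of the problem.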

wjones127 commented 1 month ago

[Image: flamegraph2]

Here is the flamegraph where we measured this impact. It covers loading a single fragment, with the IO already cached.