For every query, we call `FileFragment::open()`, which may do some IO, some schema manipulation, and then retrieve items from the metadata cache. The first two will be optimized in #2420 and #2421. But if we are re-opening the same version of a dataset often enough, it's perhaps worth considering caching these handles.
The main wrinkle is that these handles hold pointers to other data in the file metadata cache, which may complicate eviction policies (double-counting sizes, evicting the wrong items, etc.).
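To make the double-counting hazard concrete, here is a minimal sketch of a handle cache keyed by (dataset version, fragment id). All type and function names here (`OpenedFragment`, `FileMetadata`, `FragmentHandleCache`, `naive_size`) are illustrative assumptions, not Lance's actual API; the point is that a cached handle holds an `Arc` into metadata the file metadata cache may also hold, so naively summing sizes across both caches inflates the total.

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Hypothetical stand-in for data that also lives in the file metadata cache.
struct FileMetadata {
    approx_size: usize,
}

// A cached "opened fragment" handle. It keeps an Arc to metadata that the
// file metadata cache may hold too: this shared ownership is what makes
// size accounting for eviction tricky (the same bytes can be counted twice).
struct OpenedFragment {
    metadata: Arc<FileMetadata>,
}

// Handle cache keyed by (dataset version, fragment id), so re-opening the
// same fragment of the same dataset version reuses the existing handle.
struct FragmentHandleCache {
    handles: HashMap<(u64, u64), Arc<OpenedFragment>>,
}

impl FragmentHandleCache {
    fn new() -> Self {
        Self {
            handles: HashMap::new(),
        }
    }

    // Return the cached handle, or open (and cache) a new one.
    fn get_or_open(
        &mut self,
        version: u64,
        fragment_id: u64,
        open: impl FnOnce() -> OpenedFragment,
    ) -> Arc<OpenedFragment> {
        Arc::clone(
            self.handles
                .entry((version, fragment_id))
                .or_insert_with(|| Arc::new(open())),
        )
    }

    // Naive accounting: sums metadata reachable from each handle. If the
    // file metadata cache reports the same bytes independently, a combined
    // total over both caches double-counts, and an eviction policy driven
    // by that total may evict the wrong items.
    fn naive_size(&self) -> usize {
        self.handles.values().map(|h| h.metadata.approx_size).sum()
    }
}
```

A correct policy would likely need either shared-ownership-aware accounting (e.g. charging the metadata to one cache only) or linked eviction, so evicting a metadata entry also invalidates handles that point at it.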