Modern columnar data format for ML and LLMs implemented in Rust. Convert from Parquet in 2 lines of code for 100x faster random access, vector indexing, and data versioning. Compatible with Pandas, DuckDB, Polars, PyArrow, and PyTorch, with more integrations coming.
In [12]: %timeit ds_s3.scanner(limit=100).explain_plan()
281 μs ± 682 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [13]: %timeit ds_s3.scanner(filter="true", limit=100).explain_plan()
24.4 ms ± 2.58 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [14]: %timeit ds_local.scanner(limit=100).explain_plan()
275 μs ± 623 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [15]: %timeit ds_local.scanner(filter="true", limit=100).explain_plan()
831 μs ± 3.17 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Planning a query with a filter is much slower than planning one without, and the gap is especially obvious over a dataset on S3. This suggests we are repeatedly doing I/O during planning and should cache it instead.
I believe the I/O happens when we look up which columns have eligible scalar indices. There is likely a caching opportunity here that we are missing.
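To illustrate the kind of caching that could help: if the expensive step is reading index metadata to learn which columns have scalar indices, a cache keyed by (dataset URI, version) would amortize that cost across plans while still invalidating naturally when the dataset advances. This is a minimal sketch, not Lance's internals; `load_indexed_columns` and the cache shape are hypothetical:

```python
import functools

# Hypothetical stand-in for the slow I/O: listing which columns have
# eligible scalar indices. In Lance this would read index metadata from
# object storage, which is where the S3 planning cost likely comes from.
IO_CALLS = {"count": 0}

def load_indexed_columns(dataset_uri: str, version: int) -> frozenset:
    IO_CALLS["count"] += 1
    # ... pretend this reads manifest/index metadata over the network ...
    return frozenset({"id"})

# Cache keyed by (uri, version): writing a new dataset version changes the
# key, so stale index information is never served from the cache.
@functools.lru_cache(maxsize=256)
def indexed_columns_cached(dataset_uri: str, version: int) -> frozenset:
    return load_indexed_columns(dataset_uri, version)

# Planning two filtered queries against the same version hits I/O once.
indexed_columns_cached("s3://bucket/ds.lance", 7)
indexed_columns_cached("s3://bucket/ds.lance", 7)
assert IO_CALLS["count"] == 1
```

The version component of the key is the important design choice: it lets the cache live for the process lifetime without any explicit invalidation hook.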
repro