lancedb / lance

Modern columnar data format for ML and LLMs implemented in Rust. Convert from parquet in 2 lines of code for 100x faster random access, vector index, and data versioning. Compatible with Pandas, DuckDB, Polars, Pyarrow, and PyTorch with more integrations coming.
https://lancedb.github.io/lance/
Apache License 2.0
3.97k stars 224 forks

feat: start recording index details in the manifest, cache index type lookup #3131

Closed westonpace closed 6 days ago

westonpace commented 6 days ago

This addresses a specific problem. When a dataset had a scalar index on a string column we would perform I/O during the planning phase on every query that contained a filter. This added considerable latency (especially against S3) to query times.

We now cache that lookup.
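
To illustrate the idea, here is a minimal synchronous sketch of caching an index type lookup by index UUID. It is not the code in this PR: `IndexType`, `IndexTypeCache`, and `get_or_load` are hypothetical names, and the `load` closure stands in for the I/O that would otherwise run during planning.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Hypothetical index kinds, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum IndexType {
    BTree,
    Bitmap,
}

/// Remembers the index type per index UUID so the expensive lookup
/// runs at most once per index instead of once per query.
pub struct IndexTypeCache {
    entries: Mutex<HashMap<String, IndexType>>,
}

impl IndexTypeCache {
    pub fn new() -> Self {
        Self {
            entries: Mutex::new(HashMap::new()),
        }
    }

    /// Returns the cached type if present. Otherwise calls `load`
    /// (a stand-in for reading the index from storage) and caches the result.
    /// The lock is held across `load` to keep the sketch short; an async
    /// implementation would avoid holding a lock during I/O.
    pub fn get_or_load<F>(&self, uuid: &str, load: F) -> IndexType
    where
        F: FnOnce() -> IndexType,
    {
        let mut entries = self.entries.lock().unwrap();
        if let Some(ty) = entries.get(uuid).copied() {
            return ty;
        }
        let ty = load();
        entries.insert(uuid.to_string(), ty);
        ty
    }
}
```

With a cache like this, only the first filtered query against a given index pays for the lookup; later queries are answered from memory.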

It also starts to tackle a more central problem. Right now our manifest stores very little information about indices (pretty much just the UUID). Any further information must be obtained by loading the index. This PR introduces the concept of "index details", a spot where an index can put index-specific information (e.g. specific to btree or specific to bitmap) that can be accessed during planning just by looking at the manifest. At the moment this concept is still fairly bare bones, but I think this information will become useful as scalar indices become more sophisticated.
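
As a rough sketch of the "index details" idea (an illustration under assumptions, not the actual manifest schema): the type and field names below (`IndexDetails`, `IndexMetadata`, `PlanLookup`, `plan_lookup`) are hypothetical. The point is only that when details are present in the manifest entry, planning can use them directly, and when they are absent (e.g. an entry written before this change) planning has to fall back to loading the index.

```rust
/// Hypothetical index-specific details stored inline in the manifest.
/// The variants carry no fields here, matching the "bare bones" state
/// described above.
#[derive(Debug, Clone, PartialEq)]
pub enum IndexDetails {
    BTree,
    Bitmap,
}

/// Hypothetical manifest entry for one index.
#[derive(Debug, Clone, PartialEq)]
pub struct IndexMetadata {
    pub uuid: String,
    /// `None` models an entry that recorded nothing beyond the UUID.
    pub details: Option<IndexDetails>,
}

/// What the planner can learn from the manifest entry alone.
#[derive(Debug, Clone, PartialEq)]
pub enum PlanLookup {
    /// Details were in the manifest, so no I/O is needed.
    FromManifest(IndexDetails),
    /// No details recorded, so the index itself must be loaded.
    MustLoadIndex,
}

/// Decides whether planning can proceed from the manifest alone.
pub fn plan_lookup(meta: &IndexMetadata) -> PlanLookup {
    match &meta.details {
        Some(details) => PlanLookup::FromManifest(details.clone()),
        None => PlanLookup::MustLoadIndex,
    }
}
```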

If we decide we don't want it then I can pull it out as well and dial this PR back to just the caching component.

wjones127 commented 6 days ago

We discussed some similar index changes earlier, proposed here:

https://github.com/lancedb/lancedb/issues/1666

It looks like this is a good step in that direction by adding the index_config / index_details field 👍

codecov-commenter commented 6 days ago

Codecov Report

Attention: Patch coverage is 66.05505% with 37 lines in your changes missing coverage. Please review.

Project coverage is 77.90%. Comparing base (f257489) to head (e481fd4). Report is 1 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| rust/lance/src/index/scalar.rs | 67.34% | 12 Missing and 4 partials :warning: |
| rust/lance/src/index/cache.rs | 9.09% | 10 Missing :warning: |
| rust/lance/src/index.rs | 59.09% | 4 Missing and 5 partials :warning: |
| rust/lance/src/dataset/scanner.rs | 95.00% | 1 Missing :warning: |
| rust/lance/src/io/commit.rs | 66.66% | 0 Missing and 1 partial :warning: |

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main    #3131      +/-   ##
==========================================
- Coverage   77.91%   77.90%   -0.01%
==========================================
  Files         240      240
  Lines       81564    81459     -105
  Branches    81564    81459     -105
==========================================
- Hits        63550    63464      -86
- Misses      14806    14815       +9
+ Partials     3208     3180      -28
```

| [Flag](https://app.codecov.io/gh/lancedb/lance/pull/3131/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=lancedb) | Coverage Δ | |
|---|---|---|
| [unittests](https://app.codecov.io/gh/lancedb/lance/pull/3131/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=lancedb) | `77.90% <66.05%> (-0.01%)` | :arrow_down: |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=lancedb#carryforward-flags-in-the-pull-request-comment) to find out more.
