lancedb / lance

Modern columnar data format for ML and LLMs implemented in Rust. Convert from parquet in 2 lines of code for 100x faster random access, vector index, and data versioning. Compatible with Pandas, DuckDB, Polars, Pyarrow, and PyTorch with more integrations coming.
https://lancedb.github.io/lance/
Apache License 2.0
3.97k stars 224 forks

`load_indices` should include index config #2039

Open westonpace opened 8 months ago

westonpace commented 8 months ago

Ideally the output from load_indices would let me know both the type of each index and the parameters it was built with.

Today I can kind of guess the first one based on the type of the column, but once we add more vector index types this will no longer be possible.

I have no way today of getting the parameters. This can be very useful because users may forget these things and want to examine them (e.g. because they've learned more about vector indices and now they want to know if they need to rebuild their index or not).

wjones127 commented 8 months ago

I thought the parameters were in dataset.stats.index_stats?

(Pdb) from pprint import pprint
(Pdb) pprint(dataset.stats.index_stats(index_name))
{'index_type': 'IVF',
 'indices': [{'centroids': [[0.5594622492790222,
                             ...,
                             0.5300236940383911,
                             0.5513307452201843]],
              'index_type': 'IVF',
              'metric_type': 'l2',
              'num_partitions': 2,
              'partitions': [{'size': 238}, {'size': 274}],
              'sub_index': {'dimension': 32,
                            'index_type': 'PQ',
                            'metric_type': 'l2',
                            'nbits': 8,
                            'num_sub_vectors': 1},
              'uri': '/private/var/folders/09/h28jzzv164n6bn4ldrhhm73m0000gn/T/pytest-of-willjones/pytest-27/test_count_index_rows0/test/_indices/ef525f0b-4c87-42d9-9ace-3e2437b10c71/index.idx',
              'uuid': 'ef525f0b-4c87-42d9-9ace-3e2437b10c71'}],
 'name': 'a_idx',
 'num_indexed_fragments': 1,
 'num_indexed_rows': 512,
 'num_indices': 1,
 'num_unindexed_fragments': 0,
 'num_unindexed_rows': 0}

westonpace commented 8 months ago

> I thought the parameters were in dataset.stats.index_stats?

@wjones127

They are, but the statistics are experimental / unstable. Since index parameters are stable, we should have a stable way of retrieving them.

I'm mainly filing this because I want to be able to load the index config in LanceDB and I'm not sure we want to expose raw stats in LanceDB.

wjones127 commented 8 months ago

I hope we can make them more stable soon. IIRC the main impetus for exposing them is making it so users can retrieve and re-use the IVF centroids.

albertlockett commented 8 months ago

We also have some use cases where we need to check the metric_type. Agreed, would be nice to have a stable way of getting it.
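
Until a stable API exists, one possible stopgap is to pull only the build parameters out of the stats dictionary and ignore everything else. The sketch below is plain Python and does not call lance at all. It assumes the dictionary has the same shape as the pdb output earlier in this thread, which is experimental and may change. `IndexConfig` and `config_from_stats` are hypothetical names made up for illustration, not part of lance or LanceDB.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass(frozen=True)
class IndexConfig:
    """Hypothetical container for index build parameters (not a lance type)."""
    index_type: str
    metric_type: str
    num_partitions: int
    sub_index_type: Optional[str] = None
    num_sub_vectors: Optional[int] = None
    nbits: Optional[int] = None


def config_from_stats(stats: Dict[str, Any]) -> List[IndexConfig]:
    """Extract one IndexConfig per entry under the 'indices' key.

    Assumes the layout shown in the pdb output above; centroids,
    partition sizes, uri, and uuid are deliberately dropped.
    """
    configs = []
    for entry in stats.get("indices", []):
        sub = entry.get("sub_index") or {}
        configs.append(
            IndexConfig(
                index_type=entry["index_type"],
                metric_type=entry["metric_type"],
                num_partitions=entry["num_partitions"],
                sub_index_type=sub.get("index_type"),
                num_sub_vectors=sub.get("num_sub_vectors"),
                nbits=sub.get("nbits"),
            )
        )
    return configs


# Trimmed copy of the stats printed earlier in the thread.
example_stats = {
    "index_type": "IVF",
    "indices": [
        {
            "index_type": "IVF",
            "metric_type": "l2",
            "num_partitions": 2,
            "partitions": [{"size": 238}, {"size": 274}],
            "sub_index": {
                "dimension": 32,
                "index_type": "PQ",
                "metric_type": "l2",
                "nbits": 8,
                "num_sub_vectors": 1,
            },
        }
    ],
    "name": "a_idx",
}

configs = config_from_stats(example_stats)
print(configs[0])
```

With a real dataset, the input would presumably be the result of dataset.stats.index_stats(index_name) as in the pdb session above. Checking `configs[0].metric_type` then covers the metric_type use case without exposing the raw stats.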