Open westonpace opened 8 months ago
I thought the parameters were in dataset.stats.index_stats?
(Pdb) from pprint import pprint
(Pdb) pprint(dataset.stats.index_stats(index_name))
{'index_type': 'IVF',
 'indices': [{'centroids': [[0.5594622492790222,
                             ...,
                             0.5300236940383911,
                             0.5513307452201843]],
              'index_type': 'IVF',
              'metric_type': 'l2',
              'num_partitions': 2,
              'partitions': [{'size': 238}, {'size': 274}],
              'sub_index': {'dimension': 32,
                            'index_type': 'PQ',
                            'metric_type': 'l2',
                            'nbits': 8,
                            'num_sub_vectors': 1},
              'uri': '/private/var/folders/09/h28jzzv164n6bn4ldrhhm73m0000gn/T/pytest-of-willjones/pytest-27/test_count_index_rows0/test/_indices/ef525f0b-4c87-42d9-9ace-3e2437b10c71/index.idx',
              'uuid': 'ef525f0b-4c87-42d9-9ace-3e2437b10c71'}],
 'name': 'a_idx',
 'num_indexed_fragments': 1,
 'num_indexed_rows': 512,
 'num_indices': 1,
 'num_unindexed_fragments': 0,
 'num_unindexed_rows': 0}
@wjones127
They are, but the statistics are experimental and unstable. Since index parameters are stable, we should have a stable way of retrieving them.
I'm mainly filing this because I want to be able to load the index config in LanceDB, and I'm not sure we want to expose raw stats in LanceDB.
I hope we can make them more stable soon. IIRC the main impetus for exposing them is making it so users can retrieve and re-use the IVF centroids.
We also have some use cases where we need to check the metric_type. Agreed, would be nice to have a stable way of getting it.
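As a stopgap, the parameters can be picked out of the stats dictionary shown earlier in this thread. The following is only a minimal sketch: `extract_index_params` is a hypothetical helper, not part of Lance, and it assumes the dictionary keeps the shape of the pprint output above, which the thread itself describes as experimental and unstable. The sample dictionary is a trimmed copy of that output with the centroids, uri, and uuid left out.

```python
from typing import Any, Dict, List


def extract_index_params(stats: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Collect the training parameters from an index_stats-style dict.

    Hypothetical helper: it relies on the (unstable) layout shown in the
    pprint output above, with one entry per item under 'indices'.
    """
    params = []
    for entry in stats.get("indices", []):
        sub = entry.get("sub_index", {})
        params.append(
            {
                "index_type": entry.get("index_type"),
                "metric_type": entry.get("metric_type"),
                "num_partitions": entry.get("num_partitions"),
                "sub_index_type": sub.get("index_type"),
                "nbits": sub.get("nbits"),
                "num_sub_vectors": sub.get("num_sub_vectors"),
            }
        )
    return params


# Trimmed copy of the stats output above (centroids, uri, uuid omitted).
sample_stats = {
    "index_type": "IVF",
    "indices": [
        {
            "index_type": "IVF",
            "metric_type": "l2",
            "num_partitions": 2,
            "partitions": [{"size": 238}, {"size": 274}],
            "sub_index": {
                "dimension": 32,
                "index_type": "PQ",
                "metric_type": "l2",
                "nbits": 8,
                "num_sub_vectors": 1,
            },
        }
    ],
    "name": "a_idx",
}

index_params = extract_index_params(sample_stats)
```

Because this reads a best-effort view of an unstable structure, missing keys come back as `None` instead of raising, so a caller can tell "not reported" apart from a real value.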
Ideally the output from load_indices would let me know both the type of the index and the parameters it was built with. Today I can kind of guess the first one based on the type of column, but once we add more vector index types this will no longer be possible.
I have no way today of getting the parameters. This can be very useful because users may forget these things and want to examine them (e.g. because they've learned more about vector indices and now they want to know if they need to rebuild their index or not).
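To illustrate the rebuild question, here is a small sketch of the comparison a user might run once the parameters are retrievable. Everything in it is hypothetical: `changed_params` is not a Lance or LanceDB function, and the parameter names simply mirror the keys from the stats output earlier in the thread.

```python
from typing import Any, Dict, Tuple


def changed_params(
    current: Dict[str, Any], desired: Dict[str, Any]
) -> Dict[str, Tuple[Any, Any]]:
    """Return {name: (current_value, desired_value)} for each desired
    parameter whose stored value differs. Hypothetical helper."""
    return {
        name: (current.get(name), value)
        for name, value in desired.items()
        if current.get(name) != value
    }


# Parameters as they might be read back from an existing index.
stored = {"metric_type": "l2", "num_partitions": 2, "nbits": 8, "num_sub_vectors": 1}
# Parameters the user would now like to use.
wanted = {"metric_type": "cosine", "num_partitions": 2, "num_sub_vectors": 4}

diff = changed_params(stored, wanted)
```

An empty result would suggest the existing index already matches the desired settings. Whether a given difference actually requires a rebuild is a separate decision that this sketch does not try to make.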