timescale / pgvectorscale

A complement to pgvector for high performance, cost efficient vector search on large workloads.
PostgreSQL License

Support for matryoshka indexing #131

Open npip99 opened 2 months ago

npip99 commented 2 months ago
CREATE INDEX ix_chunk_embedding
ON chunk USING diskann (embedding) WITH (num_dimensions=1999);
NOTICE:  Starting index build. num_neighbors=-1 search_list_size=100, max_alpha=1.2, storage_layout=SbqCompression
ERROR:  assertion failed: dimensions > 0 && dimensions < 2000

The error above is a bit of a shame.

If my column is a Vector(3072), it would be nice to support matryoshka embeddings by allowing the index dimension to be < 2000, even when the source vector has a larger dimension. I believe the SQL above should execute successfully, since I'm only indexing a subvector of the original vector.

For now, I have a generated column that is computed from my desired subvector, but this takes physical space on disk, when ideally it would be computed on the fly. It also means that I have to rerank manually by the full vector, rather than having the index handle it automatically (not a big deal).
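
For anyone hitting the same limit, here is a minimal sketch of that workaround. It assumes pgvector 0.7.0 or later (for the subvector function) alongside pgvectorscale. The table and source column names come from the snippet above; the 1536-dimension prefix, the embedding_short column, the id column, and the candidate count of 100 are hypothetical choices for illustration, not anything pgvectorscale prescribes.

```sql
-- Assumption: chunk.embedding is a vector(3072) column, as in the issue.
-- embedding_short and the 1536-dimension prefix are hypothetical choices.
ALTER TABLE chunk
    ADD COLUMN embedding_short vector(1536)
    GENERATED ALWAYS AS (subvector(embedding, 1, 1536)::vector(1536)) STORED;

-- Index the shortened column, which fits under the 2000-dimension limit.
CREATE INDEX ix_chunk_embedding_short
    ON chunk USING diskann (embedding_short);

-- Two-stage query: fetch candidates through the index on the prefix,
-- then rerank them manually by distance on the full vector.
-- $1 is the full-length query embedding; id is a hypothetical key column.
SELECT id
FROM (
    SELECT id, embedding
    FROM chunk
    ORDER BY embedding_short <=> subvector($1::vector, 1, 1536)::vector(1536)
    LIMIT 100
) AS candidates
ORDER BY embedding <=> $1::vector
LIMIT 10;
```

The stored generated column is what costs the extra disk space, and the outer ORDER BY is the manual rerank step.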

If it could support e.g. this notation, then the num_dimensions attribute wouldn't be necessary anymore, and that would solve both problems (though supporting that notation might be overkill, I'm not sure).

cevian commented 2 months ago

Oh yeah, this seems to be something we overlooked.