timescale / pgvectorscale

A complement to pgvector for high performance, cost efficient vector search on large workloads.
PostgreSQL License

Post-filtering performance #87

Open hlinnaka opened 3 weeks ago

hlinnaka commented 3 weeks ago

How does the post-filtering perform compared to https://github.com/pgvector/pgvector/pull/282 and https://github.com/pgvector/pgvector/pull/524? Recall? Speed?

cevian commented 3 weeks ago

Unlike the current HNSW implementation, StreamingDiskANN has no recall degradation with post-filtering (that's actually the "Streaming" part of the algorithm). You can read more here: https://www.timescale.com/blog/how-we-made-postgresql-as-fast-as-pinecone-for-vector-data/ (section "Support for streaming retrieval for accurate metadata filtering").

The same streaming method is used with and without filters, so there is no performance degradation per se, although more selective queries obviously require traversing more of the graph.
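
To make the distinction concrete, here is a toy sketch in Python. It is not the pgvectorscale implementation: a sorted list stands in for the graph traversal, and all names (`stream_by_distance`, `streaming_post_filter`, `fixed_candidates_post_filter`, the `candidates` cap) are hypothetical. It only illustrates the idea that a streaming scan keeps producing the next-nearest row until the filter has accepted `k` rows, while a scan capped at a fixed number of candidates can come up short.

```python
from itertools import islice
from typing import Callable, Iterator, List, Tuple

Row = Tuple[int, float, str]  # (id, distance to query, category); toy schema


def stream_by_distance(rows: List[Row]) -> Iterator[Row]:
    """Yield rows nearest-first.

    Stand-in for an index scan that can produce the next-closest
    candidate on demand. A real index would walk a graph instead of sorting.
    """
    yield from sorted(rows, key=lambda r: r[1])


def streaming_post_filter(
    rows: List[Row], predicate: Callable[[Row], bool], k: int
) -> Tuple[List[Row], int]:
    """Pull candidates until k rows pass the filter.

    Returns the matches and how many candidates were examined.
    """
    results: List[Row] = []
    scanned = 0
    for row in stream_by_distance(rows):
        if len(results) == k:
            break
        scanned += 1
        if predicate(row):
            results.append(row)
    return results, scanned


def fixed_candidates_post_filter(
    rows: List[Row], predicate: Callable[[Row], bool], k: int, candidates: int
) -> List[Row]:
    """Fetch a fixed number of nearest candidates, then filter.

    This can return fewer than k rows when the filter is selective.
    """
    nearest = list(islice(stream_by_distance(rows), candidates))
    return [r for r in nearest if predicate(r)][:k]


# Toy data: 100 rows, distance equals id, every tenth row is "red".
rows = [(i, float(i), "red" if i % 10 == 0 else "blue") for i in range(100)]


def is_red(row: Row) -> bool:
    return row[2] == "red"


streamed, scanned = streaming_post_filter(rows, is_red, k=5)
capped = fixed_candidates_post_filter(rows, is_red, k=5, candidates=40)

print([r[0] for r in streamed], scanned)  # [0, 10, 20, 30, 40] 41
print([r[0] for r in capped])             # [0, 10, 20, 30]
```

With a filter that matches one row in ten, the streaming version examines 41 candidates to return all 5 matches, which is the "more of the graph needs to be traversed" cost; the capped version stops at 40 candidates and returns only 4.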

Honestly, I'm not sure how translatable the streaming approach is to HNSW, but I'm skeptical that it's easy because of the complications introduced by the multi-level graph structure.

hlinnaka commented 3 weeks ago

(Edited the link in the original question to point to the correct PR.)