mertalev opened 5 days ago

Hi! I noticed this project and that it's positioned as a successor to pgvecto.rs. I'm interested to learn more about how it compares in terms of performance and recall.

I also see that it only supports IVF. Doesn't IVF normally offer a worse recall/speed tradeoff than HNSW?

Lastly, what does this mean for pgvecto.rs? Will it continue to be supported, or are users expected to migrate to vchord?
Thank you for pointing out the issue. Our README doesn't provide detailed explanations, and we will work on improving it. Regarding IVF and HNSW, I believe a comparison should also be included in the README. What do you think, @VoVAllen?
We open-sourced in a hurry and will write more blog posts about implementation details and performance comparisons in the coming weeks.
Why IVF can be faster than HNSW is a bit of a long story, but I'll cover the essentials here.
Most of the time spent by vector search algorithms is on distance computation. To improve speed, we need to minimize distance comparisons as much as possible. The original IVF performs poorly in this regard, typically requiring a scan of 1% to 5% of the total vectors, which is significantly more than HNSW.
Quantization introduces a new approach: you can compress each 32-bit float dimension of a vector into just 1 bit (as in RaBitQ) or even less (as with PQ in Faiss). While this compression loses some precision, it greatly reduces the computational cost. With the fastscan optimization, distance estimation can be over 32 times faster than traditional full-precision distance calculation. We can then rerank extra candidates to recover a good recall rate; full-precision distance computation is only needed during this reranking phase. This is why IVF can be faster than HNSW.
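A minimal NumPy sketch of this quantize-then-rerank idea (my illustration, not VectorChord's actual RaBitQ code, which additionally packs codes into fastscan blocks): keep one sign bit per dimension, rank all vectors by cheap Hamming distance, then rerank a small oversampled candidate set at full precision.

```python
import numpy as np

rng = np.random.default_rng(0)
base = rng.standard_normal((10_000, 768)).astype(np.float32)  # full-precision vectors
query = rng.standard_normal(768).astype(np.float32)

# 1-bit quantization: keep only the sign of each dimension (1/32 of the memory).
codes = base > 0
qcode = query > 0

# Cheap approximate ranking: Hamming distance between sign codes.
approx = np.count_nonzero(codes != qcode, axis=1)

# Oversample candidates under the cheap metric, then rerank at full precision.
k, oversample = 10, 20
cand = np.argpartition(approx, k * oversample)[: k * oversample]
exact = np.linalg.norm(base[cand] - query, axis=1)
topk = cand[np.argsort(exact)[:k]]
```

Only the `k * oversample` candidates ever touch the full-precision vectors; everything else is ranked from the compact codes.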
The same technique can also be applied to HNSW. We haven't tested it, but I believe it wouldn't be significantly faster than IVF. Fastscan requires packing 32 compressed vectors together so they can be scanned in a batch, which isn't feasible with HNSW's traversal order. The main time cost would then lie in the reranking process, and since the quantized representations in IVF and HNSW are similar, the rerank cost would be comparable. So the final performance difference is likely to be minimal.
As an index structure, IVF is also much simpler and more efficient than HNSW, particularly for insertion and deletion. In HNSW, these operations require extensive computation and cause write amplification due to cascading modifications throughout the graph. In contrast, IVF only has to insert into or delete from a posting list.
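A toy sketch of why that maintenance is cheap (illustrative only, not VectorChord's implementation): each vector belongs to exactly one posting list, so inserting is one centroid assignment plus a list append, and deleting touches a single list.

```python
import numpy as np

class ToyIVF:
    """Toy IVF index: one posting list per centroid, no quantization."""

    def __init__(self, centroids):
        self.centroids = np.asarray(centroids)      # (nlist, dim)
        self.lists = [[] for _ in self.centroids]   # posting lists of (id, vec)

    def _assign(self, vec):
        # The nearest centroid decides which posting list owns the vector.
        return int(np.argmin(np.linalg.norm(self.centroids - vec, axis=1)))

    def insert(self, vid, vec):
        # nlist distance computations plus one append; no graph rewiring.
        self.lists[self._assign(vec)].append((vid, vec))

    def delete(self, vid, vec):
        # Touches exactly one posting list; every other entry is left alone.
        lst = self.lists[self._assign(vec)]
        lst[:] = [(i, v) for i, v in lst if i != vid]
```

Compare this with HNSW, where a delete must repair the neighbor lists of every node that points at the removed vector.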
Here are some early benchmark results on the GIST 960-dim dataset with 1M vectors. The red curve is pgvector HNSW and the blue one is VectorChord. We'll post more about this later.
Elasticsearch has written two good blog posts about how RaBitQ works: link link
We plan to suggest users migrate to vchord at the end of this year, once it's more mature. The vchord implementation is more compatible with pgvector, which should reduce the issues with CNPG and with upgrades. It's fully compatible with pgvector (it's actually based on pgvector for the vector types), so I think Immich users can more easily choose whichever vendor they like.
That sounds very promising! An important aspect for Immich is the filtering capability. Is vchord the same in this regard as pgvecto.rs?
@mertalev Yes. VBASE also works on vectorchord out of the box.
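For context, VBASE-style filtering roughly means the index scan can keep streaming nearest candidates while ordinary SQL predicates discard rows, so selective filters don't collapse recall. Here is a hedged psycopg sketch of such a query; the schema (`assets`, `asset_embeddings`, `owner_id`) is hypothetical, for illustration only:

```python
import numpy as np
import psycopg
from pgvector.psycopg import register_vector

query_embedding = np.random.rand(512).astype(np.float32)

with psycopg.connect("dbname=immich") as conn:
    register_vector(conn)  # lets psycopg pass numpy arrays as pgvector values
    rows = conn.execute(
        """
        SELECT a.id
        FROM assets a
        JOIN asset_embeddings s ON s.asset_id = a.id
        WHERE a.owner_id = %s           -- arbitrary relational filter
        ORDER BY s.embedding <-> %s     -- ANN ordering served by the vector index
        LIMIT 10
        """,
        ("some-user-id", query_embedding),
    ).fetchall()
```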
> As an index structure, IVF is also much simpler and more efficient than HNSW, particularly for insertion and deletion. In HNSW, these operations require extensive computation and cause write amplification due to cascading modifications throughout the graph. In contrast, IVF only has to insert into or delete from a posting list.
Based on this, I have another question, more generally about IVF vs HNSW:
A pain point with the current HNSW indices is that duplicate embeddings in the index make recall drop considerably. This can happen when a row is updated and a new tuple with the same embedding is inserted into the index. It can also happen when someone increases the face detection threshold (removing face embeddings), then changes it back, adding the same face embeddings again.
I've restructured the data model and now try to do a diff comparison before DB updates to minimize duplicate entries, but I'm wondering if this is less of an issue with an IVF index, since updates and deletes are less invasive.
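(For illustration, a minimal version of that diff-before-update idea; the table and helper names are hypothetical, not Immich's actual code. The point is to skip the write entirely when the embedding is unchanged, so no duplicate index entry is created.)

```python
import numpy as np
import psycopg
from pgvector.psycopg import register_vector

def upsert_embedding(conn, asset_id, new_emb):
    """Write the embedding only if it actually changed (hypothetical helper)."""
    row = conn.execute(
        "SELECT embedding FROM asset_embeddings WHERE asset_id = %s", (asset_id,)
    ).fetchone()
    if row is not None and np.array_equal(np.asarray(row[0]), new_emb):
        return  # unchanged: no dead tuple and no duplicate index entry
    conn.execute(
        """
        INSERT INTO asset_embeddings (asset_id, embedding) VALUES (%s, %s)
        ON CONFLICT (asset_id) DO UPDATE SET embedding = EXCLUDED.embedding
        """,
        (asset_id, new_emb),
    )

with psycopg.connect("dbname=immich") as conn:
    register_vector(conn)  # send/receive numpy arrays as pgvector values
    upsert_embedding(conn, 1, np.random.rand(512).astype(np.float32))
```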
Restructuring may not help, due to MVCC in Postgres. If you store the embedding together with other columns, then even if you never update the embedding and only update the other columns, MVCC will still insert the same embedding into the index again, because every UPDATE writes a whole new row version. So you might need to move the embedding out into its own table and use a join for that scenario.
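A sketch of that split layout, with hypothetical names; metadata updates rewrite a tuple in `assets` only, so the vector index on `asset_embeddings` never sees a new entry:

```python
import psycopg

with psycopg.connect("dbname=immich") as conn:
    conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
    # Frequently-updated metadata lives apart from the embedding, so MVCC row
    # rewrites of `assets` never create new tuples in the vector index.
    conn.execute("CREATE TABLE assets (id bigint PRIMARY KEY, caption text)")
    conn.execute(
        """
        CREATE TABLE asset_embeddings (
            asset_id  bigint PRIMARY KEY REFERENCES assets (id) ON DELETE CASCADE,
            embedding vector(512)
        )
        """
    )
    # This UPDATE writes a new row version in `assets` only; the row in
    # `asset_embeddings` (and its index entry) is untouched.
    conn.execute("UPDATE assets SET caption = %s WHERE id = %s", ("sunset", 1))
```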
And for your question: I don't think IVF will have the problem you describe. Duplicate vectors just mean that multiple identical entries get retrieved at the quantization stage, which might cause some extra reranking, but it won't have any effect on recall.
And for the typical use case of Immich (fewer than 100k vectors), I think it also works with quantization only (set nlist to 1). For 768-dim vectors, the quantized codes take only 100000/32/2 = 1562 pages ≈ 12 MB (32 one-bit-per-dimension codes are packed per fastscan block, and two such blocks fit in one 8 KB Postgres page). The memory requirements will also be lower.
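Spelling out that arithmetic (my reading of the numbers: 1 bit per dimension, 32 codes packed per fastscan block, two blocks per 8 KB page):

```python
n_vectors, dim = 100_000, 768
page_bytes = 8192                              # Postgres page size

code_bytes = dim // 8                          # 1 bit/dim: 768 bits -> 96 bytes
block_bytes = 32 * code_bytes                  # fastscan packs 32 codes -> 3072 bytes
blocks_per_page = page_bytes // block_bytes    # 2 blocks fit in one page

pages = n_vectors // 32 // blocks_per_page     # 1562, i.e. 100000/32/2
print(pages, round(pages * page_bytes / 1e6, 1), "MB")  # 1562, ~12.8 MB
```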
> So you might need to move the embedding out into its own table and use a join for that scenario.
Yes, I changed it so that embeddings go into their own separate table.
> And for your question: I don't think IVF will have the problem you describe. Duplicate vectors just mean that multiple identical entries get retrieved at the quantization stage, which might cause some extra reranking, but it won't have any effect on recall.
Thanks for clarifying! That's good to hear.
> And for the typical use case of Immich (fewer than 100k vectors), I think it also works with quantization only (set nlist to 1). For 768-dim vectors, the quantized codes take only 100000/32/2 = 1562 pages ≈ 12 MB. The memory requirements will also be lower.
Interesting idea. I think a challenge there is that we don't know beforehand how many assets a user will have. We could possibly start with this and occasionally check whether the number of tuples in the index is getting too high, then reindex with different settings if it is?
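(A rough sketch of that periodic check, with hypothetical names and threshold. Since nlist is fixed at build time, the rebuild step would drop and re-create the index rather than run a plain REINDEX; I'm deliberately not reproducing the exact VectorChord index DDL here.)

```python
import psycopg

REBUILD_THRESHOLD = 1_000_000  # hypothetical cutoff for outgrowing nlist=1

with psycopg.connect("dbname=immich") as conn:
    # The planner's row estimate for the table; much cheaper than COUNT(*).
    (est,) = conn.execute(
        "SELECT reltuples::bigint FROM pg_class WHERE relname = %s",
        ("asset_embeddings",),
    ).fetchone()
    if est > REBUILD_THRESHOLD:
        conn.execute("DROP INDEX IF EXISTS asset_embeddings_idx")
        # Re-create the index with a larger nlist here; see the VectorChord
        # README for the exact CREATE INDEX options.
```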
This is a potential limitation of the vchordrq index: it currently doesn't allow changing nlist dynamically as the user's data grows. However, for Immich's workload, I believe nlist=1 can accommodate up to 1M vectors. It may be slightly slower, but I think latency will still be under 100 ms. We can run a simple experiment with the 1M 960-dim GIST data as a performance reference. What's the typical vector dimension in Immich?
The dimension is typically 512: the default CLIP model is 512, and the facial recognition model is always 512. I believe the CLIP model with the highest dimension we support is 1152, but most people don't change the model.