Hi @krleonie, I think the most efficient way would be to calculate a distance matrix between all vectors.
You could do this in SQL:
SELECT t.id, t2.id, t.embedding <=> t2.embedding AS distance
FROM items t INNER JOIN items t2 ON t.id > t2.id
WHERE (t.embedding <=> t2.embedding) < 0.001;
but I suspect a faster way is to export the data with COPY ... (FORMAT BINARY)
and do this in-memory with a library that has optimized matrix operations (like NumPy).
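Something along these lines, as an untested sketch of that in-memory route (assuming psycopg 3 with the pgvector Python adapter; the connection string, table and column names, and block size are placeholders, and it loads the vectors with a plain SELECT instead of COPY ... (FORMAT BINARY)):

import numpy as np
import psycopg
from pgvector.psycopg import register_vector

with psycopg.connect("dbname=mydb") as conn:
    register_vector(conn)  # makes vector columns come back as NumPy arrays
    rows = conn.execute("SELECT id, embedding FROM items ORDER BY id").fetchall()

ids = np.array([r[0] for r in rows])
emb = np.stack([np.asarray(r[1], dtype=np.float32) for r in rows])
emb /= np.linalg.norm(emb, axis=1, keepdims=True)  # normalize so dot product = cosine similarity

threshold = 0.001  # cosine distance threshold, same as the SQL above
block = 1_000      # each block needs block x N float32 similarities; tune to available RAM

for start in range(0, len(emb), block):
    dists = 1.0 - emb[start:start + block] @ emb.T  # cosine distances for this block vs. all rows
    for bi, j in zip(*np.where(dists < threshold)):
        i = start + bi
        if ids[i] > ids[j]:                         # mirrors t.id > t2.id: skip self and mirrored pairs
            print(ids[j], ids[i], dists[bi, j])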
Then for new records, you could do an ANN search in Postgres like you are now.
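For the new-record lookup, a minimal sketch with the same assumptions (vector dimension, LIMIT, and names are illustrative):

import numpy as np
import psycopg
from pgvector.psycopg import register_vector

with psycopg.connect("dbname=mydb") as conn:
    register_vector(conn)
    new_vec = np.random.rand(384).astype(np.float32)  # stand-in for a real embedding; must match the column dimension
    neighbors = conn.execute(
        "SELECT id, embedding <=> %s AS distance "
        "FROM items ORDER BY embedding <=> %s LIMIT 5",
        (new_vec, new_vec),
    ).fetchall()
    for item_id, distance in neighbors:
        print(item_id, distance)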
Hello!
We are currently trying to do a duplicate search, i.e. to find the most similar items for each item in the table. We do this by iterating over the table and selecting, for each item, all items whose cosine similarity with the current item is above a certain threshold. To accelerate this, we are using indexes on the vector column. As our table has more than 2 million rows, it still takes a very long time, since we run the similarity search 2 million times for 2 million items.
This is why we wonder whether there is a better solution built into pgvector for finding duplicates in a table (duplicate = items whose similarity to another item is above a threshold)?
Thank you!