pgvector / pgvector-python

pgvector support for Python

Duplicate Search #85

Closed krleonie closed 4 months ago

krleonie commented 4 months ago

Hello!

We are currently trying to do a duplicate search, i.e., to find the most similar items for each item in the table. We do this by iterating over the table and selecting, for each item, all items whose cosine similarity with it is above a certain threshold (roughly like the sketch below). To accelerate this, we use an index on the vector column. But since our table has more than 2 million rows, it still takes a very long time: we run the similarity search 2 million times for 2 million items.
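A rough sketch of that loop (not exact code; the table and column names, the LIMIT, and the 0.001 threshold are placeholders), using psycopg 3 with pgvector-python:

# Per-item duplicate search: one ANN query per row.
# Assumes a table items(id, embedding) with an index on embedding.
import psycopg
from pgvector.psycopg import register_vector

conn = psycopg.connect(dbname="example")
register_vector(conn)  # register the vector type with psycopg

with conn.cursor(name="scan") as items:  # server-side cursor to stream the table
    items.execute("SELECT id, embedding FROM items")
    for item_id, embedding in items:
        # the index accelerates ORDER BY, so fetch nearest neighbors
        # and apply the distance threshold on the client
        rows = conn.execute(
            "SELECT id, embedding <=> %(e)s AS distance FROM items "
            "WHERE id != %(id)s ORDER BY embedding <=> %(e)s LIMIT 10",
            {"e": embedding, "id": item_id},
        ).fetchall()
        duplicates = [i for i, distance in rows if distance < 0.001]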

This is why we are wondering whether pgvector has a better built-in way to find duplicates in a table (duplicate = items whose similarity is above a threshold)?

Thank you!

ankane commented 4 months ago

Hi @krleonie, I think the most efficient way would be to calculate a distance matrix between all vectors.

You could do this in SQL:

SELECT t.id, t2.id, t.embedding <=> t2.embedding AS distance
    FROM items t INNER JOIN items t2 ON t.id > t2.id  -- each pair once, no self-pairs
    WHERE (t.embedding <=> t2.embedding) < 0.001;     -- cosine distance threshold

but I suspect a faster way is to export the data with COPY ... (FORMAT BINARY) and do this in-memory with a library that has optimized matrix operations (like NumPy).
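One way to sketch that (assuming psycopg 3, register_vector from pgvector-python, and an items table with a bigint id; the chunk size and threshold are placeholders to tune):

# Export all vectors with binary COPY, then compute cosine similarities
# in memory with NumPy, chunking rows to bound memory use.
import numpy as np
import psycopg
from pgvector.psycopg import register_vector

conn = psycopg.connect(dbname="example")
register_vector(conn)

ids, vectors = [], []
with conn.cursor() as cur:
    with cur.copy("COPY items (id, embedding) TO STDOUT (FORMAT BINARY)") as copy:
        copy.set_types(["int8", "vector"])
        for item_id, embedding in copy.rows():
            ids.append(item_id)
            vectors.append(embedding)

ids = np.array(ids)
V = np.array(vectors, dtype=np.float32)
V /= np.linalg.norm(V, axis=1, keepdims=True)  # unit vectors, so V @ V.T is cosine similarity

threshold = 0.999  # similarity > 0.999, i.e. cosine distance < 0.001
chunk = 1000  # each chunk materializes chunk * len(V) similarities; tune for RAM
for start in range(0, len(V), chunk):
    sims = V[start:start + chunk] @ V.T
    for r, c in zip(*np.nonzero(sims > threshold)):
        if ids[start + r] > ids[c]:  # report each pair once, like the SQL join above
            print(ids[start + r], ids[c], float(1 - sims[r, c]))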

Then for new records, you could do an ANN search in Postgres like you are now.
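That query could look something like this (the LIMIT is a placeholder):

SELECT id, embedding <=> $1 AS distance
    FROM items
    ORDER BY embedding <=> $1
    LIMIT 5;

flagging the new record as a duplicate of any result whose distance is below your threshold.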