Hi @krleonie, I think the most efficient way would be to calculate a distance matrix between all vectors.
You could do this in SQL:
SELECT t.id, t2.id, t.embedding <=> t2.embedding AS distance
FROM items t INNER JOIN items t2 ON t.id > t2.id
WHERE (t.embedding <=> t2.embedding) < 0.001;
but I suspect a faster way is to export the data with COPY ... (FORMAT BINARY)
and do this in-memory with a library that has optimized matrix operations (like NumPy).
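Something along these lines, as an untested sketch of that in-memory route (assuming psycopg 3 with the pgvector Python adapter; the connection string, table and column names, and block size are placeholders, and it loads the vectors with a plain SELECT instead of COPY ... (FORMAT BINARY)):

import numpy as np
import psycopg
from pgvector.psycopg import register_vector

with psycopg.connect("dbname=mydb") as conn:
    register_vector(conn)  # makes vector columns come back as NumPy arrays
    rows = conn.execute("SELECT id, embedding FROM items ORDER BY id").fetchall()

ids = np.array([r[0] for r in rows])
emb = np.stack([np.asarray(r[1], dtype=np.float32) for r in rows])
emb /= np.linalg.norm(emb, axis=1, keepdims=True)  # normalize so dot product = cosine similarity

threshold = 0.001  # cosine distance threshold, same as the SQL above
block = 1_000      # each block needs block x N float32 similarities; tune to available RAM

for start in range(0, len(emb), block):
    dists = 1.0 - emb[start:start + block] @ emb.T  # cosine distances for this block vs. all rows
    for bi, j in zip(*np.where(dists < threshold)):
        i = start + bi
        if ids[i] > ids[j]:                         # mirrors t.id > t2.id: skip self and mirrored pairs
            print(ids[j], ids[i], dists[bi, j])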
Then for new records, you could do an ANN search in Postgres like you are now.
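For the new-record lookup, a minimal sketch with the same assumptions (vector dimension, LIMIT, and names are illustrative):

import numpy as np
import psycopg
from pgvector.psycopg import register_vector

with psycopg.connect("dbname=mydb") as conn:
    register_vector(conn)
    new_vec = np.random.rand(384).astype(np.float32)  # stand-in for a real embedding; must match the column dimension
    neighbors = conn.execute(
        "SELECT id, embedding <=> %s AS distance "
        "FROM items ORDER BY embedding <=> %s LIMIT 5",
        (new_vec, new_vec),
    ).fetchall()
    for item_id, distance in neighbors:
        print(item_id, distance)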
Hello!
We are currently trying to do a duplicate search, i.e. to find the most similar items for each item in the table. We do this by iterating over the table and selecting, for each item, all items whose cosine similarity with the current item is above a certain threshold. To accelerate this, we are using indexes on the vector column. As our table has more than 2 million rows, it still takes a very long time, since we run the similarity search 2 million times for 2 million items.
This is why we wonder whether there is a better solution built into pgvector for finding duplicates in a table (duplicate = items whose similarity to another item is above a threshold)?
Thank you!