Closed david-liu closed 3 years ago
I also tried bruteForce. The following lines show the incorrect rows / total rows for the distance:
-- bruteForce: 43240/14432131
-- HNSW: 43693/14432131
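For anyone wanting to reproduce this kind of count, here is a minimal sketch of how the check could be done, assuming the distance is 1 - inner-product on normalized vectors as described in this thread. The row layout `(query_id, neighbor_id, reported_distance)`, the function names, and the tolerance are hypothetical and not part of hnswlib-spark; it is plain Python, so it would only be practical on a small sample collected from the result set.

```python
import math


def inner_product_distance(u, v):
    # Distance as described in this thread: 1 - inner product.
    # Assumes both vectors are already normalized.
    return 1.0 - sum(a * b for a, b in zip(u, v))


def count_incorrect_rows(rows, embeddings, tol=1e-4):
    # rows: iterable of (query_id, neighbor_id, reported_distance) -- hypothetical layout
    # embeddings: dict mapping id -> list of floats
    incorrect = 0
    total = 0
    for query_id, neighbor_id, reported in rows:
        expected = inner_product_distance(embeddings[query_id], embeddings[neighbor_id])
        total += 1
        # A row counts as incorrect when the reported distance differs
        # from the directly computed one by more than the tolerance.
        if not math.isclose(reported, expected, abs_tol=tol):
            incorrect += 1
    return incorrect, total
```

The result is an `(incorrect, total)` pair in the same shape as the counts above.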
Can you demonstrate the problem with a minimal dataset?
And what do the other steps of the pipeline look like?
I am pretty sure this works, as we used it in production like this at my old job.
Thanks. After debugging the code, we found it is caused by duplicate IdentifierCol values with different embeddings. That is, the model uses one embedding for the ANN search, which is different from the embedding we use to calculate the distance directly. After we removed the duplicated IdentifierCol values and kept only the latest embedding for each item, it works.
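A minimal sketch of that deduplication step, for anyone hitting the same problem. It assumes each record carries some ordering field to decide which embedding is the latest; the `(item_id, timestamp, embedding)` layout and the function name are hypothetical, and this is plain Python rather than the Spark job actually used here.

```python
def keep_latest_embedding(records):
    # records: iterable of (item_id, timestamp, embedding) -- hypothetical layout
    # Returns a dict mapping each item_id to the embedding with the largest timestamp.
    latest = {}
    for item_id, timestamp, embedding in records:
        current = latest.get(item_id)
        if current is None or timestamp > current[0]:
            latest[item_id] = (timestamp, embedding)
    # Drop the timestamps so every identifier maps to exactly one embedding.
    return {item_id: emb for item_id, (_, emb) in latest.items()}
```

With one embedding per identifier, the vector indexed for the ANN search and the vector used for the direct distance calculation are the same.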
Great, glad to hear it. Cheers.
Hi, I use hnswlib-spark for ANN search, with inner-product as the distance for my normalized 128-dim embeddings. But I found that if I fit the full-size set of query items (> 10M) to the model, the similarity (1 - inner-product) returned by the algorithm is not correct. When I fit only one query item, the similarity is right, and the model returns totally different top-K items.
The following code is my model configuration
The version is hnswlib-spark_2.3.0_2.11:0.0.46