Incorrect distance for hnswlib-spark

david-liu commented 3 years ago

Hi, I use hnswlib-spark for ANN search, with the inner-product distance function on my normalized 128-dimensional embeddings. However, when I fit the model on the full set of query items (> 10M), the similarity (1 - inner-product distance) returned by the algorithm is not correct.

cosine_sim	actual_cosine_sim
0.9999999	0.8632334353951592
1.0	0.6545347158171353
0.99999976	0.5148564692935906
0.9999996	0.9999995576617948

However, when I fit only one query item, the similarity is correct, and the model returns totally different top-K items.

My model configuration is as follows:

new HnswSimilarity()
  .setIdentifierCol("item_id")
  .setQueryIdentifierCol("item_id")
  .setFeaturesCol("embeddings")
  .setNumPartitions(150)
  .setNumReplicas(5)
  .setK(30)
  .setEf(128)
  .setSimilarityThreshold(0.90)
  .setDistanceFunction("inner-product")
  .setPredictionCol("approximate")
  .setExcludeSelf(false)
  .setM(64)
  .setEfConstruction(200)

The version is hnswlib-spark_2.3.0_2.11:0.0.46
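
For readers who want to reproduce the actual_cosine_sim column, here is a minimal plain-Scala sketch of the direct computation. It assumes the inner-product distance is defined as 1 minus the dot product, so that for unit-length vectors 1 minus the distance equals the cosine similarity. The object and method names are hypothetical and not part of hnswlib-spark.

```scala
object SimilarityCheck {
  // Dot product of two equal-length vectors, accumulated in double precision.
  def dot(a: Array[Float], b: Array[Float]): Double = {
    require(a.length == b.length, "vectors must have the same dimension")
    a.indices.map(i => a(i).toDouble * b(i).toDouble).sum
  }

  // L2-normalize a vector; a zero vector is returned unchanged.
  def normalize(v: Array[Float]): Array[Float] = {
    val norm = math.sqrt(dot(v, v))
    if (norm == 0.0) v else v.map(x => (x / norm).toFloat)
  }

  // Assumed definition of the inner-product distance: 1 - <a, b>.
  def innerProductDistance(a: Array[Float], b: Array[Float]): Double =
    1.0 - dot(a, b)

  // Similarity as 1 - distance; equals cosine similarity for unit vectors.
  def similarity(a: Array[Float], b: Array[Float]): Double =
    1.0 - innerProductDistance(a, b)
}
```

Comparing this value with the similarity returned by the model for the same pair of item_id values shows whether both sides are really using the same embedding.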

david-liu commented 3 years ago

I also tried bruteForce. The following lines show the number of rows with an incorrect distance out of the total number of rows:

-- bruteForce: 43240/14432131

-- HNSW: 43693/14432131
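
A count like the ones above could be produced roughly as follows. This is only a sketch: the Row case class, the countMismatches helper and the tolerance of 1e-4 are assumptions for illustration, since the thread does not say how the comparison was done. The sample values are the four rows from the first comment.

```scala
object MismatchCount {
  // One result row: the similarity reported by the model and the one recomputed directly.
  final case class Row(reported: Double, actual: Double)

  // Count rows whose reported and recomputed similarity differ by more than the tolerance.
  def countMismatches(rows: Seq[Row], tolerance: Double = 1e-4): Int =
    rows.count(r => math.abs(r.reported - r.actual) > tolerance)

  // The four sample rows from the table in the first comment.
  val sample: Seq[Row] = Seq(
    Row(0.9999999, 0.8632334353951592),
    Row(1.0, 0.6545347158171353),
    Row(0.99999976, 0.5148564692935906),
    Row(0.9999996, 0.9999995576617948)
  )
}
```

With these sample rows and the assumed tolerance, three of the four rows count as mismatches.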

jelmerk commented 3 years ago

Can you demonstrate the problem with a minimal dataset?

And what do the other steps of the pipeline look like?

I am pretty sure this works, as we used it like this in production at my old job.

david-liu commented 3 years ago

Thanks. After debugging the code, we found that the problem is caused by duplicate IdentifierCol values with different embeddings. The model uses one embedding for the ANN search, while we calculate the distance directly with a different embedding for the same item. After removing the duplicated identifiers and keeping only the latest embedding for each item, it works.
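
The fix described above can be sketched in plain Scala collections as follows. The Item case class and its updatedAt field are hypothetical: the thread only says that the latest embedding is kept, not how "latest" is determined. On a Spark DataFrame the same idea would typically be expressed with a group-by or window over the identifier column.

```scala
object Dedup {
  // Hypothetical record; `updatedAt` is an assumed timestamp used to pick the latest embedding.
  // Vector (not Array) is used so that embeddings compare by value.
  final case class Item(itemId: String, embedding: Vector[Float], updatedAt: Long)

  // Identifiers that occur with more than one distinct embedding.
  def conflictingIds(items: Seq[Item]): Set[String] =
    items.groupBy(_.itemId).collect {
      case (id, group) if group.map(_.embedding).distinct.size > 1 => id
    }.toSet

  // Keep one row per identifier: the one with the largest `updatedAt`.
  def keepLatest(items: Seq[Item]): Seq[Item] =
    items.groupBy(_.itemId).values.map(_.maxBy(_.updatedAt)).toSeq
}
```

Running conflictingIds before fitting gives a quick check for this problem, and keepLatest produces an input with exactly one embedding per identifier.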

jelmerk commented 3 years ago

Great, glad to hear. Cheers.

jelmerk / hnswlib #36