jelmerk / hnswlib

Java library for approximate nearest neighbors search using Hierarchical Navigable Small World graphs
Apache License 2.0
255 stars 56 forks source link

Poor results for similarity match #23

Closed anuj-bansal closed 3 years ago

anuj-bansal commented 4 years ago

seeing very poor results with the algorithm and not sure what i am doing wrong. I have generated a test data set with solid color rectangles and then comparing it with a dataset of mixed images and solid color rectangles. my dimensions are 2048 hnsw = HnswSimilarity(identifierCol='id', featuresCol='decodedFeature', distanceFunction='cosine', m=90, ef=5, k=5, efConstruction=200, numPartitions=10, similarityThreshold=0.4, excludeSelf=True, predictionCol='approximate') i have tried it multiple times with different paramters however i am not getting good results. i have tried ef as 200 as well however increasing the ef gives lower accuracy rate than ef closer to k. will appreciate some guidance

jelmerk commented 4 years ago

It's hard to say without seeing the data, but a higher ef should never give lower accuracy

anuj-bansal commented 3 years ago

please see the notebook @ https://1drv.ms/u/s!AomIzr4F_Lm7o6F9XM4enR3XIpgr0w - it is generating a data set of solid colored blocks and gradient blocks then comparing those with 3 images
ef and efconstuction is set as variable Variable: 100 Accuracy: 0.2 Variable: 200 Accuracy: 0.4 Variable: 300 Accuracy: 0.5 Variable: 400 Accuracy: 0.45 Variable: 500 Accuracy: 0.45 Variable: 600 Accuracy: 0.35 Variable: 700 Accuracy: 0.45 Variable: 800 Accuracy: 0.5 Variable: 900 Accuracy: 0.35 Variable: 1000 Accuracy: 0.4

anuj-bansal commented 3 years ago

Can you please look at the code I just attached – not sure if you will look into a closed item – thus sending email as well

From: Jelmer Kuperus notifications@github.com Sent: Saturday, October 10, 2020 1:48 AM To: jelmerk/hnswlib hnswlib@noreply.github.com Cc: anuj-bansal anujb@hotmail.com; Author author@noreply.github.com Subject: Re: [jelmerk/hnswlib] Poor results for similarity match (#23)

Closed #23https://github.com/jelmerk/hnswlib/issues/23.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHubhttps://github.com/jelmerk/hnswlib/issues/23#event-3862539961, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ADVDCQWVJGFL5ZXMY4QFBZLSKANTBANCNFSM4SCH6LIQ.

jelmerk commented 3 years ago

Sorry I don't think i receive notifications for closed issues

I quickly looked at your notebook, but if i execute

from pyspark.sql.functions import col, posexplode

brute_force = BruteForceSimilarity(identifierCol='id', queryIdentifierCol='id', featuresCol='decodedFeature', distanceFunction='cosine',
                       k=5, numPartitions=10, excludeSelf=True, similarityThreshold=-1.0, predictionCol='exact')

model = brute_force.fit(shapes_df)

model.transform(sample_data_df).select(col('id'), posexplode('exact')).show(1000, False)

To see the actual distances. I get something like

+-------+---+------------------------------+
|id     |pos|col                           |
+-------+---+------------------------------+
|6000002|0  |[5000250, 0.8234763394288819] |
|6000002|1  |[5000200, 0.8310418384742924] |
|6000002|2  |[5000150, 0.8319127749920988] |
|6000002|3  |[4000000, 0.832754180609055]  |
|6000002|4  |[5000100, 0.8374621312616415] |
|6000001|0  |[4000240, 0.19548378187002968]|
|6000001|1  |[4000220, 0.22794030261194886]|
|6000001|2  |[4000180, 0.32675403485435484]|
|6000001|3  |[4000200, 0.35895343269006985]|
|6000001|4  |[4000140, 0.3647760514009075] |
|6000000|0  |[4000000, 0.8439331796919047] |
|6000000|1  |[4000020, 0.8503785879823388] |
|6000000|2  |[4000040, 0.8581594365981059] |
|6000000|3  |[4000200, 0.8613001241683248] |
|6000000|4  |[4000060, 0.8642670991707682] |
|6000003|0  |[5000200, 0.8736204199114561] |
|6000003|1  |[5000150, 0.8778530512884285] |
|6000003|2  |[5000250, 0.8786221170517285] |
|6000003|3  |[5000100, 0.8887594120234005] |
|6000003|4  |[4000100, 0.8897079131598035] |
+-------+---+------------------------------+

As you can see the distances are very large. thus your query set is not very similar to anything. This makes it very hard to get good quality approximative results

jelmerk commented 3 years ago

Just to clarify, because this tends to confuse datascientists, the values are cosine distances, not similarities. The similarity is obtained with 1 - cosine distance

anuj-bansal commented 3 years ago

increment = 100*2 brute_force = BruteForceSimilarity(identifierCol='id', queryIdentifierCol='id', featuresCol='decodedFeature', distanceFunction='cosine', k=5, numPartitions=10, excludeSelf=True, similarityThreshold=-1.0, predictionCol='exact')

HNSW

hnsw = HnswSimilarity(identifierCol='id', featuresCol='decodedFeature', distanceFunction='cosine', m=90, ef=increment, k=5, efConstruction=increment, numPartitions=50, similarityThreshold=-1.0, excludeSelf=True, predictionCol='approximate') pipeline = Pipeline(stages=[hnsw, brute_force]) model = pipeline.fit(shapes_df) output = model.transform(sample_data_df) evaluator = KnnSimilarityEvaluator(approximateNeighborsCol='approximate', exactNeighborsCol='exact') accuracy = evaluator.evaluate(output) print("Variable: "+str(increment)+" Accuracy: "+str(accuracy))

Variable: 200 Accuracy: 0.55 Why do I get accuracy so low ? What does accuracy mean – I have seen that increasing k helps increase the accuracy

From: Jelmer Kuperus notifications@github.com Sent: Sunday, October 25, 2020 6:16 AM To: jelmerk/hnswlib hnswlib@noreply.github.com Cc: anuj-bansal anujb@hotmail.com; Author author@noreply.github.com Subject: Re: [jelmerk/hnswlib] Poor results for similarity match (#23)

Sorry I don't think i receive notifications for closed issues

I quickly looked at your notebook, but if i execute

from pyspark.sql.functions import col, posexplode

brute_force = BruteForceSimilarity(identifierCol='id', queryIdentifierCol='id', featuresCol='decodedFeature', distanceFunction='cosine',

                   k=5, numPartitions=10, excludeSelf=True, similarityThreshold=-1.0, predictionCol='exact')

model = brute_force.fit(shapes_df)

model.transform(sample_data_df).select(col('id'), posexplode('exact')).show(1000, False)

To see the actual distances. I get something like

+-------+---+------------------------------+

|id |pos|col |

+-------+---+------------------------------+

|6000002|0 |[5000250, 0.8234763394288819] |

|6000002|1 |[5000200, 0.8310418384742924] |

|6000002|2 |[5000150, 0.8319127749920988] |

|6000002|3 |[4000000, 0.832754180609055] |

|6000002|4 |[5000100, 0.8374621312616415] |

|6000001|0 |[4000240, 0.19548378187002968]|

|6000001|1 |[4000220, 0.22794030261194886]|

|6000001|2 |[4000180, 0.32675403485435484]|

|6000001|3 |[4000200, 0.35895343269006985]|

|6000001|4 |[4000140, 0.3647760514009075] |

|6000000|0 |[4000000, 0.8439331796919047] |

|6000000|1 |[4000020, 0.8503785879823388] |

|6000000|2 |[4000040, 0.8581594365981059] |

|6000000|3 |[4000200, 0.8613001241683248] |

|6000000|4 |[4000060, 0.8642670991707682] |

|6000003|0 |[5000200, 0.8736204199114561] |

|6000003|1 |[5000150, 0.8778530512884285] |

|6000003|2 |[5000250, 0.8786221170517285] |

|6000003|3 |[5000100, 0.8887594120234005] |

|6000003|4 |[4000100, 0.8897079131598035] |

+-------+---+------------------------------+

As you can see the distances are very large. thus your query set is not very similar to anything. This makes it very hard to get good quality approximative results

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHubhttps://github.com/jelmerk/hnswlib/issues/23#issuecomment-716146643, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ADVDCQUNNSVK7OWKC4RPZ33SMQQKLANCNFSM4SCH6LIQ.

jelmerk commented 3 years ago

What does accuracy mean – I have seen that increasing k helps increase the accuracy

Precision at k

Eg what fraction of the approximate k results are part of the actual top k results