Closed anuj-bansal closed 4 years ago
It's hard to say without seeing the data, but a higher ef should never give lower accuracy
please see the notebook @ https://1drv.ms/u/s!AomIzr4F_Lm7o6F9XM4enR3XIpgr0w - it is generating a data set of solid colored blocks and gradient blocks then comparing those with 3 images
ef and efconstuction is set as variable
Variable: 100 Accuracy: 0.2
Variable: 200 Accuracy: 0.4
Variable: 300 Accuracy: 0.5
Variable: 400 Accuracy: 0.45
Variable: 500 Accuracy: 0.45
Variable: 600 Accuracy: 0.35
Variable: 700 Accuracy: 0.45
Variable: 800 Accuracy: 0.5
Variable: 900 Accuracy: 0.35
Variable: 1000 Accuracy: 0.4
Can you please look at the code I just attached – not sure if you will look into a closed item – thus sending email as well
From: Jelmer Kuperus notifications@github.com Sent: Saturday, October 10, 2020 1:48 AM To: jelmerk/hnswlib hnswlib@noreply.github.com Cc: anuj-bansal anujb@hotmail.com; Author author@noreply.github.com Subject: Re: [jelmerk/hnswlib] Poor results for similarity match (#23)
Closed #23https://github.com/jelmerk/hnswlib/issues/23.
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHubhttps://github.com/jelmerk/hnswlib/issues/23#event-3862539961, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ADVDCQWVJGFL5ZXMY4QFBZLSKANTBANCNFSM4SCH6LIQ.
Sorry I don't think i receive notifications for closed issues
I quickly looked at your notebook, but if i execute
from pyspark.sql.functions import col, posexplode
brute_force = BruteForceSimilarity(identifierCol='id', queryIdentifierCol='id', featuresCol='decodedFeature', distanceFunction='cosine',
k=5, numPartitions=10, excludeSelf=True, similarityThreshold=-1.0, predictionCol='exact')
model = brute_force.fit(shapes_df)
model.transform(sample_data_df).select(col('id'), posexplode('exact')).show(1000, False)
To see the actual distances. I get something like
+-------+---+------------------------------+
|id |pos|col |
+-------+---+------------------------------+
|6000002|0 |[5000250, 0.8234763394288819] |
|6000002|1 |[5000200, 0.8310418384742924] |
|6000002|2 |[5000150, 0.8319127749920988] |
|6000002|3 |[4000000, 0.832754180609055] |
|6000002|4 |[5000100, 0.8374621312616415] |
|6000001|0 |[4000240, 0.19548378187002968]|
|6000001|1 |[4000220, 0.22794030261194886]|
|6000001|2 |[4000180, 0.32675403485435484]|
|6000001|3 |[4000200, 0.35895343269006985]|
|6000001|4 |[4000140, 0.3647760514009075] |
|6000000|0 |[4000000, 0.8439331796919047] |
|6000000|1 |[4000020, 0.8503785879823388] |
|6000000|2 |[4000040, 0.8581594365981059] |
|6000000|3 |[4000200, 0.8613001241683248] |
|6000000|4 |[4000060, 0.8642670991707682] |
|6000003|0 |[5000200, 0.8736204199114561] |
|6000003|1 |[5000150, 0.8778530512884285] |
|6000003|2 |[5000250, 0.8786221170517285] |
|6000003|3 |[5000100, 0.8887594120234005] |
|6000003|4 |[4000100, 0.8897079131598035] |
+-------+---+------------------------------+
As you can see the distances are very large. thus your query set is not very similar to anything. This makes it very hard to get good quality approximative results
Just to clarify, because this tends to confuse datascientists, the values are cosine distances, not similarities. The similarity is obtained with 1 - cosine distance
increment = 100*2 brute_force = BruteForceSimilarity(identifierCol='id', queryIdentifierCol='id', featuresCol='decodedFeature', distanceFunction='cosine', k=5, numPartitions=10, excludeSelf=True, similarityThreshold=-1.0, predictionCol='exact')
hnsw = HnswSimilarity(identifierCol='id', featuresCol='decodedFeature', distanceFunction='cosine', m=90, ef=increment, k=5, efConstruction=increment, numPartitions=50, similarityThreshold=-1.0, excludeSelf=True, predictionCol='approximate') pipeline = Pipeline(stages=[hnsw, brute_force]) model = pipeline.fit(shapes_df) output = model.transform(sample_data_df) evaluator = KnnSimilarityEvaluator(approximateNeighborsCol='approximate', exactNeighborsCol='exact') accuracy = evaluator.evaluate(output) print("Variable: "+str(increment)+" Accuracy: "+str(accuracy))
Variable: 200 Accuracy: 0.55 Why do I get accuracy so low ? What does accuracy mean – I have seen that increasing k helps increase the accuracy
From: Jelmer Kuperus notifications@github.com Sent: Sunday, October 25, 2020 6:16 AM To: jelmerk/hnswlib hnswlib@noreply.github.com Cc: anuj-bansal anujb@hotmail.com; Author author@noreply.github.com Subject: Re: [jelmerk/hnswlib] Poor results for similarity match (#23)
Sorry I don't think i receive notifications for closed issues
I quickly looked at your notebook, but if i execute
from pyspark.sql.functions import col, posexplode
brute_force = BruteForceSimilarity(identifierCol='id', queryIdentifierCol='id', featuresCol='decodedFeature', distanceFunction='cosine',
k=5, numPartitions=10, excludeSelf=True, similarityThreshold=-1.0, predictionCol='exact')
model = brute_force.fit(shapes_df)
model.transform(sample_data_df).select(col('id'), posexplode('exact')).show(1000, False)
To see the actual distances. I get something like
+-------+---+------------------------------+
|id |pos|col |
+-------+---+------------------------------+
|6000002|0 |[5000250, 0.8234763394288819] |
|6000002|1 |[5000200, 0.8310418384742924] |
|6000002|2 |[5000150, 0.8319127749920988] |
|6000002|3 |[4000000, 0.832754180609055] |
|6000002|4 |[5000100, 0.8374621312616415] |
|6000001|0 |[4000240, 0.19548378187002968]|
|6000001|1 |[4000220, 0.22794030261194886]|
|6000001|2 |[4000180, 0.32675403485435484]|
|6000001|3 |[4000200, 0.35895343269006985]|
|6000001|4 |[4000140, 0.3647760514009075] |
|6000000|0 |[4000000, 0.8439331796919047] |
|6000000|1 |[4000020, 0.8503785879823388] |
|6000000|2 |[4000040, 0.8581594365981059] |
|6000000|3 |[4000200, 0.8613001241683248] |
|6000000|4 |[4000060, 0.8642670991707682] |
|6000003|0 |[5000200, 0.8736204199114561] |
|6000003|1 |[5000150, 0.8778530512884285] |
|6000003|2 |[5000250, 0.8786221170517285] |
|6000003|3 |[5000100, 0.8887594120234005] |
|6000003|4 |[4000100, 0.8897079131598035] |
+-------+---+------------------------------+
As you can see the distances are very large. thus your query set is not very similar to anything. This makes it very hard to get good quality approximative results
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHubhttps://github.com/jelmerk/hnswlib/issues/23#issuecomment-716146643, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ADVDCQUNNSVK7OWKC4RPZ33SMQQKLANCNFSM4SCH6LIQ.
What does accuracy mean – I have seen that increasing k helps increase the accuracy
Precision at k
Eg what fraction of the approximate k results are part of the actual top k results
seeing very poor results with the algorithm and not sure what i am doing wrong. I have generated a test data set with solid color rectangles and then comparing it with a dataset of mixed images and solid color rectangles. my dimensions are 2048 hnsw = HnswSimilarity(identifierCol='id', featuresCol='decodedFeature', distanceFunction='cosine', m=90, ef=5, k=5, efConstruction=200, numPartitions=10, similarityThreshold=0.4, excludeSelf=True, predictionCol='approximate') i have tried it multiple times with different paramters however i am not getting good results. i have tried ef as 200 as well however increasing the ef gives lower accuracy rate than ef closer to k. will appreciate some guidance