openfoodfacts / robotoff-ann


We should use cosine similarity and not Euclidean distance when computing nearest neighbours #148

Open alexgarel opened 2 years ago

alexgarel commented 2 years ago

What

The embeddings we use generate a metric space based on cosine similarity, but when computing nearest neighbours we currently use Euclidean distance. Change this to use cosine similarity.
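
A rough sketch of what the change could look like, assuming the ANN index is built with Annoy (the dimension and data below are purely illustrative, not the project's actual code). Annoy's metric is chosen at index construction time, and its "angular" metric is the cosine-based option:

```python
import numpy as np
from annoy import AnnoyIndex

DIM = 512  # illustrative embedding dimension

# Toy stand-ins for the stored logo embeddings and a query embedding.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, DIM)).astype("float32")
query_embedding = embeddings[0]

# "angular" is Annoy's cosine-based metric (sqrt(2 - 2*cos) on normalized
# vectors); the current behaviour corresponds to "euclidean".
index = AnnoyIndex(DIM, "angular")
for item_id, vector in enumerate(embeddings):
    index.add_item(item_id, vector)
index.build(10)  # number of trees

neighbour_ids = index.get_nns_by_vector(query_embedding, 10)
print(neighbour_ids)
```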

alexgarel commented 2 years ago

Note that during Data For Good, @tcayla ran some tests on this and confirmed that it is indeed a cosine-similarity space.

raphael0202 commented 2 years ago

I don't get what you mean by "The embeddings we use generate a metric space based on cosine similarity." Can you elaborate?

alexgarel commented 2 years ago

When you generate an embedding, you are generating a space with a learned metric (this is another way of seeing it). So you should use the metric that space was trained with.

It is well known that in word2vec, the metric is cosine similarity and not Euclidean distance (and the two are quite different, especially if your vectors are not normalized; even if they are normalized, the gap widens with the number of dimensions).
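
To illustrate the difference on unnormalized vectors, here is a tiny toy example (made-up numbers, nothing to do with our actual embeddings): Euclidean distance picks the vector with the closest coordinates, while cosine similarity picks the one pointing in the same direction.

```python
import numpy as np

def cosine_sim(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

query = np.array([1.0, 1.0])
a = np.array([2.0, 2.0])  # same direction as the query, larger norm
b = np.array([1.0, 0.5])  # closer coordinates, different direction

print(np.linalg.norm(query - a), cosine_sim(query, a))  # ~1.41, 1.00 -> cosine prefers a
print(np.linalg.norm(query - b), cosine_sim(query, b))  # ~0.50, 0.95 -> Euclidean prefers b
```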

I strongly suspected the logo embedding space was also based on cosine similarity (it seems to me this is the case for most modern embedding algorithms based on training a NN and taking the last hidden layer activations). @tcayla did some experiments and told me that it was indeed the case.

Note that cosine similarity has the drawback of not respecting the triangle inequality, so approximate nearest-neighbour methods may fail, but in practice this is not significant (to respect the triangle inequality one should normalize the vectors and switch to angular distance, i.e. the arccos of the similarity, but that's really not worth it).
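
A quick check of that normalization trick (toy vectors only): once vectors are L2-normalized, the squared Euclidean distance equals 2 - 2·cos, so it ranks neighbours the same way as cosine similarity, and the angular distance is just the arccos of the similarity.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=512)
b = rng.normal(size=512)
a /= np.linalg.norm(a)  # L2-normalize both vectors
b /= np.linalg.norm(b)

cos = float(a @ b)
# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
assert np.isclose(np.linalg.norm(a - b) ** 2, 2 - 2 * cos)
print(np.arccos(np.clip(cos, -1.0, 1.0)))  # angular distance in radians
```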

raphael0202 commented 2 years ago

Do you have any references on this matter? It's true that cosine similarity is the best metric when using word2vec embeddings, but I had never heard of a learned metric associated with an embedding space. Besides the cases where a model explicitly uses a cosine similarity metric (such as some Siamese networks or the CLIP model, for example), I can't really see why cosine similarity would be more appropriate than Euclidean distance when the model is trained with a negative log-likelihood loss. From my experience, cosine similarity and Euclidean distance give really similar results for KNN on embeddings obtained from classic computer vision models (ResNet, ...).
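
One way to settle this for our embeddings would be to measure the overlap between the two neighbour sets directly, for example with scikit-learn (the data below is hypothetical; the real test would load the stored logo embeddings):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical embedding matrix standing in for the logo embeddings.
rng = np.random.default_rng(42)
embeddings = rng.normal(size=(1000, 512))

knn_euc = NearestNeighbors(n_neighbors=10, metric="euclidean").fit(embeddings)
knn_cos = NearestNeighbors(n_neighbors=10, metric="cosine").fit(embeddings)

queries = embeddings[:5]
_, idx_euc = knn_euc.kneighbors(queries)
_, idx_cos = knn_cos.kneighbors(queries)

# Fraction of the top-10 neighbours shared by the two metrics, per query.
overlap = [len(set(e) & set(c)) / 10 for e, c in zip(idx_euc, idx_cos)]
print(overlap)
```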

alexgarel commented 2 years ago

@raphael0202 I might be wrong, but as I told you, it was a supposition, and @tcayla seemed to confirm it with some experiments.

From my understanding, metric learning is another word for creating a space of similarities :-) And the embedding here is about exactly that (we are matching logos by how alike they look in real life, that is, we learn a metric between the real objects those images represent). Also, an internet search on "metric learning and embedding" might convince you that there is a strong link (but I have no synthetic document on it, and I may be wrong too; this is more my intuition about it).

For cosine similarity I also don't have a synthetic reference, and I may be wrong or inaccurate, although there seems to be some advocacy for cosine similarity in high dimensions (see https://stats.stackexchange.com/questions/29627/euclidean-distance-is-usually-not-good-for-sparse-data-and-more-general-case).