princeton-nlp / SimCSE

[EMNLP 2021] SimCSE: Simple Contrastive Learning of Sentence Embeddings https://arxiv.org/abs/2104.08821
MIT License

Distance vs. similarity confusion in faiss #198

Closed salmanmashayekh closed 1 year ago

salmanmashayekh commented 1 year ago

The search method of a faiss index built with the L2 metric returns L2 distances (smaller means closer), not similarities (larger means closer). But the following line in pack_single_result assumes similarities:

results = [(self.index["sentences"][i], s) for i, s in zip(idx, dist) if s >= threshold]

I think it should be changed to the following:

results = [(self.index["sentences"][i], d) for i, d in zip(idx, dist) if d <= threshold]

Note the <= vs >= in the list comprehension.
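To make the direction of the comparison concrete, here is a minimal sketch in plain Python (no faiss required). The helper name `filter_hits` and the `higher_is_better` flag are hypothetical and are not part of SimCSE; the sketch only illustrates that a similarity threshold needs `>=` while a distance threshold needs `<=`. It also skips index `-1`, which faiss uses to pad results when fewer neighbours are found than requested.

```python
from typing import List, Sequence, Tuple


def filter_hits(
    sentences: Sequence[str],
    idx: Sequence[int],
    scores: Sequence[float],
    threshold: float,
    higher_is_better: bool,
) -> List[Tuple[str, float]]:
    """Keep (sentence, score) pairs that pass the threshold.

    higher_is_better=True  -> scores are similarities, keep score >= threshold
    higher_is_better=False -> scores are distances,    keep score <= threshold
    """
    results = []
    for i, s in zip(idx, scores):
        if i < 0:  # faiss pads missing neighbours with index -1
            continue
        passes = s >= threshold if higher_is_better else s <= threshold
        if passes:
            results.append((sentences[i], s))
    return results


sentences = ["a cat", "a dog", "a car"]
idx = [0, 1, 2]

# Similarities: larger means closer, so the threshold is a lower bound.
print(filter_hits(sentences, idx, [0.9, 0.7, 0.2], 0.6, higher_is_better=True))

# Distances: smaller means closer, so the threshold is an upper bound.
print(filter_hits(sentences, idx, [0.2, 0.6, 1.6], 0.8, higher_is_better=False))
```

Both calls keep the first two sentences. Applying `>=` to the distance scores instead would keep only "a car", the least similar one, which is the failure mode described above.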

Alternatively, we can change the metric_type to inner product instead of L2 distance, which equals cosine similarity when the embeddings are L2-normalized:

index.metric_type = faiss.METRIC_INNER_PRODUCT
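The two options are closely related when embeddings are unit-length: for normalized vectors a and b, ||a - b||^2 = 2 - 2 * cos(a, b). The sketch below checks this identity in plain Python and converts between the two kinds of threshold. It assumes the index reports squared L2 distances, which is what faiss flat L2 indexes return as far as I know; all helper names here are hypothetical and not part of SimCSE or faiss.

```python
import math
from typing import List, Sequence


def normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def squared_l2_to_cosine(d2: float) -> float:
    """Valid only for unit-length vectors: cos = 1 - d^2 / 2."""
    return 1.0 - d2 / 2.0


def cosine_threshold_to_squared_l2(t: float) -> float:
    """A similarity bound cos >= t corresponds to d^2 <= 2 - 2t."""
    return 2.0 - 2.0 * t


a = normalize([1.0, 2.0, 3.0])
b = normalize([3.0, 2.0, 1.0])

cos_direct = inner_product(a, b)
cos_from_l2 = squared_l2_to_cosine(squared_l2(a, b))
print(math.isclose(cos_direct, cos_from_l2))  # the two routes agree

# A cosine threshold of 0.6 maps to a squared-distance threshold of about 0.8.
print(cosine_threshold_to_squared_l2(0.6))
```

Under these assumptions, either fix gives the same set of results: switch the metric to inner product and keep `s >= threshold`, or keep L2 and compare `d <= 2 - 2 * threshold`.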

gaotianyu1350 commented 1 year ago

Hi, thanks for reporting this! It only occurs when using fast faiss, and the problem has now been fixed.