microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
https://aka.ms/GeneralAI
MIT License

Can I use L2 to calculate the distance between two embeddings created by e5-base-v2? #1480

Open weiZhenkun opened 6 months ago

weiZhenkun commented 6 months ago

Describe: I am using the e5-base-v2 model. I have read the documentation at https://huggingface.co/intfloat/e5-base-v2, which says the cosine similarity scores are distributed around 0.7 to 1.0.
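For reference, a minimal sketch of that usage, adapted from the model card at the URL above (the `query:` prefix, average pooling, and normalization follow the card; the example texts are placeholders):

```python
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoModel, AutoTokenizer

def average_pool(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
    # Mean-pool the token embeddings, ignoring padding positions.
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

tokenizer = AutoTokenizer.from_pretrained("intfloat/e5-base-v2")
model = AutoModel.from_pretrained("intfloat/e5-base-v2")

# e5 models expect a "query: " or "passage: " prefix on every input text.
texts = ["query: how much protein should a female eat",
         "query: summit define"]
batch = tokenizer(texts, max_length=512, padding=True, truncation=True, return_tensors="pt")
outputs = model(**batch)
embeddings = average_pool(outputs.last_hidden_state, batch["attention_mask"])
embeddings = F.normalize(embeddings, p=2, dim=1)   # unit-length embeddings
score = (embeddings[0] @ embeddings[1]).item()     # cosine similarity of the two texts
```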

The sketch above shows how I use the e5-base-v2 model.

My questions:

  1. Is it the right way? Can I use L2 to calculate the distance between two embeddings created by e5-base-v2?
  2. If we use cosine similarity, do I need to normalize the embeddings?
  3. If the score range of e5-base-v2 is [0.7, 1.0], is there a suitable sub-range for relatively similar pairs?

@intfloat Can you help me?

intfloat commented 6 months ago
  1. Is it the right way? Can I use L2 to calculate the distance between two embeddings created by e5-base-v2? Yes. For normalized embeddings, L2 distance is mathematically equivalent to cosine similarity: the two produce the same ranking, since ‖a − b‖² = 2 − 2·cos(a, b) for unit vectors (see the sketch after this list). The difference is that a smaller L2 distance means a better match, while for cosine similarity a higher score means a better match.

  2. If we use cosine similarity, do I need to normalize the embeddings? No, you do not: the cosine similarity computation already includes a normalization step.

  3. If the score range of e5-base-v2 is [0.7, 1.0], is there a suitable sub-range for relatively similar pairs? It really depends on your application; it is better to determine the threshold on a validation dataset.
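To make the equivalence in point 1 concrete, here is a small numeric sketch (a minimal illustration using numpy with random stand-in vectors, not real e5-base-v2 embeddings):

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=(2, 768))  # stand-ins for two 768-dim e5-base-v2 embeddings
a /= np.linalg.norm(a)            # normalize to unit length
b /= np.linalg.norm(b)

cos = float(a @ b)                 # cosine similarity: higher means more similar
l2 = float(np.linalg.norm(a - b))  # L2 distance: lower means more similar

# For unit vectors the two metrics are tied by ||a - b||^2 = 2 - 2 * cos(a, b),
# so ranking by ascending L2 is identical to ranking by descending cosine.
assert np.isclose(l2 ** 2, 2.0 - 2.0 * cos)
```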

weiZhenkun commented 6 months ago

@intfloat Thanks for your quick response. 2 more questions:

  1. If the goal is to compare text similarity, is it recommended to use L2 or cosine similarity on the embeddings generated by e5-base-v2?
  2. Is there any range for the L2 distance?
@intfloat Can you help me? Thanks.
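A note on the second question: it follows from the identity in the answer above. For unit-normalized embeddings, ‖a − b‖² = 2 − 2·cos(a, b), so the model card's cosine range of [0.7, 1.0] maps to an L2 distance range of roughly [0, 0.775] (a quick check in numpy):

```python
import numpy as np

# Cosine similarity in [0.7, 1.0] for unit vectors implies
# L2 distance = sqrt(2 - 2 * cos) in [0, sqrt(0.6)] ~= [0, 0.775].
lo, hi = np.sqrt(2.0 - 2.0 * np.array([1.0, 0.7]))
print(lo, hi)  # 0.0 0.7745966692414834
```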