microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
https://aka.ms/GeneralAI
MIT License

Can I use L2 to calculate the distance between two embeddings created by e5-base-v2? #1480

Open weiZhenkun opened 6 months ago

weiZhenkun commented 6 months ago

Describe: I am using the e5-base-v2 model. I have read the documentation at https://huggingface.co/intfloat/e5-base-v2, which says the cosine similarity scores are distributed around 0.7 to 1.0.
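For reference, a minimal sketch of that usage, adapted from the model card at the URL above (the `query:` prefix, average pooling, and normalization follow the card; the example texts are placeholders):

```python
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoModel, AutoTokenizer

def average_pool(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
    # Mean-pool the token embeddings, ignoring padding positions.
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

tokenizer = AutoTokenizer.from_pretrained("intfloat/e5-base-v2")
model = AutoModel.from_pretrained("intfloat/e5-base-v2")

# e5 models expect a "query: " or "passage: " prefix on every input text.
texts = ["query: how much protein should a female eat",
         "query: summit define"]
batch = tokenizer(texts, max_length=512, padding=True, truncation=True, return_tensors="pt")
outputs = model(**batch)
embeddings = average_pool(outputs.last_hidden_state, batch["attention_mask"])
embeddings = F.normalize(embeddings, p=2, dim=1)   # unit-length embeddings
score = (embeddings[0] @ embeddings[1]).item()     # cosine similarity of the two texts
```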

The sketch above shows how I use the e5-base-v2 model.

My questions:

  1. Is it the right way? Can I use L2 to calculate the distance between two embeddings created by e5-base-v2?
  2. If we use cosine similarity, do I need to normalize the embeddings?
  3. If the score range of e5-base-v2 is [0.7, 1.0], is there a suitable sub-range for relatively similar pairs?

@intfloat Can you help me?

intfloat commented 6 months ago
  1. Is it the right way? Can I use L2 to calculate the distance between two embeddings created by e5-base-v2? Yes. For normalized embeddings, L2 distance is mathematically equivalent to cosine similarity: the two produce the same ranking, since ‖a − b‖² = 2 − 2·cos(a, b) for unit vectors (see the sketch after this list). The difference is that a smaller L2 distance means a better match, while for cosine similarity a higher score means a better match.

  2. If we use cosine similarity, do I need to normalize the embeddings? No, you do not: the cosine similarity computation already includes a normalization step.

  3. If the score range of e5-base-v2 is [0.7, 1.0], is there a suitable sub-range for relatively similar pairs? It really depends on your application; it is better to determine the threshold on a validation dataset.
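To make the equivalence in point 1 concrete, here is a small numeric sketch (a minimal illustration using numpy with random stand-in vectors, not real e5-base-v2 embeddings):

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=(2, 768))  # stand-ins for two 768-dim e5-base-v2 embeddings
a /= np.linalg.norm(a)            # normalize to unit length
b /= np.linalg.norm(b)

cos = float(a @ b)                 # cosine similarity: higher means more similar
l2 = float(np.linalg.norm(a - b))  # L2 distance: lower means more similar

# For unit vectors the two metrics are tied by ||a - b||^2 = 2 - 2 * cos(a, b),
# so ranking by ascending L2 is identical to ranking by descending cosine.
assert np.isclose(l2 ** 2, 2.0 - 2.0 * cos)
```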

weiZhenkun commented 6 months ago

@intfloat Thanks for your quick response. 2 more questions:

  1. If the goal is to compare text similarity, is it recommended to use L2 or cosine similarity on the embeddings generated by e5-base-v2?
  2. Is there any range for the L2 distance?
@intfloat Can you help me? Thanks.
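A note on the second question: it follows from the identity in the answer above. For unit-normalized embeddings, ‖a − b‖² = 2 − 2·cos(a, b), so the model card's cosine range of [0.7, 1.0] maps to an L2 distance range of roughly [0, 0.775] (a quick check in numpy):

```python
import numpy as np

# Cosine similarity in [0.7, 1.0] for unit vectors implies
# L2 distance = sqrt(2 - 2 * cos) in [0, sqrt(0.6)] ~= [0, 0.775].
lo, hi = np.sqrt(2.0 - 2.0 * np.array([1.0, 0.7]))
print(lo, hi)  # 0.0 0.7745966692414834
```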