microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
https://aka.ms/GeneralAI
MIT License

High (in terms of absolute cosine similarity) similarity scores between two completely irrelevant / random pairs. #1216

Open changun opened 1 year ago

changun commented 1 year ago

Describe the bug: E5-small/base/large v1/v2

Hi unilm team,

Thank you so much for the great project! We are trying to replace sentence-transformers with E5. When using the official example scripts with the E5-small/base/large v1/v2 models, the cosine/inner-product similarity between two completely irrelevant sentences is still over 0.7. This occurs even when using the query/passage prefixes. As a result, there are no pairs of sentences with scores significantly below 0.7.

While the relevant pairs' similarity scores are indeed relatively higher than the irrelevant pairs' (e.g. around 0.8), the irrelevant pairs' scores are still high in absolute terms. This high similarity between irrelevant pairs makes approximate KNN retrieval ineffective and difficult to use in search-retrieval applications.

As a result, the performance of the E5 model is worse for our application compared to other models with worse benchmark results, such as all-MiniLM-L6-v2.

We are wondering if you have any advice or suggestions to work around this high baseline similarity score issue. Thank you so much!

Expected Behavior: The similarity between irrelevant pairs should be significantly lower, ideally 0.5 or below. Alternatively, it may be necessary to use a similarity measure other than inner product or cosine similarity.
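For intuition, a high similarity floor is what one would expect from an anisotropic embedding space: if every embedding shares a large common component, even unrelated pairs score well above zero. A toy numpy sketch (the shared-direction model below is an illustrative assumption, not E5's actual geometry):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 384

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy model of an anisotropic space: every "embedding" is a large shared
# direction plus a smaller, sentence-specific random component.
common = rng.normal(size=dim)
common /= np.linalg.norm(common)

def fake_embedding():
    specific = rng.normal(size=dim)
    specific /= np.linalg.norm(specific)
    return 2.0 * common + specific  # shared part dominates

a, b = fake_embedding(), fake_embedding()
print(cosine(a, b))  # high, even though the sentence-specific parts are unrelated
```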

intfloat commented 1 year ago

This is a known and expected behavior. For tasks like text retrieval or semantic similarity, what matters is the relative order of the scores instead of the absolute values, so this is generally not an issue.

Can you provide more information about your application? For example, what type of tasks are you dealing with? And do E5 models perform worse than all-MiniLM-L6-v2 even with exhaustive vector search?

If it is indeed an issue, you can apply a whitening operation to make the embeddings distribute more uniformly, as described in the paper https://arxiv.org/abs/2103.15316, but this would add extra complexity.
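The whitening transform from that paper shifts the embeddings to zero mean and rescales them so their covariance becomes the identity. A minimal numpy sketch (toy data; `whiten` is an illustrative helper name):

```python
import numpy as np

def whiten(embeddings, eps=1e-9):
    """Whitening as in https://arxiv.org/abs/2103.15316: subtract the mean,
    then rotate/scale with W = U diag(1/sqrt(S)) from the SVD of the
    covariance, so the transformed embeddings have identity covariance."""
    mu = embeddings.mean(axis=0, keepdims=True)
    cov = np.cov(embeddings.T)
    u, s, _ = np.linalg.svd(cov)
    w = u @ np.diag(1.0 / np.sqrt(s + eps))
    return (embeddings - mu) @ w

# toy demo on correlated, shifted random "embeddings"
rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 16)) @ rng.normal(size=(16, 16)) + 5.0
xw = whiten(x)
print(np.allclose(np.cov(xw.T), np.eye(16), atol=1e-5))  # → True
```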

mgoldenbe commented 1 year ago

@intfloat I haven't looked up what whitening means, but would simply rescaling the score not be adequate? That is, given a score X in [0.7, 1.0], take as score 1.0 - (1 - X)/0.7.

intfloat commented 1 year ago

> @intfloat I haven't looked up what whitening means, but would simply rescaling the score not be adequate? That is, given a score X in [0.7, 1.0], take as score 1.0 - (1 - X)/0.7.

Sure, you can do something like that. But this does not really affect the rankings...

mgoldenbe commented 1 year ago

>> @intfloat I haven't looked up what whitening means, but would simply rescaling the score not be adequate? That is, given a score X in [0.7, 1.0], take as score 1.0 - (1 - X)/0.7.

> Sure, you can do something like that. But this does not really affect the rankings...

Now I am confused. I thought the relative rankings were fine, so why would one want to affect them?

intfloat commented 1 year ago

>>> @intfloat I haven't looked up what whitening means, but would simply rescaling the score not be adequate? That is, given a score X in [0.7, 1.0], take as score 1.0 - (1 - X)/0.7.

>> Sure, you can do something like that. But this does not really affect the rankings...

> Now I am confused. I thought the relative rankings were fine, so why would one want to affect them?

Sorry for the possible confusion.

I mean that there is no need to do such rescaling, since the relative rankings are not affected. The high similarity score is not an issue anyway.
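To see why such a rescaling cannot change the rankings: any strictly increasing map of the scores preserves their sort order. A minimal sketch (the linear map below, sending 0.7 to 0 and 1.0 to 1, is just one illustrative choice):

```python
import numpy as np

scores = np.array([0.92, 0.74, 0.81, 0.70])      # raw cosine scores
rescaled = (scores - 0.7) / 0.3                  # monotone map of [0.7, 1.0] onto [0, 1]

# the sort order is identical before and after rescaling
print(np.array_equal(np.argsort(scores), np.argsort(rescaled)))  # → True
```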

freckletonj commented 1 year ago

I ran a small experiment with whitening vs not whitening, and whitening was rather disappointing. My dataset was tiny, n=100, so maybe that was the reason?

The experiment was simple:

Running whitened vs. unwhitened embeddings through that setup, unwhitened generally performed better across many different hyperparameter settings.

One thing that worked amazingly well for classification was to create a new embedding for each class containing the average embedding of that class. This really boosted accuracy.
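The class-centroid trick described above can be sketched like this (function name and toy data are hypothetical stand-ins for real embeddings):

```python
import numpy as np

def centroid_classify(train_emb, train_labels, query_emb):
    """Average the embeddings of each class into a centroid, then assign a
    query to the class whose centroid has the highest cosine similarity."""
    classes = sorted(set(train_labels))
    centroids = np.stack([
        train_emb[np.array(train_labels) == c].mean(axis=0) for c in classes
    ])
    centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)
    q = query_emb / np.linalg.norm(query_emb)
    return classes[int(np.argmax(centroids @ q))]

# toy demo: two well-separated clusters standing in for class embeddings
rng = np.random.default_rng(0)
a = rng.normal(loc=[3.0, 0.0, 0.0], scale=0.1, size=(20, 3))
b = rng.normal(loc=[0.0, 3.0, 0.0], scale=0.1, size=(20, 3))
emb = np.vstack([a, b])
labels = ["sports"] * 20 + ["finance"] * 20
print(centroid_classify(emb, labels, np.array([2.9, 0.1, 0.0])))  # → sports
```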

Another accuracy booster has been adding a simple FFNN with tanh outputs on top of the stack, doing supervised learning, and transforming the embeddings that way. Again, this is for classification. Training a NN on a small dataset will essentially re-bias the original sentence transformer away from generality, but for a known set of labels that's OK.
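A minimal numpy sketch of that kind of supervised tanh head (the single-layer architecture, the ±1 target codes, and the hyperparameters are illustrative assumptions, not the exact setup described above):

```python
import numpy as np

def train_tanh_head(x, targets, lr=0.1, steps=500, seed=0):
    """Learn y = tanh(x @ w + b) so embeddings of the same class land near a
    shared +/-1 target code; the tanh output is the transformed embedding."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.1, size=(x.shape[1], targets.shape[1]))
    b = np.zeros(targets.shape[1])
    for _ in range(steps):
        y = np.tanh(x @ w + b)
        # gradient of the mean squared error w.r.t. the pre-activation
        g = 2.0 * (y - targets) * (1.0 - y ** 2) / len(x)
        w -= lr * x.T @ g
        b -= lr * g.sum(axis=0)
    return w, b

# toy stand-ins for embeddings of two classes
rng = np.random.default_rng(1)
x = np.vstack([rng.normal(loc=1.0, size=(30, 8)), rng.normal(loc=-1.0, size=(30, 8))])
t = np.vstack([np.tile([1.0, -1.0], (30, 1)), np.tile([-1.0, 1.0], (30, 1))])
w, b = train_tanh_head(x, t)
print(float(((np.tanh(x @ w + b) - t) ** 2).mean()))  # small after training
```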

Others, please report what does and doesn't work for you to squeeze more out of text embeddings!

aalinazar commented 1 year ago

I still don't understand why the relative rankings are fine. From my understanding, relative rankings only tell you what is more similar and what is less similar. But how do you filter out the irrelevant ones?

For example: there are 5 passages and only 1 of them is relevant; the other 4 are not. I want to return only that 1 result and filter out the rest.

Or, in the worst-case scenario, none of them is relevant, and nothing should be returned.

How can that be done?

freckletonj commented 1 year ago

@aalinazar to put a finer point on it, if your document pool has 3 matches, search should surface 3 matches. If it has 100 matches, then 100. 0 then 0. Otherwise we place an arbitrary threshold on the search.

These models can't handle that, but I recall reading somewhere about "flows" you can use during training to encourage cos_sim to go to 0 for dissimilar items. Please let me know if you end up solving this!
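Short of retraining, one pragmatic workaround is to calibrate the cutoff on a small labeled dev set instead of hard-coding an absolute score. A sketch (function name and toy data are hypothetical):

```python
import numpy as np

def pick_threshold(scores, labels):
    """Choose the cutoff on a labeled dev set that maximizes F1, rather
    than assuming any particular absolute similarity scale."""
    best_t, best_f1 = 0.0, -1.0
    for t in np.unique(scores):
        pred = scores >= t
        tp = np.sum(pred & labels)
        precision = tp / max(pred.sum(), 1)
        recall = tp / max(labels.sum(), 1)
        f1 = 2 * precision * recall / max(precision + recall, 1e-12)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t

# toy dev set: relevant pairs cluster near 0.85, irrelevant near 0.75
scores = np.array([0.86, 0.84, 0.88, 0.83, 0.76, 0.74, 0.77, 0.72])
labels = np.array([1, 1, 1, 1, 0, 0, 0, 0], dtype=bool)
t = pick_threshold(scores, labels)
print(0.77 < t <= 0.83)  # the chosen cutoff separates the two clusters
```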

YanDavKMS commented 9 months ago

@aalinazar I totally agree: relative scoring is still a problem if you need a threshold; otherwise you can get irrelevant results. Right now I'm using models from SBERT, but they have a small context size and I wanted a larger one.

BTW OpenAI embedding also has the same issue.

Were you able to solve this?

rishabh16196 commented 6 months ago

Does anybody have any ideas to try out for this issue? I am also facing a similar problem where I don't want irrelevant results in my output.