huggingface / setfit

Efficient few-shot learning with Sentence Transformers
https://hf.co/docs/setfit
Apache License 2.0

Why are the models fine-tuned with CosineSimilarity between 0 and 1? #254

Open EdouardVilain-Git opened 1 year ago

EdouardVilain-Git commented 1 year ago

Hi everyone,

This is a small question related to how models are fine-tuned during the first step of training. I see that the default loss function is losses.CosineSimilarityLoss. But when generating sentence pairs here, negative ones are assigned a 0 label. I understand that having scores between 0 and 1 is ideal, because they can be interpreted as probabilities. But cosine similarity ranges from -1 to 1, so shouldn't we expect the full range to be used? The model head can then make predictions on a more isotropic embedding space. Is this related to how Sentence Transformers are pre-trained?
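To make sure we're talking about the same thing, here is a rough sketch of the kind of pair generation I mean (simplified and hypothetical, not the actual SetFit code): same-class pairs are labelled 1.0 and different-class pairs 0.0.

```python
import random
from sentence_transformers import InputExample

def generate_pairs(sentences, labels, num_iterations=20):
    """Simplified, hypothetical sketch of contrastive pair generation.

    Assumes every class has at least two sentences.
    """
    pairs = []
    for _ in range(num_iterations):
        for sentence, label in zip(sentences, labels):
            # Positive pair: another sentence from the same class, labelled 1.0.
            positive = random.choice(
                [s for s, l in zip(sentences, labels) if l == label and s != sentence]
            )
            pairs.append(InputExample(texts=[sentence, positive], label=1.0))
            # Negative pair: a sentence from a different class, labelled 0.0 (not -1.0).
            negative = random.choice([s for s, l in zip(sentences, labels) if l != label])
            pairs.append(InputExample(texts=[sentence, negative], label=0.0))
    return pairs
```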

Thanks for your clarifications!

tomaarsen commented 1 year ago

Hello @EdouardVilain-Git,

That is a great question, and an even better observation! Consider me impressed.

Theoretical perspective

I did some digging today in the SentenceTransformers documentation, in particular around the CosineSimilarityLoss. The documentation indicates that the loss is computed as:

$$||label - cosine\_sim(u, v)||_2$$

In the current SetFit implementation, a negative sample corresponds to a label of 0. If we consider the best case scenario for a negative pair of two output embeddings $u$ and $v$, i.e. they are exact opposites, then $$cosine\_sim(u, v) = -1$$ and thus:

$$||0 - (-1)||_2 = 1$$

We would have expected a 0 loss for this scenario. If instead we used -1 as our label for a negative pair, then we would have the expected 0 loss. This is theoretical evidence in favor of using -1 for the negative pairs.
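As a quick numerical sanity check (a hypothetical snippet using plain PyTorch rather than the actual loss class), the squared-error loss for a perfectly opposite pair is 1 when the label is 0 and 0 when the label is -1:

```python
import torch
import torch.nn.functional as F

u = torch.tensor([[1.0, 0.0]])
v = -u  # exact opposite embedding, so the cosine similarity is -1

cos_sim = F.cosine_similarity(u, v)  # tensor([-1.])

# Negative pair labelled 0 (current SetFit behaviour): the loss is 1, not 0.
print(F.mse_loss(cos_sim, torch.tensor([0.0])))   # tensor(1.)
# Negative pair labelled -1: the loss is 0, as expected.
print(F.mse_loss(cos_sim, torch.tensor([-1.0])))  # tensor(0.)
```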

The documentation for CosineSimilarityLoss gives a curious example, where two texts "Another pair" and "Unrelated sentence" are given a label of 0.3. This is unexpectedly high if the loss expects values ranging from -1 for unrelated to 1 for identical texts. I think that the label in this example ought to be negative, e.g. -0.3.

In conclusion, I believe that in theory, a label of -1 would be preferable for negative pairs using this loss function.

Practical perspective

Conveniently, SetFit comes equipped with a useful set of scripts for reproducing the paper results (see scripts/setfit). Using these, we can easily compare the actual results of using 0 versus -1 as the negative pair label.

Commands to reproduce

```
python .\scripts\setfit\run_fewshot.py --sample_sizes=8 --batch_size 4 --is_test_set=true
```

This script trains SetFit on 6 different datasets, 10 times each, and tracks the performance each time. The `enron_spam` dataset gave an OOM exception on my computer, so I ran the following command separately:

```
python .\scripts\setfit\run_fewshot.py --sample_sizes=8 --lr=0.01 --batch_size 2 --dataset=enron_spam
```

Then, I ran

```
python .\scripts\create_summary_table.py --path .\scripts\setfit\results\paraphrase-mpnet-base-v2-CosineSimilarityLoss-logistic_regression-iterations_20-batch_4\
```

and

```
python .\scripts\create_summary_table.py --path .\scripts\setfit\results\paraphrase-mpnet-base-v2-CosineSimilarityLoss-logistic_regression-iterations_20-batch_2\
```

This creates `summary_table.csv` files with averages and standard deviations. Afterwards, I modified the following two lines to use `-1.0` instead of `0.0`:

https://github.com/huggingface/setfit/blob/35c0511fa9917e653df50cb95a22105b397e14c0/src/setfit/modeling.py#L565
https://github.com/huggingface/setfit/blob/35c0511fa9917e653df50cb95a22105b397e14c0/src/setfit/modeling.py#L592

I then removed the `scripts/setfit/results` folder so that the experiments would run anew, and repeated all of the commands above. The results were read from the produced `summary_table.csv` files and placed in the table below.
|  | emotion (acc) | SentEval-CR (acc) | sst5 (acc) | ag_news (acc) | enron_spam (acc) | amazon_counterfactual_en (matthews_correlation) |
|---|---|---|---|---|---|---|
| SetFit (negative pair label=0.0) | 46.6 (4.4) | 88.5 (0.9) | 43.4 (3.0) | 82.8 (2.7) | 88.6 (3.3) | 40.3 (13.0) |
| SetFit (negative pair label=-1.0) | 48.1 (3.8) | 88.1 (1.2) | 44.1 (2.3) | 81.5 (3.7) | 87.8 (5.5) | 39.3 (14.4) |

I want to point out that in this scenario, with just 8 samples per class, the quality of those samples may be very important. This may explain the inconsistent results between the two tests and the relatively large standard deviations. It would be interesting to see if there are clearer differences between using 0.0 and -1.0 as the negative pair label if the two approaches used the same 10 different seeds. From a practical point of view, it seems like there is nothing conclusive that we can say about whether using 0.0 or -1.0 is preferable.

Note

Note also that it is a bit naive to simply change 0.0 to -1.0 in the following line:

https://github.com/huggingface/setfit/blob/35c0511fa9917e653df50cb95a22105b397e14c0/src/setfit/modeling.py#L565

SetFitModel accepts various loss functions, each of which may require a different label range, so blindly changing 0.0 to -1.0 may cause errors or degraded performance. See the Sentence Transformers Losses page for some examples.
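If we did want to support this, one option (purely a hypothetical sketch, not something implemented in SetFit) would be to make the negative pair label depend on the chosen loss class instead of hard-coding 0.0:

```python
from sentence_transformers import losses

# Hypothetical mapping of loss classes to the negative pair label they expect.
# ContrastiveLoss and OnlineContrastiveLoss expect 0 (dissimilar) / 1 (similar),
# while CosineSimilarityLoss could arguably use -1 for negative pairs.
NEGATIVE_PAIR_LABEL = {
    losses.CosineSimilarityLoss: -1.0,
    losses.ContrastiveLoss: 0.0,
    losses.OnlineContrastiveLoss: 0.0,
}

def negative_label_for(loss_class) -> float:
    # Fall back to 0.0 for any loss not listed explicitly.
    return NEGATIVE_PAIR_LABEL.get(loss_class, 0.0)
```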

Again, very well spotted. I'm curious to hear what others think of this finding.

cc: @lewtun

EdouardVilain-Git commented 1 year ago

Hi @tomaarsen,

Thanks for replying, the results are really interesting to look at! Given the comparison, I guess it makes sense to keep the current negative pair label of 0, even though it isn't the most logical choice from a theoretical standpoint. I'm curious to know whether this comes from how Sentence Transformers are pre-trained, but I can't find the related documentation.

Please let me know if other members have insights on this!

LuketheDukeBates commented 1 year ago

Great question @EdouardVilain-Git and great answer @tomaarsen! This is something Nils and I pondered over for a bit during the early days of SetFit. We came to the conclusion that it doesn't really matter in the binary case because, from a slightly more intuitive perspective, by using 1 and -1 you're saying that the labels belong on opposite sides of the vector space. In the multiclass case, however, it must be 1 and 0 (or 0 and -1), because "opposite" doesn't make sense with more than two classes, whereas the labels can still be assumed to be independent. I played around with it a bit here if you want to see some lovely t-SNE plots.
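To make the "opposite doesn't make sense with more than two classes" point concrete, here is a tiny (hypothetical) check: three normalized embeddings cannot all be pairwise opposite, because u = -v and v = -w forces u = w.

```python
import torch
import torch.nn.functional as F

u = F.normalize(torch.randn(768), dim=0)
v = -u  # opposite of u: cos(u, v) = -1
w = -v  # opposite of v: cos(v, w) = -1

# But then w equals u, so cos(u, w) = +1: three classes cannot be mutually "opposite".
print(torch.dot(u, v).item(), torch.dot(v, w).item(), torch.dot(u, w).item())  # ≈ -1.0, -1.0, 1.0
```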

tomaarsen commented 1 year ago

@LuketheDukeBates It's great to see some additional experiments have been carried out regarding this topic, and I'm glad to see that it wasn't overlooked or done by accident. In preparation for my thesis which involves SetFit, I also further experimented with using variations on the positive and negative pair labels. I've grabbed this pdf as a small part of my larger manuscript that describes my recent experiments surrounding SetFit.

To summarize, using a negative pair label of -1 does notably impact performance, but rarely for the better. Additionally, using e.g. a positive label of 0.9, or "scaling" the cosine similarity by 1.1 and then clamping back to [-1, 1], does not notably affect performance. The latter approach seemed promising, as it results in a loss of 0 for a positive pair with > 0.91 cosine similarity and a negative pair with < -0.91 cosine similarity. In other words, the model won't unnecessarily push already-very-similar samples even closer together. My intuition was that training towards "perfect similarity" would encourage overfitting and reduce generalizability.

Also, I'll always have a love for t-SNE plots. They've come in very handy when investigating overfitting of SetFit models recently.

LuketheDukeBates commented 1 year ago

@tomaarsen I'm delighted to hear you're exploring it in your thesis! I came to a similar conclusion for SetFit's training objective and overfitting regarding the number of training samples here.

Are you suggesting using a continuous label for the ST fine-tuning? I think that makes a lot of sense and would be eager to see it in practice.

tomaarsen commented 1 year ago

@LuketheDukeBates Thanks for linking your paper! I'm very interested in reading it through more thoroughly. It is very interesting to see SetFit's performance decrease so drastically for larger steps on Amazon CF. If I have further questions, can I reach out to you on Slack? I think I can find you via the HF Slack.

And yes, I am interested in modifying the contrastive learning phase, in particular surrounding the labels and the pair sampling. I think more robust sampling approaches are worth adding to this repository, although it seems like they do not impact performance notably (e.g. #268). What exactly do you mean by a continuous label? Using non-binary labels based on some computed metric, kind of like what is used in the SetFit distillation trainer? https://github.com/huggingface/setfit/blob/944e9e26ac1990db16e68a8a0ff48d8d2de8f9a3/src/setfit/modeling.py#L737-L739
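For reference, the distillation trainer linked above uses exactly that kind of continuous label; a rough, hypothetical sketch of the idea (not the exact implementation) would be to label each pair with the cosine similarity of the teacher's embeddings:

```python
from sentence_transformers import InputExample, util

def soft_labelled_pairs(sentences, teacher_model):
    """Hypothetical sketch: label pairs with the teacher's cosine similarity."""
    embeddings = teacher_model.encode(sentences, convert_to_tensor=True)
    cos_sim = util.cos_sim(embeddings, embeddings)
    pairs = []
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            # Continuous label in [-1, 1] instead of a hard 0/1 pair label.
            pairs.append(
                InputExample(texts=[sentences[i], sentences[j]], label=cos_sim[i][j].item())
            )
    return pairs
```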

LuketheDukeBates commented 1 year ago

@tomaarsen Sure, please feel free to contact me any and all ways. :-) Oh, cool! Yes, it would be neat to see different sampling methods for SetFit. For example, I think using a coreset approach to sampling would be an elegant way of exploiting the ST's pretrained knowledge. That said, random sampling is notoriously strong.

Ah, kind of? I think I misunderstood your earlier comment.

tomaarsen commented 1 year ago

I'll be in touch!

And as for your continuous label comment, I'm interested to hear what you had in mind. Perhaps it is still worth considering! My scaling strategy is perhaps best described in the pdf above, but it may still be somewhat unclear, as that document is primarily for my own note-keeping. It's very simple: it involves supplying the CosineSimilarityLoss with a cos_score_transformation here: https://github.com/UKPLab/sentence-transformers/blob/3e1929fddef16df94f8bc6e3b10598a98f46e62d/sentence_transformers/losses/CosineSimilarityLoss.py#L40 The transformation is simply:

```python
import torch as t

SCALING_FACTOR = 1.1

def transformation(cos_similarity: t.Tensor) -> t.Tensor:
    # Scale the cosine similarity slightly outwards and clip back into [-1, 1],
    # so pairs that are already similar (or dissimilar) enough incur zero loss.
    return t.clip(cos_similarity * SCALING_FACTOR, -1, 1)
```

Then, positive pairs with a cosine similarity of roughly 0.91 or higher will have 0 loss and won't affect training.
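For completeness, wiring this into the loss would look something like the following (assuming a plain SentenceTransformer model; the cos_score_transformation argument is part of the standard CosineSimilarityLoss constructor):

```python
from sentence_transformers import SentenceTransformer, losses

model = SentenceTransformer("paraphrase-mpnet-base-v2")
# Pass the clipping transformation defined above so that pairs which are already
# similar (or dissimilar) enough contribute zero loss.
train_loss = losses.CosineSimilarityLoss(model, cos_score_transformation=transformation)
```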

stephantul commented 1 year ago

To chime in here: as mentioned, I think it is important to realize that for cosine similarity, 0 means orthogonal, while -1 means opposite.

In particular, for every normalized vector x there exists exactly one normalized vector y for which the cosine similarity between x and y is -1, while there exist infinitely many vectors whose cosine similarity with x is 0. It is therefore undesirable to use a loss that is only 0 when the cosine similarity between two vectors is -1: it means you push all of an anchor's unrelated negatives towards the same antipodal point, which is of course kind of weird, since those unrelated negatives might have different class labels.

I'm surprised that the scores don't degrade that much when using -1 as a target. However, I suspect that most vectors will end up having a cosine similarity of around 0 anyway, even when attempting to push them towards -1.
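That intuition is easy to illustrate with a quick (hypothetical) experiment: random high-dimensional vectors are nearly orthogonal on average, so cosine similarities naturally cluster around 0.

```python
import torch
import torch.nn.functional as F

# Pairwise cosine similarities between random 768-dimensional unit vectors.
x = F.normalize(torch.randn(1000, 768), dim=1)
cos = x @ x.T
off_diagonal = cos[~torch.eye(1000, dtype=torch.bool)]
print(off_diagonal.mean().item(), off_diagonal.std().item())  # mean ≈ 0, std ≈ 1/sqrt(768) ≈ 0.036
```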

tomaarsen commented 1 year ago

Good analysis, @stephantul. I think you're quite right.