princeton-nlp / SimCSE

[EMNLP 2021] SimCSE: Simple Contrastive Learning of Sentence Embeddings https://arxiv.org/abs/2104.08821
MIT License
3.36k stars, 507 forks

How to interpret the logits output in SequenceClassifierOutput? #170

Closed VictorMatt01 closed 2 years ago

VictorMatt01 commented 2 years ago

I have trained a model using the SimCSE code (including the same functions for calculating similarity), but I can't work out how to interpret the logits in the SequenceClassifierOutput. When I pass the following sentences as input to the trained model:

sentences = [
    "Soccer is a sport.",
    "Soccer is a sport.",
    "Piano is an instrument.",
    "Piano is an instrument.",
    "This is just a normal sentence",
    "This is just a normal sentence"
]

the model returns the output of the cls.sim function as the logits value (https://github.com/princeton-nlp/SimCSE/blob/5005c3daab99cb9f6ff92c526ab751079a169826/simcse/models.py#L226). It looks like a 3x3 matrix of cosine similarity scores, but how do you interpret these scores?

gaotianyu1350 commented 2 years ago

The input should be a list of sentence pairs. Based on your description, the 3x3 matrix should be the cosine similarity between the first 3 sentences and the second 3 sentences.

VictorMatt01 commented 2 years ago

So, compared with the loss function from the paper, the model takes sentences 0-2 as the x_i sentences and sentences 3-5 as the x_i^+ sentences? I thought that sentences 0 and 1 were a pair, 2 and 3 a pair, and 4 and 5 a pair.

gaotianyu1350 commented 2 years ago

Sorry, I was confused about what you were referring to. The code you pointed at takes input of shape (batch size, 2, sent length). So when the batch is flattened out, the pairing is the one you described: sentences 0 and 1 are a pair, 2 and 3 are a pair, and 4 and 5 are a pair.
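To make the layout concrete, here is a minimal pure-Python sketch of how a flat list of interleaved pairs could turn into the 3x3 logits matrix. It is not the repository's code: the helper names `cosine` and `pairwise_logits`, the 2-d toy embeddings, and the temperature value `temp=0.05` are all assumptions for illustration. Dividing the cosine similarity by a temperature reflects my reading of the linked `cls.sim` code and should be checked against the version you are running.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pairwise_logits(embeddings, temp=0.05):
    """Hypothetical helper. `embeddings` is a flat list with pairs interleaved:
    [x_0, x_0^+, x_1, x_1^+, ...]. Entry [i][j] of the result compares the
    first sentence of pair i with the second sentence of pair j."""
    z1 = embeddings[0::2]  # sentences 0, 2, 4 (the x_i)
    z2 = embeddings[1::2]  # sentences 1, 3, 5 (the x_i^+)
    return [[cosine(a, b) / temp for b in z2] for a in z1]

# Made-up 2-d embeddings standing in for the encoder output of six sentences.
toy = [
    [1.0, 0.0], [0.9, 0.1],  # sentences 0 and 1 (pair 0)
    [0.0, 1.0], [0.1, 0.9],  # sentences 2 and 3 (pair 1)
    [1.0, 1.0], [0.9, 1.1],  # sentences 4 and 5 (pair 2)
]

logits = pairwise_logits(toy)
# For the contrastive loss, the "correct class" of row i is column i.
labels = list(range(len(logits)))

for row in logits:
    print([round(x, 2) for x in row])
```

Under these assumptions, row i holds the scores of sentence 2i against sentences 1, 3, and 5, so the diagonal entries are the scores of the true pairs and the off-diagonal entries are the in-batch negatives.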

VictorMatt01 commented 2 years ago

Sorry for the incomplete question, that's entirely my fault, and thanks a lot for the help. Below are the similarity scores that logits returns; I assume they should be read as follows?

[Image: Screenshot 2022-05-15 at 09 25 13]

One last question: if I train the model in the supervised setting with 3 sentences per input, is the format the same, i.e. (sen0, sen1, sen2), (sen3, sen4, sen5), ...?

gaotianyu1350 commented 2 years ago

Hi,

Yes, your understanding is correct. And if you want to train with 3 sentences, the format you mentioned is also correct.
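For the 3-sentence case, a rough sketch of what the logits might look like is given below. It assumes each triplet is (anchor, positive, hard negative) and that the anchor-to-negative similarities are appended as extra columns, giving an N x 2N matrix. The names `triplet_logits` and `cross_entropy`, the toy vectors, and `temp=0.05` are hypothetical, and the sketch leaves out any extra weighting that the real training code may apply to hard negatives.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def triplet_logits(embeddings, temp=0.05):
    """Hypothetical helper. `embeddings` is a flat list of triplets:
    [sen0, sen1, sen2, sen3, sen4, sen5, ...] where each group of three is
    (anchor, positive, hard negative). Returns an N x 2N matrix."""
    z1 = embeddings[0::3]  # anchors
    z2 = embeddings[1::3]  # positives
    z3 = embeddings[2::3]  # hard negatives
    return [
        [cosine(a, p) / temp for p in z2] + [cosine(a, n) / temp for n in z3]
        for a in z1
    ]

def cross_entropy(logits, labels):
    """Mean softmax cross-entropy, computed with the log-sum-exp trick."""
    total = 0.0
    for row, y in zip(logits, labels):
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[y]
    return total / len(logits)

# Made-up 2-d embeddings for two triplets.
toy = [
    [1.0, 0.0], [0.9, 0.1], [-1.0, 0.0],  # sen0, sen1, sen2
    [0.0, 1.0], [0.1, 0.9], [0.0, -1.0],  # sen3, sen4, sen5
]

logits = triplet_logits(toy)
labels = list(range(len(logits)))  # row i should prefer column i (its positive)
loss = cross_entropy(logits, labels)
print(len(logits), len(logits[0]), loss)
```

In this sketch the first N columns are read exactly as in the pair case, and columns N to 2N-1 are the scores of each anchor against every hard negative in the batch.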