princeton-nlp / SimCSE

[EMNLP 2021] SimCSE: Simple Contrastive Learning of Sentence Embeddings https://arxiv.org/abs/2104.08821
MIT License

The alignment computed with the function implemented by Wang and Isola differs a lot from the paper #85

Closed xbdxwyh closed 3 years ago

xbdxwyh commented 3 years ago

The alignment computed with the function implemented by Wang and Isola differs a lot from your paper. I compute the alignment with that function directly and get a score of 1.21, but as shown in Fig. 3 of the paper the score is less than 0.25. Could you tell me how to compute the alignment in this paper? My code is as follows:

import torch
import torch.nn.functional as F
from transformers import AutoModel

# Alignment and uniformity losses as defined by Wang and Isola (2020)
def align_loss(x, y, alpha=2):
    return (x - y).norm(p=2, dim=1).pow(alpha).mean()

def uniform_loss(x, t=2):
    return torch.pdist(x, p=2).pow(2).mul(-t).exp().mean().log()

def get_pair_emb(model, input_ids, attention_mask, token_type_ids):
    outputs = model(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
    pooler_output = outputs.pooler_output
    # batch_size (number of sentence pairs per batch) is assumed to be defined globally
    pooler_output = pooler_output.view((batch_size, 2, pooler_output.size(-1)))
    z1, z2 = pooler_output[:, 0], pooler_output[:, 1]
    return z1, z2

def get_align(model, dataloader):
    align_all = []
    with torch.no_grad():        
        for data in dataloader:
            input_ids = torch.cat((data['input_ids'][0],data['input_ids'][1])).cuda()
            attention_mask = torch.cat((data['attention_mask'][0],data['attention_mask'][1])).cuda()
            token_type_ids = torch.cat((data['token_type_ids'][0],data['token_type_ids'][1])).cuda()

            z1,z2 = get_pair_emb(model, input_ids, attention_mask, token_type_ids)        
            z1 = F.normalize(z1,p=2,dim=1)
            z2 = F.normalize(z2,p=2,dim=1)

            align_all.append(align_loss(z1, z2, alpha=2))

    return align_all

def get_unif(model, dataloader):
    unif_all = []
    with torch.no_grad():        
        for data in dataloader:
            input_ids = torch.cat((data['input_ids'][0],data['input_ids'][1])).cuda()
            attention_mask = torch.cat((data['attention_mask'][0],data['attention_mask'][1])).cuda()
            token_type_ids = torch.cat((data['token_type_ids'][0],data['token_type_ids'][1])).cuda()

            z1,z2 = get_pair_emb(model, input_ids, attention_mask, token_type_ids)        
            z1 = F.normalize(z1,p=2,dim=1)
            z2 = F.normalize(z2,p=2,dim=1)
            z = torch.cat((z1,z2))
            unif_all.append(uniform_loss(z, t=2))

    return unif_all

model = AutoModel.from_pretrained("princeton-nlp/unsup-simcse-bert-base-uncased")
model = model.cuda()
model_name = "unsup-simcse-bert-base-uncased"

# pos_loader: a DataLoader over the STS-B sentence pairs with gold score > 4
align_all = get_align(model, pos_loader)

align = sum(align_all) / len(align_all)
gaotianyu1350 commented 3 years ago

Hi,

For unsupervised models, you should use the representation before pooling, so taking outputs.pooler_output is wrong here. Also note that the data we use to calculate alignment are the STS-B sentence pairs with gold scores higher than 4.
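A minimal sketch of that setup (the STS-B loading path via the HF datasets library and the exact split are assumptions here, not the authors' script): filter pairs with gold score above 4 and embed them with the pre-pooler [CLS] vector.

import torch
from datasets import load_dataset
from transformers import AutoModel, AutoTokenizer

name = "princeton-nlp/unsup-simcse-bert-base-uncased"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name).eval()

# Assumption: STS-B taken from GLUE via the `datasets` library; gold scores are 0-5
stsb = load_dataset("glue", "stsb", split="validation")
pairs = [(ex["sentence1"], ex["sentence2"]) for ex in stsb if ex["label"] > 4.0]

@torch.no_grad()
def embed(sentences):
    batch = tok(sentences, padding=True, truncation=True, return_tensors="pt")
    return model(**batch).last_hidden_state[:, 0]  # [CLS] before the MLP pooler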

xbdxwyh commented 3 years ago

Hi, thank you for your prompt reply!

The score of 1.2 is computed using the representation before pooling (pooler_output = outputs.last_hidden_state[:,0]); when the representation after pooling is used, the score is 1.632. In both cases we use the sentence pairs that have scores higher than 4 in STS-B.

Looking forward to your reply!

gaotianyu1350 commented 3 years ago

Interesting... In that case, the average cosine similarity between two positive sentences would be ~0.4, which doesn't look right to me. For positive pairs, the cosine similarity can be very high (> 0.8 in general). Maybe take the original BERT as a starting point for debugging?
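A quick sanity check in that spirit (just a sketch): the average cosine similarity over the positive pairs; if it sits near 0.4 instead of above 0.8, the pairing is probably scrambled somewhere.

import torch.nn.functional as F

def avg_pos_cosine(z1, z2):
    # z1, z2: (batch, hidden) embeddings of the two sides of each positive pair
    return F.cosine_similarity(z1, z2, dim=1).mean().item()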

xbdxwyh commented 3 years ago

Thanks for your answer! We made a mistake in the reshape step (in the get_pair_emb function). After fixing the reshape, we get the same results as the paper. Thanks!
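The exact fix isn't posted in the thread; a plausible correction, given that the two sentence batches are concatenated along dim 0 before the forward pass, is to split the halves instead of reshaping:

# Rows 0..batch_size-1 are sentence A and rows batch_size..2*batch_size-1 are sentence B,
# because input_ids were built with torch.cat((A_batch, B_batch)).
# view((batch_size, 2, hidden)) pairs consecutive rows and scrambles the positives;
# view((2, batch_size, hidden)) or simple slicing keeps each pair aligned.
z1, z2 = pooler_output[:batch_size], pooler_output[batch_size:]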

xbdxwyh commented 3 years ago

Hi, when we compute the alignment of the model unsup-simcse-bert-base-uncased, we get 0.2155, the same result as in the paper. But when we use the model sup-simcse-bert-base-uncased, we get an alignment of 0.1286, which is still different.

Furthermore, we don't get the same uniformity as in the paper. Using the model unsup-simcse-bert-base-uncased with all sentences from STS-Benchmark, we get a uniformity of about -2.3116, but the paper reports about -2.7. The code is the get_unif function shown above.

Many thanks!

gaotianyu1350 commented 3 years ago

Hi,

You should use the MLP pooler when you use the supervised model. Regarding uniformity, I'm not really sure where the difference comes from; my guess is that we somehow used a different proportion of the data. There is also a chance that I didn't concatenate sentence A and sentence B together (i.e., only calculated the alignment between sentA and sentB), but I'm not sure since I have cleaned up the code. This shouldn't affect the analysis anyway, as long as the calculation is consistent.
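A sketch of that pooling distinction (assumed from this reply; it also assumes the HF checkpoint stores the trained MLP as the standard BERT pooler, so pooler_output returns the MLP output):

import torch
from transformers import AutoModel

sup = AutoModel.from_pretrained("princeton-nlp/sup-simcse-bert-base-uncased").eval()
unsup = AutoModel.from_pretrained("princeton-nlp/unsup-simcse-bert-base-uncased").eval()

@torch.no_grad()
def sup_embed(batch):
    return sup(**batch).pooler_output  # [CLS] passed through the MLP pooler (assumed)

@torch.no_grad()
def unsup_embed(batch):
    return unsup(**batch).last_hidden_state[:, 0]  # pre-pooler [CLS]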

xbdxwyh commented 3 years ago

Thanks for your patient answer!