huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Discrepancy in results (BertModel) between pytorch_pretrained_bert and transformers #2506

Closed chikubee closed 4 years ago

chikubee commented 4 years ago

🐛 Bug

Model I am using: BERT

Language I am using the model on (English, Chinese....): English

# New API: transformers
from transformers import BertTokenizer, BertModel
tokenizer2 = BertTokenizer.from_pretrained('bert-base-uncased')
model2 = BertModel.from_pretrained('bert-base-uncased', output_hidden_states=True, output_attentions=True)
model2.eval()

import torch
# Old API: pytorch_pretrained_bert (note: this import shadows the transformers
# classes imported above; tokenizer2/model2 have already been built)
from pytorch_pretrained_bert import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()

def get_tokenized_text(text):
    marked_text = "[CLS] " + text + " [SEP]"
    tokenized_text = tokenizer.tokenize(marked_text)
    return tokenized_text

def get_embeddings_concat_last_4(doc):
    indexed_tokens = tokenizer.convert_tokens_to_ids(doc)
    segments_ids = [1] * len(doc)

    tokens_tensor = torch.tensor([indexed_tokens])
    segments_tensors = torch.tensor([segments_ids])

    with torch.no_grad():
        encoded_layers, _ = model(tokens_tensor, segments_tensors)  

        token_embeddings = torch.stack(encoded_layers, dim=0)
        token_embeddings = torch.squeeze(token_embeddings, dim=1)
        token_embeddings = token_embeddings.permute(1,0,2)

    token_vecs_cat = []
    for token in token_embeddings:
        cat_vec = torch.cat((token[-1], token[-2], token[-3], token[-4]), dim=0)
        token_vecs_cat.append(cat_vec)
    return token_vecs_cat

def get_embeddings_transformers(text, tokenizer2, model2):
    input_ids = torch.tensor([tokenizer2.encode(text, add_special_tokens=True)])  # Add special tokens takes care of adding [CLS], [SEP], <s>... tokens in the right way for each model.
    with torch.no_grad():
        all_hidden_states, all_attentions = model2(input_ids)[-2:]
    pooled_output = torch.cat(tuple([all_hidden_states[i] for i in [-4, -3, -2, -1]]), dim=-1)
    return pooled_output
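
The snippets below use a get_cosine helper that isn't defined in the post; a minimal sketch of what such a helper might look like (an assumption, not necessarily the original implementation):

import torch.nn.functional as F

def get_cosine(vec1, vec2):
    # Hypothetical helper: cosine similarity between two 1-D token vectors.
    return F.cosine_similarity(vec1.unsqueeze(0), vec2.unsqueeze(0)).item()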

At the sentence level: transformers vs. pytorch_pretrained_bert

out1 = get_embeddings_transformers("programming in C covers coding as well as concepts", tokenizer2, model2)
out2 = get_embeddings_transformers("i want to learn coding", tokenizer2, model2)
get_cosine(out1[0][1], out2[0][5]), get_cosine(out1[0][5], out2[0][5])
out1 = get_embeddings_concat_last_4(get_tokenized_text("programming in C covers coding as well as concepts"))
out2 = get_embeddings_concat_last_4(get_tokenized_text("i want to learn coding"))
get_cosine(out1[1], out2[5]), get_cosine(out1[5], out2[5])

Please find the code snippets above. Model: bert-base-uncased. I am trying to find the similarity between "coding" and "kills" (Sentence 1: "coding", Sentence 2: "Smoking kills"). The similarity when I load the BERT model with pytorch_pretrained_bert is 0.58; when I load it with transformers it is 0.68.

The difference is huge. Can someone tell me why this is happening? @thomwolf

BramVanroy commented 4 years ago

Please edit your post and remove the images. Instead, post the code inside Python code tags. It is hard to read your post like this and impossible to copy-paste and try it out ourselves.

chikubee commented 4 years ago

@BramVanroy Thanks for your response. I have edited the post. Please try to reproduce the results yourself. I'd be thankful to find out why this is happening.

chikubee commented 4 years ago

@BramVanroy I even printed the embedding layers of the pretrained model after loading it via transformers and via pytorch_pretrained_bert; they were noticeably different.

BramVanroy commented 4 years ago

It seems that part of your code is missing or incorrect. You don't seem to initialize the pytorch_pretrained_bert model anywhere. This needs to be fixed, of course.

It also seems that you never called model.eval().

The order of concatenation is different in both cases. (One ascending, other descending.) I'm not sure how important this is in the end. In this case, I don't think it should matter but it's worth checking.

If you can post the real, reproducible and correct code that I just need to copy-paste I can have a better look.

chikubee commented 4 years ago

@BramVanroy Thanks for your quick reply. I had called model.eval() but had not included it here; sorry for the inconvenience, I have updated the snippet. Good point, I'll check again after changing the order of concatenation. But the results were different for the sum as well, which is to be expected, since the raw embeddings obtained by encoded_layers, _ = model(tokens_tensor, segments_tensors) already differ between the two cases.
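
For reference, a quick way to compare the raw final-layer outputs of the two models directly (a sketch assuming the tokenizer, model, and model2 objects loaded above):

ids = torch.tensor([tokenizer.convert_tokens_to_ids(get_tokenized_text("i want to learn coding"))])

with torch.no_grad():
    old_layers, _ = model(ids)   # pytorch_pretrained_bert: (list of layer outputs, pooled output)
    new_outputs = model2(ids)    # transformers: (last_hidden_state, pooler_output, hidden_states, attentions)

# Largest element-wise gap between the two final layers.
print((old_layers[-1] - new_outputs[0]).abs().max())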

chikubee commented 4 years ago

@BramVanroy I tried changing the order of concatenation as well; the results remain unchanged, as you suggested.

BramVanroy commented 4 years ago

You can also check the tokenizers: verify that the tokenization is identical.

If you can provide a full test suite I can test it.
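
A quick check for that, as a sketch assuming the tokenizer (pytorch_pretrained_bert) and tokenizer2 (transformers) objects from the snippet above:

text = "programming in C covers coding as well as concepts"
old_tokens = tokenizer.tokenize("[CLS] " + text + " [SEP]")
new_tokens = tokenizer2.tokenize("[CLS] " + text + " [SEP]")
print(old_tokens == new_tokens)

# encode() adds [CLS]/[SEP] itself, so the resulting ids can be compared as well:
print(tokenizer.convert_tokens_to_ids(old_tokens) == tokenizer2.encode(text, add_special_tokens=True))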

chikubee commented 4 years ago

@BramVanroy Tokenization is the same. I found my mistake: I was passing segment ids to the model loaded with pytorch-pretrained-bert, while I was only passing the token ids to the model loaded with transformers. Thanks for helping me figure this out. Since the inputs were different, the encoded layers would of course differ.

One place where I am still stuck: when I don't add segment ids to the input, the results are much worse. In the transformers documentation we just pass token ids. Why is that, and what are the implications? I fail to understand why it's happening. I have added the test cases here: https://github.com/chikubee/Test-Suite-BERT/blob/master/test-suite.ipynb. Thanks in advance.

BramVanroy commented 4 years ago

Always go back to the source code. The order of the arguments differs between the two libraries. I had actually never noticed this before, but I think it's good practice to always provide the parameter name for optional arguments instead of treating them as positional.

As you can see, in the current implementation the second argument is actually attention_mask:

https://github.com/huggingface/transformers/blob/b8f43cb273a7db25b285d78bf937590dc2ce11fc/src/transformers/modeling_bert.py#L683-L693

In pytorch_pretrained_bert, the second argument is token_type_ids.

https://github.com/huggingface/transformers/blob/b832d5bb8a6dfc5965015b828e577677eace601e/pytorch_pretrained_bert/modeling.py#L709

You can try it again, and explicitly set the kwargs:

model(tokens_tensor, token_type_ids=segments_tensors)
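
On the transformers side, the equivalent explicit call would look something like this (a sketch, assuming input_ids as built in get_embeddings_transformers and all-zero token_type_ids for a single-sentence input):

with torch.no_grad():
    outputs = model2(input_ids, token_type_ids=torch.zeros_like(input_ids))
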
chikubee commented 4 years ago

@BramVanroy Yeah, I saw that; it works just fine, so I will be closing this issue. Thanks for your quick response. Since the segment tensor just indicates portions of the input, I wonder how its absence affects the similarity results that much.

BramVanroy commented 4 years ago

It's because the token_type_ids are expected to be zero for the first segment and ones for the second, and masks are expected to be ones for unmasked tokens and zeros for masked tokens.

In your case it's not so much the absence of token_type_ids (because they are not absent; they get a default value) but that they have the opposite value in the two cases. So in one case you're saying that the segment you are passing is the first one, and in the other case that you're passing in the second segment.
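
To make the effect concrete, a small sketch assuming the pytorch_pretrained_bert objects from the original snippet:

tokens_tensor = torch.tensor([tokenizer.convert_tokens_to_ids(get_tokenized_text("i want to learn coding"))])
segments_second = torch.ones_like(tokens_tensor)   # what the original code passed: every token in the second segment
segments_first = torch.zeros_like(tokens_tensor)   # the default: every token in the first segment

with torch.no_grad():
    layers_ones, _ = model(tokens_tensor, token_type_ids=segments_second)
    layers_zeros, _ = model(tokens_tensor, token_type_ids=segments_first)

# The outputs differ because a different segment embedding is added to every token.
print((layers_ones[-1] - layers_zeros[-1]).abs().max())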

chikubee commented 4 years ago

@BramVanroy Got it, thanks Bram, much appreciated. But when I don't add anything explicitly (which means the default 0 for the first segment), the similarity results are very bad, as documented here: https://github.com/chikubee/Test-Suite-BERT/blob/master/test-suite.ipynb