huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Different usage between BertModel and AlbertModel #2386

Closed · renjunxiang closed this issue 4 years ago

renjunxiang commented 4 years ago

❓ Questions & Help

Hi~

bert_path = 'D:/pretrain/pytorch/albert_base/'
tokenizer = BertTokenizer.from_pretrained(bert_path)
BERT = BertModel.from_pretrained(bert_path)
...
with torch.no_grad():
    last_hidden_states = BERT(input_ids)[0] 

I found that last_hidden_states was not fixed: the output changed each time I reloaded BertModel.from_pretrained(bert_path).

bert_path = 'D:/pretrain/pytorch/albert_base/'
tokenizer = BertTokenizer.from_pretrained(bert_path)
BERT = AlbertModel.from_pretrained(bert_path)
...
with torch.no_grad():
    last_hidden_states = BERT(input_ids)[0] 

I found that last_hidden_states was fixed. But when I tried

bert_path = 'D:/pretrain/pytorch/chinese_roberta_wwm_ext/'
tokenizer = BertTokenizer.from_pretrained(bert_path)
BERT = RobertaModel.from_pretrained(bert_path)
...
with torch.no_grad():
    last_hidden_states = BERT(input_ids)[0] 

I found last_hidden_states was still not fixed.

bert_path = 'D:/pretrain/pytorch/chinese_roberta_wwm_ext/'
tokenizer = BertTokenizer.from_pretrained(bert_path)
BERT = BertModel.from_pretrained(bert_path)
...
with torch.no_grad():
    last_hidden_states = BERT(input_ids)[0] 

I found last_hidden_states was fixed.

Is there any difference in usage between BertModel, AlbertModel and RobertaModel?

In my past projects, I used BERT (frozen) + LSTM. This is my first time using ALBERT.

Thanks~

BramVanroy commented 4 years ago

Did you call model.eval() to disable dropout (and put the norm layers into eval mode) before torch.no_grad()?
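For reference, a minimal sketch of that check (assuming the hub's bert-base-cased checkpoint; the first element of the model output is taken to be the last hidden state):

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
model = BertModel.from_pretrained('bert-base-cased')
model.eval()  # put dropout (and norm layers) into inference mode

input_ids = torch.tensor([tokenizer.encode("Hello world")])
with torch.no_grad():
    first = model(input_ids)[0]
    second = model(input_ids)[0]

print(torch.allclose(first, second))  # should print True once eval() is set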

renjunxiang commented 4 years ago

Yes. Because they didn't throw any exception, I'm a little confused about their usage.

import torch
from transformers import BertTokenizer, BertModel
from transformers import AlbertTokenizer, AlbertModel
from transformers import RobertaTokenizer, RobertaModel

device = 'cuda:0'

# https://storage.googleapis.com/albert_models/albert_base_zh.tar.gz
bert_path = 'D:/pretrain/pytorch/albert_base/'
tokenizer = BertTokenizer.from_pretrained(bert_path)
BERT = AlbertModel.from_pretrained(bert_path) # fixed

'''
bert_path = 'D:/pretrain/pytorch/albert_base/'
tokenizer = BertTokenizer.from_pretrained(bert_path)
BERT = BertModel.from_pretrained(bert_path) # random output
'''

'''
# https://drive.google.com/open?id=1eHM3l4fMo6DsQYGmey7UZGiTmQquHw25
bert_path = 'D:/pretrain/pytorch/chinese_roberta_wwm_ext/'
tokenizer = BertTokenizer.from_pretrained(bert_path)
BERT = BertModel.from_pretrained(bert_path) # fixed
'''

'''
bert_path = 'D:/pretrain/pytorch/chinese_roberta_wwm_ext/'
tokenizer = BertTokenizer.from_pretrained(bert_path)
BERT = RobertaModel.from_pretrained(bert_path) # random output
'''

BERT.eval()
BERT = BERT.to(device)

# one 7-token example: [CLS] 我 爱 北 京 [SEP] [PAD] ("I love Beijing")
text_seqs = []
segments_ids = []
text_seq = tokenizer.convert_tokens_to_ids(['[CLS]', '我', '爱', '北', '京', '[SEP]', '[PAD]'])
text_seqs.append(text_seq)
segments_ids.append([0] * 7)
text_seqs = torch.LongTensor(text_seqs).to(device)
segments_ids = torch.LongTensor(segments_ids).to(device)

# attention mask: 0 at [PAD] positions (token id 0), 1 elsewhere
mask_bert = torch.where(text_seqs == 0,
                        torch.zeros_like(text_seqs),
                        torch.ones_like(text_seqs))
with torch.no_grad():
    sentence_features, _ = BERT(text_seqs, token_type_ids=segments_ids, attention_mask=mask_bert)
sentence_features = sentence_features[-1]  # the last (and only) example in the batch

for i in sentence_features:
    print(i[:4])

LysandreJik commented 4 years ago

@renjunxiang, you seem to be using the same pretrained checkpoint for both BERT and ALBERT. This should crash as these models are not the same.

Do you face the same issue when loading from pretrained checkpoints hosted on our S3 (bert-base-cased and albert-base-v2, for example)?

renjunxiang commented 4 years ago

@LysandreJik Yes, I used the same pretrained Chinese ALBERT model provided by Google (albert_base_zh.tar), and I used convert_albert_original_tf_checkpoint_to_pytorch.py to convert it to PyTorch.

Because BertModel and AlbertModel didn't throw any exception, I thought they were interchangeable. Maybe the random output comes from keys that go missing when the checkpoint is loaded into the other class? Like https://github.com/huggingface/transformers/issues/2387#issuecomment-571586232
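One rough way to check that guess is to ask from_pretrained for its loading report (a sketch; output_loading_info is assumed to be supported by the installed transformers version):

from transformers import BertModel

# ALBERT weights loaded into the BERT class on purpose (path from above)
bert_path = 'D:/pretrain/pytorch/albert_base/'
model, loading_info = BertModel.from_pretrained(bert_path, output_loading_info=True)

# parameters listed under missing_keys were left randomly initialized,
# which would explain the hidden states changing on every reload
print(loading_info['missing_keys'])
print(loading_info['unexpected_keys'])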

bert-base-cased and albert-base-v2 are tied to their matching classes (BertModel and AlbertModel respectively), so they are not interchangeable.

In my past projects, I used BertModel.from_pretrained to load pretrained models such as bert-base-chinese and chinese_roberta_wwm_ext.

I found RobertaModel could load chinese_roberta_wwm_ext and didn't throw any exception, but the output was random.

So is there any difference in usage between RobertaModel and BertModel if I want to get the last_hidden_states? In my mind, RoBERTa is a variant of BERT.

thanks~

BramVanroy commented 4 years ago

It's not really clear what you are trying to say. The models are obviously different, so use the appropriate class for the appropriate model (BertModel for BERT weights, RobertaModel for RoBERTa weights). That being said, retrieving the last hidden states should be similar. You can compare the docs of each model.
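For example, a minimal sketch (using the hub checkpoints suggested above; the first element of the model output is taken to be the last hidden state) showing that the retrieval pattern is the same for both classes:

import torch
from transformers import BertTokenizer, BertModel, AlbertTokenizer, AlbertModel

pairs = [(BertTokenizer, BertModel, 'bert-base-cased'),
         (AlbertTokenizer, AlbertModel, 'albert-base-v2')]

for tokenizer_cls, model_cls, name in pairs:
    tokenizer = tokenizer_cls.from_pretrained(name)
    model = model_cls.from_pretrained(name)
    model.eval()
    input_ids = torch.tensor([tokenizer.encode("Hello world")])
    with torch.no_grad():
        last_hidden_states = model(input_ids)[0]  # same indexing for both models
    print(name, last_hidden_states.shape)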

renjunxiang commented 4 years ago

Thanks! I'll check it out.