Did you call model.eval() to disable dropout and put the norm layers in inference mode before torch.no_grad()?
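For reference, a minimal toy sketch (a made-up two-layer module, not your model) of the call order I mean:

import torch
import torch.nn as nn

# toy module with dropout to show what eval() changes
net = nn.Sequential(nn.Linear(4, 4), nn.Dropout(p=0.5))
x = torch.ones(1, 4)

net.train()
print(net(x))             # varies between calls: dropout is active

net.eval()                # dropout off, norm layers use running statistics
with torch.no_grad():     # additionally skips gradient bookkeeping
    print(net(x))         # identical between calls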
Yes. Because they didn't throw any exception, I'm a little confused about their usage.
import torch
from transformers import BertTokenizer, BertModel
from transformers import AlbertTokenizer, AlbertModel
from transformers import RobertaTokenizer, RobertaModel

device = 'cuda:0'

# https://storage.googleapis.com/albert_models/albert_base_zh.tar.gz
bert_path = 'D:/pretrain/pytorch/albert_base/'
tokenizer = BertTokenizer.from_pretrained(bert_path)
BERT = AlbertModel.from_pretrained(bert_path)  # fixed: same output on every reload
'''
bert_path = 'D:/pretrain/pytorch/albert_base/'
tokenizer = BertTokenizer.from_pretrained(bert_path)
BERT = BertModel.from_pretrained(bert_path)  # random: output changes on every reload
'''
'''
# https://drive.google.com/open?id=1eHM3l4fMo6DsQYGmey7UZGiTmQquHw25
bert_path = 'D:/pretrain/pytorch/chinese_roberta_wwm_ext/'
tokenizer = BertTokenizer.from_pretrained(bert_path)
BERT = BertModel.from_pretrained(bert_path)  # fixed: same output on every reload
'''
'''
bert_path = 'D:/pretrain/pytorch/chinese_roberta_wwm_ext/'
tokenizer = BertTokenizer.from_pretrained(bert_path)
BERT = RobertaModel.from_pretrained(bert_path)  # random: output changes on every reload
'''

BERT.eval()
BERT = BERT.to(device)

text_seqs = []
segments_ids = []
text_seq = tokenizer.convert_tokens_to_ids(['[CLS]', '我', '爱', '北', '京', '[SEP]', '[PAD]'])
text_seqs.append(text_seq)
segments_ids.append([0] * 7)
text_seqs = torch.LongTensor(text_seqs).to(device)
segments_ids = torch.LongTensor(segments_ids).to(device)

# [PAD] has id 0 in the BERT vocab, so mask out positions whose id is 0
mask_bert = torch.where(text_seqs == 0,
                        torch.zeros_like(text_seqs),
                        torch.ones_like(text_seqs))

with torch.no_grad():
    sentence_features, _ = BERT(text_seqs, token_type_ids=segments_ids, attention_mask=mask_bert)
    sentence_features = sentence_features[-1]

for i in sentence_features:
    print(i[:4])
@renjunxiang, you seem to be using the same pretrained checkpoint for both BERT and ALBERT. This should crash, as these models are not the same.
Do you face the same issue when loading from pretrained checkpoints hosted on our S3 (bert-base-cased and albert-base-v2, for example)?
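A minimal sketch of that check, assuming the hub names above and the tuple-returning from_pretrained/forward API of transformers 2.x:

import torch
from transformers import BertTokenizer, BertModel, AlbertTokenizer, AlbertModel

# matched class/checkpoint pairs from the hub
bert_tok = BertTokenizer.from_pretrained('bert-base-cased')
bert = BertModel.from_pretrained('bert-base-cased').eval()
albert_tok = AlbertTokenizer.from_pretrained('albert-base-v2')
albert = AlbertModel.from_pretrained('albert-base-v2').eval()

ids = torch.tensor([bert_tok.encode("Hello world", add_special_tokens=True)])
with torch.no_grad():
    out = bert(ids)[0]    # last_hidden_state; should be identical across reloads
print(out[0, :, :4])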
@LysandreJik Yes, I used the same pretrained Chinese ALBERT model provided by Google (albert_base_zh.tar) and used convert_albert_original_tf_checkpoint_to_pytorch.py to convert the model.
Because BertModel and AlbertModel didn't throw any exception, I thought they were interchangeable. Maybe the reason for the random output is keys missing between BertModel and AlbertModel, like https://github.com/huggingface/transformers/issues/2387#issuecomment-571586232?
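One way to check that, a sketch assuming from_pretrained's output_loading_info flag (present in the transformers releases of that era) and the bert_path from the script above:

from transformers import BertModel

# surface the weights that were silently skipped or randomly initialized
# when the model class does not match the checkpoint
model, info = BertModel.from_pretrained(bert_path, output_loading_info=True)
print(info['missing_keys'])     # expected by the class, absent from the checkpoint
print(info['unexpected_keys'])  # present in the checkpoint, ignored by the class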
bert-base-cased and albert-base-v2 are tied to their corresponding classes (BertModel and AlbertModel respectively), so they are not interchangeable.
In my past projects, I used BertModel.from_pretrained to load pretrained models such as bert-base-chinese and chinese_roberta_wwm_ext.
I found RobertaModel could load chinese_roberta_wwm_ext without throwing any exception, but the output was random.
So is there some difference in usage between RobertaModel and BertModel if I want to get the last_hidden_states? In my mind, RoBERTa is a variant of BERT.
Thanks~
It's not really clear what you are trying to say. The models are obviously different, so use the appropriate init for the appropriate model (BERT for BERT weights, RoBERTa for RoBERTa weights). That being said, retrieving the last hidden states should be similar. You can compare the docs:
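For example, retrieval looks the same for both classes under the tuple-return API used above; a sketch with bert-base-cased standing in for any matched checkpoint:

import torch
from transformers import BertTokenizer, BertModel

tok = BertTokenizer.from_pretrained('bert-base-cased')
model = BertModel.from_pretrained('bert-base-cased').eval()

input_ids = torch.tensor([tok.encode("Hello world", add_special_tokens=True)])
with torch.no_grad():
    outputs = model(input_ids)
# the first element of the returned tuple is last_hidden_state for
# BertModel and RobertaModel alike
last_hidden_states = outputs[0]  # (batch_size, seq_len, hidden_size)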
Thanks! I'll check it out.
❓ Questions & Help
Hi~
I found last_hidden_states was not fixed when I reloaded BertModel.from_pretrained(bert_path) with the ALBERT checkpoint. When I tried AlbertModel.from_pretrained(bert_path), I found last_hidden_states was fixed. But when I tried RobertaModel.from_pretrained with chinese_roberta_wwm_ext, I found last_hidden_states was still not fixed, while with BertModel.from_pretrained it was fixed. Is there any difference in their usage between BertModel, AlbertModel and RobertaModel?
In my past projects, I used BERT (frozen) + LSTM. This is my first time using ALBERT.
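(For context, a hypothetical sketch of that frozen-BERT + LSTM setup; the class name and hidden size are made up:)

import torch
import torch.nn as nn
from transformers import BertModel

class BertLSTM(nn.Module):
    def __init__(self, bert_path, hidden_size=256):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_path)
        for p in self.bert.parameters():
            p.requires_grad = False  # freeze the encoder
        self.lstm = nn.LSTM(self.bert.config.hidden_size, hidden_size,
                            batch_first=True)

    def forward(self, input_ids, attention_mask=None):
        self.bert.eval()             # keep dropout off in the frozen encoder
        with torch.no_grad():
            features = self.bert(input_ids, attention_mask=attention_mask)[0]
        output, _ = self.lstm(features)
        return output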
Thanks~