Hi!
If you saved the BertForMultipleChoice model to a directory, you can then load the weights into BertForMaskedLM simply by using the from_pretrained(dir_name) method. The transformer weights will be re-used by BertForMaskedLM, and the weights corresponding to the multiple-choice classifier will be ignored.
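A rough sketch of that workflow, assuming the fine-tuned model was saved the way the library's examples (e.g. run_swag.py) do it, i.e. the state dict as pytorch_model.bin plus the config as bert_config.json; the directory name is just a placeholder:

```python
import os
import torch
from pytorch_pretrained_bert import BertForMultipleChoice, BertForMaskedLM

output_dir = './tmp/swag_output'  # placeholder directory
os.makedirs(output_dir, exist_ok=True)

# fine-tune a BertForMultipleChoice (training loop omitted), then save
# its weights and config the way the library's examples do
model = BertForMultipleChoice.from_pretrained('bert-base-uncased', num_choices=4)
torch.save(model.state_dict(), os.path.join(output_dir, 'pytorch_model.bin'))
with open(os.path.join(output_dir, 'bert_config.json'), 'w') as f:
    f.write(model.config.to_json_string())

# load the saved directory into a masked-LM model: the shared transformer
# weights are re-used, the multiple-choice classifier weights are ignored
mlm = BertForMaskedLM.from_pretrained(output_dir)
mlm.eval()
```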
Hi! Thanks for answering me. This is what I did at first, which resulted in the following output: as you can see, the output tensors are all zeros, which seems really weird!
Although this might happen, I still want to confirm that I am doing the right thing: I am basically calculating each masked word's probability, and some of them are zero, which makes the final sentence probability zero.
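In other words (my reading of the approach described above), the sentence score being computed is a pseudo-likelihood: the product of the model's probability for each token at its own position,

$$
P(\text{sentence}) \approx \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1}, w_{i+1}, \dots, w_n),
$$

so any near-zero per-token probability drives the whole product toward zero.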
Could you share a code snippet that reproduces what you're trying to do so that I can try and see on my side?
For sure!
```python
import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel, BertForMaskedLM
import numpy as np
import math

# OPTIONAL: if you want more information on what's happening, activate the logger as follows
import logging
logging.basicConfig(level=logging.INFO)


def predict(text, bert_model, bert_tokenizer):
    # Tokenized input
    # text = "[CLS] I got restricted because Tom reported my reply [SEP]"
    text = "[CLS] " + text + " [SEP]"
    tokenized_text = bert_tokenizer.tokenize(text)
    sentence_prob = 1
    # score every position between [CLS] and [SEP]
    for masked_index in range(1, len(tokenized_text) - 1):
        # Mask a token that we will try to predict back with `BertForMaskedLM`
        masked_word = tokenized_text[masked_index]
        # tokenized_text[masked_index] = '[MASK]'
        # Convert tokens to vocabulary indices
        indexed_tokens = bert_tokenizer.convert_tokens_to_ids(tokenized_text)
        # Single-sentence input => all segment ids are 0
        length = len(tokenized_text)
        segments_ids = [0 for _ in range(length)]
        # Convert inputs to PyTorch tensors
        tokens_tensor = torch.tensor([indexed_tokens])
        segments_tensors = torch.tensor([segments_ids])
        # If you have a GPU, put everything on cuda
        tokens_tensor = tokens_tensor.to('cuda')
        segments_tensors = segments_tensors.to('cuda')
        bert_model.to('cuda')
        # Load pre-trained model (weights)
        # bert_model = BertForMaskedLM.from_pretrained('bert-large-uncased')
        # bert_model.eval()
        # Predict all tokens
        with torch.no_grad():
            predictions = bert_model(tokens_tensor, segments_tensors)
        predictions = torch.nn.functional.softmax(predictions, -1)
        index = bert_tokenizer.convert_tokens_to_ids([masked_word])[0]
        curr_prob = predictions[0, masked_index][index]
        if curr_prob.item() != 0:
            # print(curr_prob.item())
            sentence_prob *= curr_prob.item()
        # tokenized_text[masked_index] = masked_word
    # return math.pow(sentence_prob, 1 / (len(tokenized_text) - 3))
    return sentence_prob


# Load pre-trained model tokenizer (vocabulary)
tokenizer = BertTokenizer.from_pretrained('./tmp/swag_output')
# Load pre-trained model (weights)
model = BertForMaskedLM.from_pretrained('./tmp/swag_output')
model.eval()

with open("Sentence4leyang.txt", "r") as f:
    file = f.readlines()

num = len(file)
count = 0
curr = 0
for i in file:
    label, sentence_1, sentence_2, sentence_3 = i.split("\001")
    print(label[0])
    prob_1 = predict(sentence_1, bert_model=model, bert_tokenizer=tokenizer)
    prob_2 = predict(sentence_2, bert_model=model, bert_tokenizer=tokenizer)
    prob_3 = predict(sentence_3, bert_model=model, bert_tokenizer=tokenizer)
    answer = max(prob_1, prob_2, prob_3)
    print(prob_1, prob_2, prob_3)
```
For the txt file, you could just create some sentences to replace it. We used the weights obtained after fine-tuning BERT with the official run_swag.py example.
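For reference, a hypothetical snippet that writes a small replacement file in the format the script above expects (one example per line: a label and three candidate sentences separated by the "\001" character; the sentences here are just dummies):

```python
# write a tiny placeholder "Sentence4leyang.txt" in the expected format
rows = [
    ("1", "The cat sat on the mat.", "The cat sat on the sky.", "The cat sat on the soup."),
    ("2", "She poured water into a glass.", "She poured water into music.", "She poured water into Tuesday."),
]
with open("Sentence4leyang.txt", "w") as f:
    for label, s1, s2, s3 in rows:
        f.write("\001".join([label, s1, s2, s3]) + "\n")
```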
If you fine-tuned a BertForMultipleChoice and load it into BertForMaskedLM, some weights will be initialized randomly and not trained. This is indicated in this part of your output:
If you use this model with untrained weights you will get random output. You need to train these weights on a downstream task.
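One way to see exactly which parameters are affected is to compare the keys stored in the fine-tuned checkpoint with the parameters BertForMaskedLM expects (checkpoint path assumed from the snippet above); anything missing, e.g. the cls.predictions.* language-model head, is randomly initialized at load time:

```python
import torch
from pytorch_pretrained_bert import BertForMaskedLM

# keys saved by the fine-tuned BertForMultipleChoice run (path assumed)
checkpoint = torch.load('./tmp/swag_output/pytorch_model.bin', map_location='cpu')

# keys a BertForMaskedLM actually needs
model = BertForMaskedLM.from_pretrained('bert-base-uncased')

missing = sorted(set(model.state_dict().keys()) - set(checkpoint.keys()))
print('weights not found in the checkpoint (randomly initialized on load):')
for name in missing:
    print(' ', name)
```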
Hi, thanks for the response @thomwolf.
However, from my perspective, even if you use the vanilla bert-base-uncased model, BertForMaskedLM still runs perfectly without any random initialization. And I assume BertForMultipleChoice is simply the original bert-base-uncased model with an additional linear classifier layer.
Therefore, I think there should be a way to keep only the 'BERT model' without the linear layer after fine-tuning. I think this feature could be really helpful for researchers investigating the transferability of the models.
No, unfortunately.
The model used for pretraining BERT, and the one we provide on our AWS S3 bucket, is BertForPreTraining, which has 2 heads: (i) the masked LM head and (ii) the next sentence prediction head.
BertForMaskedLM is a subset of BertForPreTraining which keeps only the first head => all of its weights are initialized from pretrained weights, so if you initialize it from the provided weights you can use it out-of-the-box.
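For instance, a minimal sketch of that out-of-the-box usage (not from the thread; the model name and example sentence are placeholders):

```python
import torch
from pytorch_pretrained_bert import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
model.eval()

# mask one token and predict it back
text = "[CLS] the cat sat on the [MASK] . [SEP]"
tokens = tokenizer.tokenize(text)
masked_index = tokens.index('[MASK]')
tokens_tensor = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])

with torch.no_grad():
    predictions = model(tokens_tensor)  # (1, seq_len, vocab_size) logits
predicted_id = predictions[0, masked_index].argmax().item()
print(tokenizer.convert_ids_to_tokens([predicted_id])[0])
```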
BertForMultipleChoice does NOT have a masked LM head and has a multiple-choice head instead => if you train this model and use it to initialize a BertForMaskedLM, you won't initialize the language model head.
If you don't remember: just look at the log during model initialization. If it says Weights from XXX not initialized from pretrained model, it means you have to train those weights before using the model.
We will make the documentation clearer on that.
For your specific use-case, a solution could be to build a model yourself, similarly to the way the models are built in the library, keeping the language modeling head as well as the other heads you want, and then fine-tune the newly added head on your dataset.
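Alternatively, for the narrower goal in this thread (re-using the fine-tuned encoder while keeping a trained LM head), a cruder workaround is plain state-dict surgery rather than a custom model class. This is only a sketch, not a library feature, and it assumes the fine-tuned checkpoint was saved as pytorch_model.bin:

```python
import torch
from pytorch_pretrained_bert import BertForMaskedLM

# start from the stock model so the cls.predictions.* LM head keeps its
# pretrained weights
mlm = BertForMaskedLM.from_pretrained('bert-base-uncased')

# load the fine-tuned BertForMultipleChoice checkpoint (path assumed)
state = torch.load('./tmp/swag_output/pytorch_model.bin', map_location='cpu')

# keep only the shared encoder parameters ("bert.*"); the multiple-choice
# classifier is dropped, and strict=False leaves the LM head untouched
encoder_only = {k: v for k, v in state.items() if k.startswith('bert.')}
mlm.load_state_dict(encoder_only, strict=False)
mlm.eval()
```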
Hi, I am currently using this code to research the transferability of those pre-trained models, and I wonder how I could apply the fine-tuned parameters of one model to another model. For example, I fine-tuned BertForMultipleChoice and got the pytorch_model.bin; what if I want to use those weights in BertForMaskedLM?
I believe there should be a way to do that, since the two models differ only in the linear layer. However, simply using the BertForMaskedLM.from_pretrained method is problematic.