huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Use the fine-tuned model for another task #959

Closed. XuhuiZhou closed this issue 5 years ago.

XuhuiZhou commented 5 years ago

Hi, I am currently using this code to research the transferability of these pre-trained models, and I wonder how I can apply the fine-tuned parameters of one model to another. For example, I fine-tuned BertForMultipleChoice and got the pytorch_model.bin; what if I want to use those fine-tuned weights in BertForMaskedLM?

I believe there should be a way to do that, since the two models differ only in their final linear layer. However, simply using the BertForMaskedLM.from_pretrained method is problematic.

LysandreJik commented 5 years ago

Hi!

If you saved the model BertForMultipleChoice to a directory, you can then load the weights for the BertForMaskedLM by simply using the from_pretrained(dir_name) method. The transformer weights will be re-used by the BertForMaskedLM and the weights corresponding to the multiple-choice classifier will be ignored.
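
A minimal sketch of that loading step (assuming the fine-tuned checkpoint, its config, and its vocab were all saved to one directory, here the ./tmp/swag_output directory used later in this thread):

from pytorch_pretrained_bert import BertTokenizer, BertForMaskedLM

# Load the tokenizer and the fine-tuned weights written by the
# multiple-choice fine-tuning run. The shared transformer ('bert.*')
# weights are re-used; the multiple-choice classifier in the checkpoint
# is ignored, and any head missing from it (here the masked-LM head)
# is freshly initialized, which the initialization log will report.
tokenizer = BertTokenizer.from_pretrained('./tmp/swag_output')
model = BertForMaskedLM.from_pretrained('./tmp/swag_output')
model.eval()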

XuhuiZhou commented 5 years ago

Hi! Thanks for answering me. This is what I did at first, and it resulted in the following (screenshot attached): as you can see, the output tensors are all zeros, which seems really weird!

Although this might happen, I still want to confirm that I am doing the right thing: I basically compute each masked word's probability, and some of those probabilities are zero, which makes the final sentence probability zero (screenshot attached).

LysandreJik commented 5 years ago

Could you share a code snippet that reproduces what you're trying to do so that I can try and see on my side?

XuhuiZhou commented 5 years ago

For sure!

import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel, BertForMaskedLM
import numpy as np
import math

# OPTIONAL: if you want to have more information on what's happening, activate the logger as follows
import logging
logging.basicConfig(level=logging.INFO)

def predict(text, bert_model, bert_tokenizer):
    # Tokenized input
    # text = "[CLS] I got restricted because Tom reported my reply [SEP]"
    text = "[CLS] " + text + " [SEP]"
    tokenized_text = bert_tokenizer.tokenize(text)
    # text = "[CLS] Stir the mixture until it is done [SEP]"
    # masked_index = 4
    sentence_prob = 1
    for masked_index in range(1,len(tokenized_text)-1):
        # Mask a token that we will try to predict back with `BertForMaskedLM`
        masked_word = tokenized_text[masked_index]
        #tokenized_text[masked_index] = '[MASK]'
        # assert tokenized_text == ['[CLS]', 'who', 'was', 'jim', 'henson', '?', '[SEP]', 'jim', '[MASK]', 'was', 'a', 'puppet', '##eer', '[SEP]']
        # print (tokenized_text)

        # Convert token to vocabulary indices
        indexed_tokens = bert_tokenizer.convert_tokens_to_ids(tokenized_text)
        # Define sentence A and B indices associated to 1st and 2nd sentences (see paper)
        # segments_ids = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
        length = len(tokenized_text)
        segments_ids = [0 for _ in range(length)]
        # Convert inputs to PyTorch tensors
        tokens_tensor = torch.tensor([indexed_tokens])
        segments_tensors = torch.tensor([segments_ids])

        # Load pre-trained model (weights)
        # bert_model = BertForMaskedLM.from_pretrained('bert-large-uncased')
        # bert_model.eval()

        # If you have a GPU, put everything on cuda
        tokens_tensor = tokens_tensor.to('cuda')
        segments_tensors = segments_tensors.to('cuda')
        bert_model.to('cuda')

        # Predict all tokens
        with torch.no_grad():
            predictions = bert_model(tokens_tensor, segments_tensors)

        predictions = torch.nn.functional.softmax(predictions, -1)

        index = bert_tokenizer.convert_tokens_to_ids([masked_word])[0]

        curr_prob = predictions[0, masked_index][index]

        if curr_prob.item()!=0:
            #print(curr_prob.item())
            sentence_prob *= curr_prob.item()
        # predict_list = predictions[0, masked_index]

        #tokenized_text[masked_index] = masked_word
    #return math.pow(sentence_prob, 1/(len(tokenized_text)-3))
    return sentence_prob

# Load pre-trained model tokenizer (vocabulary)
tokenizer = BertTokenizer.from_pretrained('./tmp/swag_output')
# Load pre-trained model (weights)
model = BertForMaskedLM.from_pretrained('./tmp/swag_output')
model.eval()

# prob = predict(sentence_1, bert_model=model, bert_tokenizer=tokenizer)

with open("Sentence4leyang.txt", "r") as f:
    file = f.readlines()

num = len(file)
count = 0
curr = 0
for i in file:
    label, sentence_1, sentence_2, sentence_3 = i.split("\001")

    print (label[0])
    prob_1 = predict(sentence_1, bert_model=model, bert_tokenizer=tokenizer)
    prob_2 = predict(sentence_2, bert_model=model, bert_tokenizer=tokenizer)
    prob_3 = predict(sentence_3, bert_model=model, bert_tokenizer=tokenizer)
    answer = max(prob_1, prob_2, prob_3)
    print(prob_1, prob_2, prob_3)

For the txt file, you can just create some sentences of your own to replace it. We used the weights obtained after fine-tuning BERT with the official run_swag.py example.
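
A hedged sketch of what such a file could look like (the file name and the '\001' separator come from the snippet above; the label and sentences are made up), with one example per line: the label followed by three candidate sentences.

# Create a dummy input file in the format the script expects:
# a label and three candidate sentences, separated by '\001'.
with open("Sentence4leyang.txt", "w") as f:
    f.write("\001".join([
        "1",
        "The chef stirs the mixture until it is smooth",
        "The chef stirs the mixture until it is lumpy",
        "The chef stirs the mixture until it is asleep",
    ]) + "\n")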

thomwolf commented 5 years ago

If you fine-tuned a BertForMultipleChoice and load it in BertForMaskedLM, some weights will be initialized randomly and will not have been trained.

This is indicated in this part of your output (screenshot attached).

If you use this model with untrained weights, you will get random output. You need to train these weights on a downstream task.

XuhuiZhou commented 5 years ago

Hi, thanks for the response @thomwolf. However, from my perspective, even if you use the vanilla bert-base-uncased model, BertForMaskedLM still runs perfectly without any random initialization. I assume BertForMultipleChoice is simply the original bert-base-uncased model with an additional linear classifier layer, so I think there should be a way to keep only the 'Bert model', without the linear layer, after fine-tuning. This feature could be really helpful for researchers investigating the transferability of the models.
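
A rough sketch of what this question is asking for (not something from_pretrained does automatically: it keeps the stock masked-LM head from bert-base-uncased and overwrites only the shared transformer weights with the ones fine-tuned on SWAG; the checkpoint path is the one from the snippet above):

import torch
from pytorch_pretrained_bert import BertForMaskedLM

# Start from the stock checkpoint so the masked-LM head is pretrained
model = BertForMaskedLM.from_pretrained('bert-base-uncased')

# Overwrite only the shared transformer ('bert.*') weights with the
# parameters fine-tuned by BertForMultipleChoice on SWAG
finetuned_state = torch.load('./tmp/swag_output/pytorch_model.bin', map_location='cpu')
encoder_only = {k: v for k, v in finetuned_state.items() if k.startswith('bert.')}
model.load_state_dict(encoder_only, strict=False)
model.eval()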

thomwolf commented 5 years ago

No, unfortunately not.

So, the model used for pretraining BERT, and the one we provide on our AWS S3 bucket, is BertForPreTraining, which has two heads: (i) the masked LM head and (ii) the next-sentence-prediction head.

BertForMaskedLM is a subset of BertForPreTraining that keeps only the first head => all of its weights are initialized with pretrained weights if you initialize it from the provided checkpoint, so you can use it out-of-the-box.

BertForMultipleChoice does NOT have a masked LM head; it has a multiple-choice head instead => if you train this model and then use it to initialize a BertForMaskedLM, you won't initialize the language model head.

If you don't remember which case you are in, just look at the log during model initialization. If it says "Weights from XXX not initialized from pretrained model", it means you have to train those weights before using the model.
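
Beyond reading the log, the two head architectures can also be compared directly through their parameter names (a small inspection sketch; it instantiates both models from the stock bert-base-uncased checkpoint and is not tied to any fine-tuned weights):

from pytorch_pretrained_bert import BertForMaskedLM, BertForMultipleChoice

# Instantiate both architectures from the same stock checkpoint and compare
# their parameter names: the difference is exactly the two different heads.
mlm = BertForMaskedLM.from_pretrained('bert-base-uncased')
mc = BertForMultipleChoice.from_pretrained('bert-base-uncased', num_choices=4)

mlm_keys = set(mlm.state_dict().keys())
mc_keys = set(mc.state_dict().keys())

print("Only in BertForMaskedLM (masked-LM head):", sorted(mlm_keys - mc_keys))
print("Only in BertForMultipleChoice (classifier):", sorted(mc_keys - mlm_keys))
print("Shared transformer parameters:", len(mlm_keys & mc_keys))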

thomwolf commented 5 years ago

We will make the documentation clearer on that.

For your specific use case, a solution could be to build a model yourself, similarly to the way the models are built in the library, keeping the language modeling head as well as the other heads you want, and then fine-tune the newly added head on your dataset.
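
As an illustration of that suggestion, here is a hedged sketch of such a combined model (the class name is made up; it follows the pattern of BertForMaskedLM and BertForMultipleChoice in pytorch_pretrained_bert and keeps the pretrained masked-LM head alongside a new multiple-choice classifier, so fine-tuning on a task like SWAG does not drop the LM head):

import torch.nn as nn
from pytorch_pretrained_bert.modeling import BertPreTrainedModel, BertModel, BertOnlyMLMHead

class BertForMaskedLMAndMultipleChoice(BertPreTrainedModel):
    """Hypothetical combined model: a masked-LM head plus a multiple-choice head."""
    def __init__(self, config, num_choices=2):
        super(BertForMaskedLMAndMultipleChoice, self).__init__(config)
        self.num_choices = num_choices
        self.bert = BertModel(config)
        # Masked-LM head, tied to the input embeddings as in BertForMaskedLM;
        # naming it 'cls' keeps the parameter names compatible with the
        # pretrained checkpoint, so these weights load from bert-base-uncased.
        self.cls = BertOnlyMLMHead(config, self.bert.embeddings.word_embeddings.weight)
        # Multiple-choice head, as in BertForMultipleChoice
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Linear(config.hidden_size, 1)
        self.apply(self.init_bert_weights)

    def forward(self, input_ids, token_type_ids=None, attention_mask=None, mode='mlm'):
        if mode == 'mlm':
            sequence_output, _ = self.bert(input_ids, token_type_ids, attention_mask,
                                           output_all_encoded_layers=False)
            return self.cls(sequence_output)  # [batch, seq_len, vocab_size]
        # multiple-choice mode: flatten the choices dimension as run_swag.py does
        flat_input_ids = input_ids.view(-1, input_ids.size(-1))
        flat_type_ids = token_type_ids.view(-1, token_type_ids.size(-1)) if token_type_ids is not None else None
        flat_mask = attention_mask.view(-1, attention_mask.size(-1)) if attention_mask is not None else None
        _, pooled_output = self.bert(flat_input_ids, flat_type_ids, flat_mask,
                                     output_all_encoded_layers=False)
        logits = self.classifier(self.dropout(pooled_output))
        return logits.view(-1, self.num_choices)  # [batch, num_choices]

# Usage sketch: fine-tune the multiple-choice head on SWAG with this class;
# the LM head stays in the checkpoint, so masked-LM scoring still works afterwards.
model = BertForMaskedLMAndMultipleChoice.from_pretrained('bert-base-uncased', num_choices=4)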

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.