huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

xlm-mlm-17-1280 model masked word prediction #1842

Closed ceatlinar closed 4 years ago

ceatlinar commented 4 years ago

Hi, I would like some help with using the pretrained xlm-mlm-17-1280 model for masked word prediction. I have followed http://mayhewsw.github.io/2019/01/16/can-bert-generate-text/ for BERT mask prediction and it works. Could you help me with how to use the xlm-mlm-17-1280 model for word prediction? I need predictions for Turkish, which is one of the model's 17 languages.

Bachstelze commented 4 years ago

Would it be possible to use XLM-R (#1769)? The paper gives a simple description of the model (Masked Language Models in chapter 3), and it is similar to BERT-Base apart from tokenization, training configuration, and language embeddings.
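
For illustration, a minimal sketch of that approach using the transformers XLM-RoBERTa classes (this assumes the xlm-roberta-base checkpoint, which covers Turkish; the Turkish sentence is just an example, not from this thread):

import torch
from transformers import XLMRobertaTokenizer, XLMRobertaForMaskedLM

tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")
model = XLMRobertaForMaskedLM.from_pretrained("xlm-roberta-base")
model.eval()

# "Today the weather is very <mask>." in Turkish (illustrative sentence)
sentence = "Bugün hava çok " + tokenizer.mask_token + "."
input_ids = torch.tensor([tokenizer.encode(sentence)])

# Locate the masked position
masked_index = torch.where(input_ids == tokenizer.mask_token_id)[1].tolist()[0]

# Take the five most likely replacements for the mask
with torch.no_grad():
    logits = model(input_ids)[0]
print(tokenizer.decode(logits[0, masked_index].topk(5).indices.tolist()))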

ceatlinar commented 4 years ago

Hi, thanks for the advice, but I don't know if the model you mentioned has a pretrained checkpoint for Turkish, and I need to use it for Turkish. Also, I really need to use the specific model I asked about. Any tips on how I could use that model for masked word prediction would be great. Thanks in advance.

Bachstelze commented 4 years ago

There are also multilingual, pretrained models for BERT, which we could try. Usually the quality decreases in large, multilingual models covering very different languages, but they mostly have the same architecture as bert-base, so we could try to rerun the linked example with the line modelpath = "bert-base-multilingual-cased".

ceatlinar commented 4 years ago

I get the following warning and error when trying modelpath = "bert-base-multilingual-cased". Sorry, I am not familiar with transformers, so it may be an easy error to fix, but I don't know how.

The pre-trained model you are loading is a cased model but you have not set do_lower_case to False. We are setting do_lower_case=False for you but you may want to check this behavior.

Traceback (most recent call last):
  File "e.py", line 13, in <module>
    masked_index = tokenized_text.index(target)
ValueError: 'hungry' is not in list

Bachstelze commented 4 years ago

'hungry' is in the list, but as two tokens since the multilingual model has a different vocabulary. Therefore, we have to tokenize the target word. Check this out:

#!/usr/bin/python3
#
# first Axiom: Aaron Swartz is everything
# second Axiom: The Schwartz Space is his description of physical location
# first conclusion: His linear symmetry is the Fourier transform
# second conclusion: His location is the Montel space
# Third conclusion: His location is the Fréchet space

import torch
from transformers import BertTokenizer, BertForMaskedLM

modelname = "bert-base-multilingual-cased"
tokenizer = BertTokenizer.from_pretrained(modelname)
# Load the pre-trained masked-LM model once, up front, and put it in eval mode
model = BertForMaskedLM.from_pretrained(modelname)
model.eval()

def predictMask(maskedText, masked_index):
    # Convert tokens to vocabulary indices
    indexed_tokens = tokenizer.convert_tokens_to_ids(maskedText)
    # Single input sentence, so every token belongs to segment 0 (see the BERT paper)
    segments_ids = [0] * len(maskedText)

    # Convert inputs to PyTorch tensors
    tokens_tensor = torch.tensor([indexed_tokens])
    segments_tensors = torch.tensor([segments_ids])
    # Predict all tokens; the masked-LM model was already loaded above,
    # and no_grad() skips gradient tracking during inference
    with torch.no_grad():
        predictions = model(tokens_tensor, token_type_ids=segments_tensors)
    predicted_index = torch.argmax(predictions[0][0][masked_index]).item()
    predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])

    print("Original:", text)
    print("Masked:", " ".join(maskedText))

    print("Predicted token:", predicted_token)
    maskedText[masked_index] = predicted_token[0]

    # delete this section for faster inference
    print("Other options:")
    # just curious about what the next few options look like.
    for i in range(10):
        predictions[0][0][masked_index][predicted_index] = -11100000
        predicted_index = torch.argmax(predictions[0][0][masked_index]).item()
        predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])
        print(predicted_token)

    print("Masked, tokenized text with the prediction:", maskedText)
    return maskedText

text = "let´s go fly a kite!"
target = "kite"
tokenized_text = tokenizer.tokenize(text)
tokenized_target = tokenizer.tokenize(target)
print("tokenized text:", tokenized_text)
print("tokenized target:", tokenized_target)

# Mask a token that we will try to predict back with `BertForMaskedLM`
masked_index = tokenized_text.index(tokenized_target[0])
for i in range(len(tokenized_target)):
    tokenized_text[masked_index+i] = '[MASK]'

for i in range(len(tokenized_target)):
    tokenized_text = predictMask(tokenized_text, masked_index+i)
ceatlinar commented 4 years ago

I tried the code, but it gives word-piece suggestions, not whole words, and the suggestions are poor. Thank you so much for your effort, but this is not useful for me unless I can somehow get whole-word suggestions. Also, I am still looking for an implementation that uses the XLM model for prediction; if anyone could help, that would be great.

Bachstelze commented 4 years ago

Don't the pieces build complete words in the end? Read my first answer about XLM-R; the mentioned model supports the Turkish language.
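
As a small illustration of how the pieces join back into a word (the pieces below are hypothetical, assuming BERT's "##" continuation convention):

pieces = ["hung", "##ry"]
# Drop the "##" continuation markers and concatenate the pieces
word = "".join(p[2:] if p.startswith("##") else p for p in pieces)
print(word)  # hungry

# The tokenizer can also do the joining directly:
# tokenizer.convert_tokens_to_string(pieces)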

LysandreJik commented 4 years ago

Hi, you can predict a masked word with XLM as you would do with any other MLM-based model. Here's an example using the checkpoint xlm-mlm-17-1280 you mentioned:

from transformers import XLMTokenizer, XLMWithLMHeadModel
import torch

# load tokenizer
tokenizer = XLMTokenizer.from_pretrained("xlm-mlm-17-1280")

# encode sentence with a masked token in the middle
sentence = torch.tensor([tokenizer.encode("This was the first time Nicolas ever saw a " + tokenizer.mask_token + ". It was huge.")])

# Identify the masked token position
masked_index = torch.where(sentence == tokenizer.mask_token_id)[1].tolist()[0]

# Load model
model = XLMWithLMHeadModel.from_pretrained("xlm-mlm-17-1280")

# Get the five top answers
result = model(sentence)
result = result[0][:, masked_index].topk(5).indices
result = result.tolist()[0]

print(tokenizer.decode(result))
# monster dragon snake wolf tiger
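
To get Turkish predictions, the same pattern should apply with this checkpoint; a minimal sketch (the Turkish sentence below is only an illustrative assumption):

# Encode a Turkish sentence with a masked token in the middle
sentence = torch.tensor([tokenizer.encode("Nicolas ilk kez bir " + tokenizer.mask_token + " gördü. Çok büyüktü.")])
masked_index = torch.where(sentence == tokenizer.mask_token_id)[1].tolist()[0]

# Get the five top answers for the mask
result = model(sentence)
result = result[0][:, masked_index].topk(5).indices
print(tokenizer.decode(result.tolist()[0]))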
ceatlinar commented 4 years ago

Thank you so much, guys, for the replies; they have been very helpful.