Closed ceatlinar closed 4 years ago
Would it be possible to use XLM-R (#1769)? Its model has a simple description (Masked Language Models, in chapter 3) and is similar to BERT-Base apart from tokenization, training configuration and language embeddings.
Hi, thanks for the advice, but I don't know whether the model you mentioned has a pretrained version for Turkish, and I need to use it for Turkish. I also specifically need to use the model I asked about for prediction. Any tips on how I could use that model for masked word prediction would be great. Thanks in advance.
There are also multilingual pretrained models for BERT, which we could try. Usually the quality decreases in large multilingual models that cover very different languages.
But they mostly have an architecture similar to bert-base, so we could try to rerun the linked example with the line modelpath = "bert-base-multilingual-cased".
I get the following warning and error when trying modelpath = "bert-base-multilingual-cased". Sorry, I am not familiar with transformers, so it may be an easy error to fix, but I don't know how:
The pre-trained model you are loading is a cased model but you have not set do_lower_case to False. We are setting do_lower_case=False for you but you may want to check this behavior.
Traceback (most recent call last):
File "e.py", line 13, in
'hungry' is in the list, but as two tokens since the multilingual model has a different vocabulary. Therefore, we have to tokenize the target word. Check this out:
#!/usr/bin/python3
#
# first Axiom: Aaron Swartz is everything
# second Axiom: The Schwartz Space is his description of physical location
# first conclusion: His linear symmetry is the Fourier transform
# second conclusion: His location is the Montel space
# Third conclusion: His location is the Fréchet space
import torch
from transformers import BertTokenizer, BertForMaskedLM

modelname = "bert-base-multilingual-cased"
tokenizer = BertTokenizer.from_pretrained(modelname)
# Load the pre-trained model (weights) once, with the masked-LM head
model = BertForMaskedLM.from_pretrained(modelname)
model.eval()


def predictMask(maskedText, masked_index):
    # Convert tokens to vocabulary indices
    indexed_tokens = tokenizer.convert_tokens_to_ids(maskedText)
    # Single-sentence input: every token belongs to sentence A (segment id 0)
    segments_ids = [0] * len(maskedText)
    # Convert inputs to PyTorch tensors
    tokens_tensor = torch.tensor([indexed_tokens])
    segments_tensors = torch.tensor([segments_ids])
    # Predict all tokens; pass segment ids by keyword, since the second
    # positional argument of the model is the attention mask
    with torch.no_grad():
        predictions = model(tokens_tensor, token_type_ids=segments_tensors)
    # Scores over the vocabulary for the masked position (copied so we can edit them)
    scores = predictions[0][0][masked_index].clone()
    predicted_index = torch.argmax(scores).item()
    predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])
    print("Original:", text)
    print("Masked:", " ".join(maskedText))
    print("Predicted token:", predicted_token)
    maskedText[masked_index] = predicted_token[0]
    # delete this section for faster inference
    print("Other options:")
    # just curious about what the next few options look like.
    for i in range(10):
        # Rule out the current best candidate, then take the next best one
        scores[predicted_index] = float("-inf")
        predicted_index = torch.argmax(scores).item()
        predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])
        print(predicted_token)
    print("Masked, tokenized text with the prediction:", maskedText)
    return maskedText


text = "let's go fly a kite!"
target = "kite"
tokenized_text = tokenizer.tokenize(text)
tokenized_target = tokenizer.tokenize(target)
print("tokenized text:", tokenized_text)
print("tokenized target:", tokenized_target)
# Mask the tokens that we will try to predict back with `BertForMaskedLM`
masked_index = tokenized_text.index(tokenized_target[0])
for i in range(len(tokenized_target)):
    tokenized_text[masked_index + i] = '[MASK]'
# Fill the masked positions one at a time, left to right
for i in range(len(tokenized_target)):
    tokenized_text = predictMask(tokenized_text, masked_index + i)
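One fragile spot in the script above is that it locates the target with tokenized_text.index(tokenized_target[0]), which only matches the first word piece and can land on an unrelated word that happens to start with the same piece. A minimal sketch of a stricter lookup follows; the helper names find_token_span and mask_span are made up for this example, and the token list is hand-written for illustration, not real tokenizer output.

```python
def find_token_span(tokens, target_tokens):
    """Return the start index of the first contiguous occurrence of
    target_tokens inside tokens, or -1 if it does not occur."""
    n, m = len(tokens), len(target_tokens)
    if m == 0:
        return -1
    for start in range(n - m + 1):
        if tokens[start:start + m] == target_tokens:
            return start
    return -1


def mask_span(tokens, start, length, mask_token="[MASK]"):
    """Return a copy of tokens with `length` positions from `start` masked."""
    return tokens[:start] + [mask_token] * length + tokens[start + length:]


# Hand-written word pieces (an assumption, not actual tokenizer output)
tokens = ["I", "am", "hu", "##ng", "##ry", "and", "hu", "##man"]
target = ["hu", "##man"]

start = find_token_span(tokens, target)
masked = mask_span(tokens, start, len(target))
print(start)
print(masked)
```

Here tokens.index("hu") would point at the first "hu" (part of "hungry"), while the span search only accepts a position where all pieces of the target match in order.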
I tried the code, but it gives word-piece suggestions, not whole words, and the suggestions are poor. Thank you so much for your effort, but this is not useful for me unless I can somehow get whole-word suggestions. Also, I am still looking for an implementation of the XLM model to get predictions; if anyone could help, that would be great.
Don't the pieces build complete words in the end? Read my first answer about XLM-R; the mentioned model supports the Turkish language.
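To make the point about pieces concrete: in BERT's WordPiece vocabulary, a token starting with "##" continues the previous token, so predicted pieces can be glued back into words. The sketch below does this in plain Python; merge_wordpieces is a hypothetical helper written for this thread, and the piece list is hand-written rather than taken from a real model run.

```python
def merge_wordpieces(tokens, prefix="##"):
    """Join WordPiece tokens into words: a token starting with `prefix`
    is appended to the previous word, anything else starts a new word."""
    words = []
    for tok in tokens:
        if tok.startswith(prefix) and words:
            words[-1] += tok[len(prefix):]
        else:
            words.append(tok)
    return words


# Hand-written example pieces (an assumption, not actual model output)
pieces = ["let", "'", "s", "go", "fly", "a", "ki", "##te", "!"]
print(merge_wordpieces(pieces))
# ['let', "'", 's', 'go', 'fly', 'a', 'kite', '!']
```

The BERT tokenizer in transformers also offers convert_tokens_to_string for the same purpose. Note that filling several masks one piece at a time is greedy, so the merged result is not guaranteed to be a real word.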
Hi, you can predict a masked word with XLM as you would do with any other MLM-based model. Here's an example using the checkpoint xlm-mlm-17-1280 you mentioned:
from transformers import XLMTokenizer, XLMWithLMHeadModel
import torch
# load tokenizer
tokenizer = XLMTokenizer.from_pretrained("xlm-mlm-17-1280")
# encode sentence with a masked token in the middle
sentence = torch.tensor([tokenizer.encode("This was the first time Nicolas ever saw a " + tokenizer.mask_token + ". It was huge.")])
# Identify the masked token position
masked_index = torch.where(sentence == tokenizer.mask_token_id)[1].tolist()[0]
# Load model
model = XLMWithLMHeadModel.from_pretrained("xlm-mlm-17-1280")
# Get the five top answers
result = model(sentence)
result = result[0][:, masked_index].topk(5).indices
result = result.tolist()[0]
print(tokenizer.decode(result))
# monster dragon snake wolf tiger
Thank you so much guys for the replies, they have been very helpful.
Hi, I would like some help with how to use the pretrained xlm-mlm-17-1280 model for masked word prediction. I have followed http://mayhewsw.github.io/2019/01/16/can-bert-generate-text/ for BERT mask prediction and it is working. Could you help me with how to use the xlm-mlm-17-1280 model for word prediction? I need predictions for Turkish, which is one of the model's 17 languages.