I solved one of the problems by loading the model another way, described below, but it still works much worse than BERT.
import torch
from transformers import XLNetTokenizer, XLNetLMHeadModel  # or pytorch_transformers, the older package name

tokenizer = XLNetTokenizer.from_pretrained("xlnet-large-cased")
model = XLNetLMHeadModel.from_pretrained("xlnet-large-cased")
model.eval()
device = "cuda" if torch.cuda.is_available() else "cpu"  # use a GPU if we have one
model.to(device)

target_id = 5  # position of <mask> in the tokenized sentence
input_ids = torch.tensor(tokenizer.encode("I believe my sister is <mask> because she eats a lot of vegetables .")).unsqueeze(0)  # we will predict the masked token

perm_mask = torch.zeros((1, input_ids.shape[1], input_ids.shape[1]), dtype=torch.float)
perm_mask[:, :, target_id] = 1.0  # no token may attend to the masked position
target_mapping = torch.zeros((1, 1, input_ids.shape[1]), dtype=torch.float)  # shape [1, 1, seq_length] => predict one token
target_mapping[0, 0, target_id] = 1.0  # our first (and only) prediction is the masked token

input_ids_tensor = input_ids.to(device)
perm_mask_tensor = perm_mask.to(device)
target_mapping_tensor = target_mapping.to(device)

with torch.no_grad():
    predictions = model(input_ids_tensor, perm_mask=perm_mask_tensor, target_mapping=target_mapping_tensor)

predicted_k_indexes = torch.topk(predictions[0][0][0], k=10)  # logits for the single predicted position
predicted_logits_list = predicted_k_indexes[0]
predicted_indexes_list = predicted_k_indexes[1]

print("predicted word:", tokenizer.decode([input_ids[0][target_id].item()]))
for i, item in enumerate(predicted_indexes_list):
    the_index = item.item()
    print("word and logits", tokenizer.decode([the_index]), predicted_logits_list[i].item())
But the output is not very good; I believe BERT is better. I hope this is the correct code to predict a masked word inside a sentence.
I am not sure if this line should be any different:
perm_mask[:, :, target_id] = 1.0  # no token may attend to the masked position
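For reference, here is how I read the two extra tensors, based on the documented semantics (perm_mask[k, i, j] = 1 means token i does not attend to token j, and target_mapping[k, i, j] = 1 means the i-th prediction is made for token j). The helper below is only an illustration of those rules, not code from the repo:

import torch

def build_mask_inputs(input_ids, masked_position):
    # perm_mask[b, i, j] == 1.0 -> token i is NOT allowed to attend to token j.
    # Setting the whole column at masked_position hides the target token from
    # every other token, so the model must predict it from context alone.
    seq_len = input_ids.shape[1]
    perm_mask = torch.zeros((1, seq_len, seq_len), dtype=torch.float)
    perm_mask[:, :, masked_position] = 1.0

    # target_mapping[b, k, j] == 1.0 -> the k-th prediction is made for token j.
    # A single row asks for exactly one prediction: the masked position.
    target_mapping = torch.zeros((1, 1, seq_len), dtype=torch.float)
    target_mapping[0, 0, masked_position] = 1.0
    return perm_mask, target_mapping

As far as I can tell, the line itself is fine; it is the comment that was misleading, since it was copied from a docs example where the target happened to be the last token.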
output:
sentence: "I believe my sister is <mask> because she is a blonde ."
predicted word: <mask>
word and logits is -30.468482971191406
word and logits the -33.0710334777832
word and logits was -34.586158752441406
word and logits because -34.74900436401367
word and logits in -34.762718200683594
word and logits that -34.86489486694336
word and logits but -34.97043991088867
word and logits and -35.04599380493164
word and logits if -35.07524108886719
word and logits not -35.1640510559082
When I do not use perm_mask and call only:
predictions = model(input_ids_tensor, target_mapping=target_mapping_tensor)
I get better, but still quite bad, results; it is at least interesting.
sentence: "I believe my sister is <mask> because she is a blonde ."
predicted word: <mask>
word and logits Colombian 25.14841651916504
word and logits a 25.1247615814209
word and logits the 25.11375617980957
word and logits Venezuelan 25.041296005249023
word and logits I 24.912843704223633
word and logits Beyonce 24.855722427368164
word and logits Jessica 24.557470321655273
word and logits in 24.518535614013672
word and logits paranoid 24.407917022705078
word and logits not 24.374282836914062
With BERT base you get much better output, which makes much more sense (mainly adjectives):
[('beautiful', 7.622010231018066), ('attractive', 6.6926116943359375), ('special', 6.309513568878174), ('crazy', 6.045520782470703), ('pretty', 5.968326091766357), ('lucky', 5.951317310333252), ('famous', 5.942074775695801), ('different', 5.920231819152832), ('gorgeous', 5.897611141204834), ('blonde', 5.834926605224609)]
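For reference, a minimal sketch of how such a BERT fill-in-the-blank comparison can be run (the exact checkpoint, bert-base-uncased, and the top-k size are my assumptions, not taken from the post above):

import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

text = "I believe my sister is [MASK] because she is a blonde ."
input_ids = torch.tensor([tokenizer.encode(text, add_special_tokens=True)])
mask_position = input_ids[0].tolist().index(tokenizer.mask_token_id)

with torch.no_grad():
    logits = model(input_ids)[0]  # shape: (1, seq_len, vocab_size)

top_logits, top_ids = torch.topk(logits[0, mask_position], k=10)
for logit, token_id in zip(top_logits.tolist(), top_ids.tolist()):
    print(tokenizer.decode([token_id]), logit)

Note that BERT is pretrained with exactly this [MASK] token, so no perm_mask or target_mapping is needed for it.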
I also did a comparison with BERT, so far with just one example, and I found that BERT is much, much better. I am not sure why that is ... but there must be a reason.
Agreed! There is a chance we are not using the permutation mask and target mapping correctly, but I doubt that is the whole story, since the documentation's example does not work very well either.
The main reason you get bad performance is that XLNet is not good on short inputs (this comes from the way it is pretrained: it always has a long memory and only guesses a few words in the sequence).
The run_generation example here will show you how to get better performance by adding a random text as an initiator. Aman Rusia also wrote a blog post about that here. We are using his solution in the run_generation example.
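The trick, as I understand it from run_generation and Aman Rusia's post, is simply to prepend a long piece of unrelated text so XLNet has a realistic amount of left context. A hedged sketch (the stand-in padding string and variable names are mine; the actual script uses a longer, fluent paragraph as padding):

import torch
from transformers import XLNetTokenizer, XLNetLMHeadModel

tokenizer = XLNetTokenizer.from_pretrained("xlnet-large-cased")
model = XLNetLMHeadModel.from_pretrained("xlnet-large-cased")
model.eval()

# Stand-in padding: any long, fluent English text should work at least as well.
PADDING_TEXT = "The quick brown fox jumps over the lazy dog. " * 40 + "<eod>"

text = "I believe my sister is <mask> because she eats a lot of vegetables ."
input_ids = torch.tensor([tokenizer.encode(PADDING_TEXT + " " + text, add_special_tokens=False)])
mask_position = input_ids[0].tolist().index(tokenizer.mask_token_id)

seq_len = input_ids.shape[1]
perm_mask = torch.zeros((1, seq_len, seq_len), dtype=torch.float)
perm_mask[:, :, mask_position] = 1.0       # nobody attends to the masked token
target_mapping = torch.zeros((1, 1, seq_len), dtype=torch.float)
target_mapping[0, 0, mask_position] = 1.0  # predict only that position

with torch.no_grad():
    logits = model(input_ids, perm_mask=perm_mask, target_mapping=target_mapping)[0]

top_logits, top_ids = torch.topk(logits[0, 0], k=10)
for logit, token_id in zip(top_logits.tolist(), top_ids.tolist()):
    print(tokenizer.decode([token_id]), logit)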
Thanks, I am going to try the generation method and post the results here. I hope the predictions will improve, but I guess that adding a lot of padding is going to slow the execution down a lot.
@Oxi84 Any luck with your results? I still get pretty random results even when using this trick.
Thank you for the suggestions.
After adding the padding text, the results are much more reasonable, both for predicting a masked token in the middle of a sentence and for text generation.
Some test samples:
Input
text = 'The quick brown fox jumps <mask> the lazy dog.'
Output
The quick brown fox jumps above the lazy dog.
The quick brown fox jumps across the lazy dog.
Input
text = 'The <mask> brown fox jumps over the lazy dog.'
Output
The rapid brown fox jumps over the lazy dog.
The slow brown fox jumps over the lazy dog.
Hey guys,
Q1) Can someone give some more insight into what @thomwolf is explaining here? ''' https://github.com/huggingface/transformers/issues/846 The main reason you get bad performance is that XLNet is not good on short inputs (this comes from the way it is pretrained: it always has a long memory and only guesses a few words in the sequence). The run_generation example here will show you how to get better performance by adding a random text as an initiator. Aman Rusia also wrote a blog post about that here. We are using his solution in the run_generation example. ''' I can't understand the difference in the way BERT and XLNetLMHead work for the LM-head task. Don't both models have a disadvantage if the sentence is short?
It seems he is saying XLNet has a huge disadvantage on short input sentences while BERT does not (or has less of one). Any detailed explanation would be useful!
Q2)
Also, I don't get the point of adding extra padding or random padding text to improve the XLNetLMHead model. Any snippet or explanation would be appreciated too... (I saw the link but could not fully understand it). I experimented by just adding extra strings of the line: 'I believe my sister is
Q3) https://github.com/huggingface/transformers/issues/846#issuecomment-513514039 Lastly, why do we get a better result when we don't use perm_mask? The response linked above shows that leaving out the perm_mask option gives at least a somewhat better result... But isn't perm_mask supposed to help get better predictions, and isn't it what the authors of the paper used for SOTA?
Doesn't perm_mask keep the model from seeing the next tokens?
Many thanks in advance!
I think these questions are not directly related to this repo. Maybe you should check out the paper or ask on Quora or ResearchGate.
I followed the example here: https://huggingface.co/pytorch-transformers/model_doc/xlnet.html#pytorch_transformers.XLNetModel
I found that I get completely wrong output; I mean the predicted words for the masked sentence are completely irrelevant and they change on each run. I guess there is some bug, could you please take a look at this:
code:
output (one example - it changes each run):
input_ids tensor([[ 17, 11368, 19, 94, 2288, 27, 172, 6]])
predicted words:
word and logits emptiness 2.7753820419311523
word and logits Oklahoma 2.61531400680542
word and logits stars 2.56619930267334
word and logits bite 2.5252184867858887
word and logits Conte 2.4745044708251953
word and logits enforced 2.4537196159362793
word and logits antibody 2.4416041374206543
word and logits Got 2.332545280456543
word and logits Chev 2.31380033493042
word and logits MAG 2.3047127723693848