huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

XLNET completely wrong and random output #846

Closed Oxi84 closed 4 years ago

Oxi84 commented 5 years ago

I followed the example here: https://huggingface.co/pytorch-transformers/model_doc/xlnet.html#pytorch_transformers.XLNetModel

I found that I get completely wrong output: the predicted words for the masked sentences are completely irrelevant and they change on each run. I guess there is some bug, could you please take a look at this:

code: ##############################

import torch
from pytorch_transformers import XLNetConfig, XLNetTokenizer, XLNetLMHeadModel

config = XLNetConfig.from_pretrained('xlnet-large-cased')
tokenizer = XLNetTokenizer.from_pretrained('xlnet-large-cased')
model = XLNetLMHeadModel(config)
input_ids = torch.tensor(tokenizer.encode("Hello, my dog is very <mask> ")).unsqueeze(0)  # We will predict the masked token
print("input_ids", input_ids)
perm_mask = torch.zeros((1, input_ids.shape[1], input_ids.shape[1]), dtype=torch.float)
perm_mask[:, :, -1] = 1.0  # No token may attend to the last (masked) token
target_mapping = torch.zeros((1, 1, input_ids.shape[1]), dtype=torch.float)  # Shape [1, 1, seq_length] => let's predict one token
target_mapping[0, 0, -1] = 1.0
predictions = model(input_ids, perm_mask=perm_mask, target_mapping=target_mapping)

predicted_k_indexes = torch.topk(predictions[0],k=10)
predicted_logits_list = predicted_k_indexes[0] 
predicted_indexes_list = predicted_k_indexes[1]    

print ("predicted <masked> words:")
for i,item  in enumerate(predicted_indexes_list[0][0]):
    the_index = predicted_indexes_list[0][0][i].item()
    print("word and logits",tokenizer.decode(the_index),predicted_logits_list[0][0][i].item())

###########################

output (one example - it changes each run):

#################################

input_ids tensor([[ 17, 11368, 19, 94, 2288, 27, 172, 6]])
predicted <masked> words:

word and logits emptiness 2.7753820419311523
word and logits Oklahoma 2.61531400680542
word and logits stars 2.56619930267334
word and logits bite 2.5252184867858887
word and logits Conte 2.4745044708251953
word and logits enforced 2.4537196159362793
word and logits antibody 2.4416041374206543
word and logits Got 2.332545280456543
word and logits Chev 2.31380033493042
word and logits MAG 2.3047127723693848

####################################

Oxi84 commented 5 years ago

I solved one of the problems by loading the model another way, as described below, but it still works much worse than BERT.

import torch
from pytorch_transformers import XLNetTokenizer, XLNetLMHeadModel

tokenizer = XLNetTokenizer.from_pretrained("xlnet-large-cased")
model = XLNetLMHeadModel.from_pretrained("xlnet-large-cased")
model.eval()
device = "cuda" if torch.cuda.is_available() else "cpu"  # use a GPU if we have one
model.to(device)

target_id = 5
input_ids = torch.tensor(tokenizer.encode("I believe my sister is <mask> because she eats a lot of vegetables .")).unsqueeze(0)  # We will predict the masked token
perm_mask = torch.zeros((1, input_ids.shape[1], input_ids.shape[1]), dtype=torch.float)
perm_mask[:, :, target_id] = 1.0  # No token may attend to the target token
target_mapping = torch.zeros((1, 1, input_ids.shape[1]), dtype=torch.float)  # Shape [1, 1, seq_length] => let's predict one token
target_mapping[0, 0, target_id] = 1.0  # Our one (and only) prediction is the token at target_id (the masked token)

input_ids_tensor = input_ids.to(device)
target_mapping_tensor = target_mapping.to(device)
perm_mask_tensor = perm_mask.to(device)

with torch.no_grad():
    predictions = model(input_ids_tensor, perm_mask=perm_mask_tensor, target_mapping=target_mapping_tensor)

predicted_k_indexes = torch.topk(predictions[0][0][0],k=10)
predicted_logits_list = predicted_k_indexes[0] 
predicted_indexes_list = predicted_k_indexes[1] 

print ("predicted word:",tokenizer.decode(input_ids[0][target_id].item()))
for i,item  in enumerate(predicted_indexes_list):
    the_index = predicted_indexes_list[i].item()
    print("word and logits",tokenizer.decode(the_index),predicted_logits_list[i].item())

But the output is not so good; I believe BERT is better. I hope this is the correct code to predict a masked word inside a sentence.

I am not sure if this line should be any different:

 perm_mask[:, :, target_id] = 1.0  # No token may attend to the target token

output:

 sentence: "I believe my sister is <mask> because she is a blonde ."
 predicted word: <mask>
 word and logits is -30.468482971191406
 word and logits the -33.0710334777832
 word and logits was -34.586158752441406
 word and logits because -34.74900436401367
 word and logits in -34.762718200683594
 word and logits that -34.86489486694336
 word and logits but -34.97043991088867
 word and logits and -35.04599380493164
 word and logits if -35.07524108886719
 word and logits not -35.1640510559082

When I do not use perm_mask and call only:

  predictions = model(input_ids_tensor, target_mapping=target_mapping_tensor)

I get better, but still quite bad, results; it is at least interesting.

 sentence: "I believe my sister is <mask> because she is a blonde ."
predicted word: <mask>
word and logits Colombian 25.14841651916504
word and logits a 25.1247615814209
word and logits the 25.11375617980957
word and logits Venezuelan 25.041296005249023
word and logits I 24.912843704223633
word and logits Beyonce 24.855722427368164
word and logits Jessica 24.557470321655273
word and logits in 24.518535614013672
word and logits paranoid 24.407917022705078
word and logits not 24.374282836914062

With BERT base you get much better output, which makes much more sense (mainly adjectives):

  [('beautiful', 7.622010231018066), ('attractive', 6.6926116943359375), ('special', 6.309513568878174), ('crazy', 6.045520782470703), ('pretty', 5.968326091766357), ('lucky', 5.951317310333252), ('famous', 5.942074775695801), ('different', 5.920231819152832), ('gorgeous', 5.897611141204834), ('blonde', 5.834926605224609)]
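
For reference, here is a minimal sketch of how numbers like these could be produced with BERT's masked LM head (assuming pytorch-transformers' BertForMaskedLM; this is only an illustrative sketch, not necessarily the exact script used for the comparison above):

import torch
from pytorch_transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

# Same test sentence as above, in BERT's input format
tokens = tokenizer.tokenize("[CLS] i believe my sister is [MASK] because she is a blonde . [SEP]")
input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])
mask_pos = tokens.index("[MASK]")

with torch.no_grad():
    logits = model(input_ids)[0]  # shape: [1, seq_len, vocab_size]

values, indices = torch.topk(logits[0, mask_pos], k=10)
print([(tokenizer.convert_ids_to_tokens([i.item()])[0], v.item())
       for i, v in zip(indices, values)])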
Oxi84 commented 5 years ago

I also did a comparison with BERT, so far just one example, but I found that BERT is much, much better. I am not sure why that is... but there must be a reason.

domaala commented 5 years ago

Agreed! There is a chance we are not using the permutation mask and target mapping correctly, but I am suspicious as the documentation's example is not working very well either.

thomwolf commented 5 years ago

The main reason you get bad performance is that XLNet is not good on short inputs (this comes from the way it is pretrained: always having a long memory and only guessing a few words in the sequence).

The run_generation example here will show you how to get better performance by adding a random text as an initiator.

Aman Rusia also wrote a blog post about that here. We are using his solution in the run_generation example.
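
For illustration, a minimal sketch of that padding-text trick applied to masked-word prediction (the padding paragraph below is arbitrary filler text I made up, not a constant from the library, and this is a hedged sketch rather than the run_generation script itself):

import torch
from pytorch_transformers import XLNetTokenizer, XLNetLMHeadModel

# Any long, well-formed paragraph works as padding; <eod> marks end of document.
PADDING_TEXT = ("This is some arbitrary but fluent English text used only to give "
                "XLNet a long context to attend to, since it was pretrained with "
                "long memories and only a few prediction targets per sequence. <eod>")

tokenizer = XLNetTokenizer.from_pretrained("xlnet-large-cased")
model = XLNetLMHeadModel.from_pretrained("xlnet-large-cased")
model.eval()

text = "I believe my sister is <mask> because she eats a lot of vegetables ."
input_ids = torch.tensor(tokenizer.encode(PADDING_TEXT + " " + text)).unsqueeze(0)

# Locate the masked position, which is now offset by the padding text
mask_id = tokenizer.convert_tokens_to_ids(tokenizer.mask_token)
target_id = (input_ids[0] == mask_id).nonzero()[0].item()

perm_mask = torch.zeros((1, input_ids.shape[1], input_ids.shape[1]), dtype=torch.float)
perm_mask[:, :, target_id] = 1.0       # no token may attend to the masked position
target_mapping = torch.zeros((1, 1, input_ids.shape[1]), dtype=torch.float)
target_mapping[0, 0, target_id] = 1.0  # predict only that position

with torch.no_grad():
    predictions = model(input_ids, perm_mask=perm_mask, target_mapping=target_mapping)

values, indices = torch.topk(predictions[0][0][0], k=10)
print([(tokenizer.decode(i.item()), v.item()) for i, v in zip(indices, values)])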

Oxi84 commented 5 years ago

Thanks, I am going to try the generation method and post the results here. I hope the predictions improve, but I guess adding a lot of padding will slow the execution down a lot.

cpcdoy commented 5 years ago

@Oxi84 Any luck with your results? I still get pretty random results even when using this trick.

makcedward commented 5 years ago

Thank you for the suggestions.

After adding padding text, the results are much more reasonable, both for predicting a masked token in the middle of a sentence and for text generation.

Some testing samples:

Input text = 'The quick brown fox jumps <mask> the lazy dog.'
Output:

The quick brown fox jumps above the lazy dog.
The quick brown fox jumps across the lazy dog.

Input text = 'The <mask> brown fox jumps over the lazy dog.'
Output:

The rapid brown fox jumps over the lazy dog.
The slow brown fox jumps over the lazy dog.
muiPomeranian commented 5 years ago

hey guys,

Q1) Can someone give some more insight into what @thomwolf is explaining here?

'''
https://github.com/huggingface/transformers/issues/846
The main reason you get bad performance is that XLNet is not good on short inputs (this comes from the way it is pretrained: always having a long memory and only guessing a few words in the sequence).
The run_generation example here will show you how to get better performance by adding a random text as an initiator. Aman Rusia also wrote a blog post about that here. We are using his solution in the run_generation example.
'''

I can't understand the difference in the way BERT and XLNetLMHead work for the LM head task. Don't both models have a disadvantage on short sentences?

It seems he is saying XLNet has a huge disadvantage on short input sentences while BERT does not (or has less of a disadvantage). Any detailed explanation would be useful!

Q2) Also, I can't get the point of adding extra padding or random padding text to improve the XLNetLMHead model. Any snippet or explanation would be appreciated too... (I saw the link but could not fully understand it). I experimented by just adding an extra string to the line 'I believe my sister is because she is a blonde ' + ' ' and it gives a much better result than not having it at the end....

Q3) https://github.com/huggingface/transformers/issues/846#issuecomment-513514039 Lastly, why do we get better results when we don't use perm_mask? The comment linked above shows that leaving out the perm_mask option gives at least somewhat better results... But isn't perm_mask supposed to help get better predictions, and isn't it what the authors of the paper used for SOTA?

Doesn't perm_mask keep the model from seeing the following tokens in the given input while still letting it see the previous tokens? According to the paper and the original code, if the permutation order is 3->4->1->2 and the masked positions are 1 and 3, then the model cannot see masked token 1 when it tries to predict masked token 3, but the reverse is possible (see the sketch after this comment).

Many thanks in advance!
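
For what it's worth, here is a minimal sketch of that convention (perm_mask_from_order is a hypothetical helper, not a library function; it assumes the documented convention that perm_mask[k, i, j] = 1 means token i may not attend to token j in batch k):

import torch

def perm_mask_from_order(order, seq_len):
    # order: token positions in factorization order, e.g. [2, 3, 0, 1] for 3->4->1->2 (0-indexed).
    # Returns a [1, seq_len, seq_len] mask where mask[0, i, j] = 1 means
    # token i may NOT attend to token j.
    rank = {pos: r for r, pos in enumerate(order)}
    mask = torch.zeros((1, seq_len, seq_len), dtype=torch.float)
    for i in range(seq_len):
        for j in range(seq_len):
            # A token may only see tokens that come strictly earlier in the factorization order.
            if rank[j] >= rank[i]:
                mask[0, i, j] = 1.0
    return mask

# With order 3->4->1->2, position 1 cannot be seen when predicting position 3,
# but position 3 can be seen when predicting position 1.
print(perm_mask_from_order([2, 3, 0, 1], 4))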

iedmrc commented 4 years ago

I think these questions are not directly related to this repo. Maybe you should check out the paper, or ask on Quora or ResearchGate.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.