huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Misalignment between documentation and implementation of mBART50 tokenisation for the decoder #19500

Closed devaansh100 closed 1 year ago

devaansh100 commented 2 years ago

System Info

Who can help?

@patil-suraj @SaulLu

Information

Tasks

Reproduction

The bug has been reproduced in the outputs of this colab notebook. The steps to reproduce it are the following (a standalone sketch that shows the same output without patching the library is included after the steps):

  1. Make a copy of the notebook.
  2. Execute the first 2 cells.
  3. In the source file for mBART (/usr/local/bin/python3.7/dist-packages/transformers/models/mbart/modeling_mbart.py), on line 1352 (above outputs = self.model(..., after the if labels is not None block), add print(f'Decoder Input Ids: {decoder_input_ids}\nLabels: {labels}').
  4. Restart the runtime for the changes to the library to take effect.
  5. Run the third cell. The output is:
    Decoder Input Ids: tensor([[     2, 250020,  47711,   7844, 127666,      8,  18347,  18147,   1362,
            315,  42071,     36,  31563,   8454,  33796,    451,    346, 125577]])
    Labels: tensor([[250020,  47711,   7844, 127666,      8,  18347,  18147,   1362,    315,
          42071,     36,  31563,   8454,  33796,    451,    346, 125577,      2]])
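
For reference, here is a minimal standalone sketch that shows the same decoder_input_ids/labels pattern without patching modeling_mbart.py, by calling the same shift_tokens_right helper the model applies internally (the example sentences are the English/Romanian pair used later in this thread; any target sentence reproduces the pattern):

from transformers import MBart50TokenizerFast
from transformers.models.mbart.modeling_mbart import shift_tokens_right

tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50", src_lang="en_XX", tgt_lang="ro_RO")
batch = tokenizer(
    " UN Chief Says There Is No Military Solution in Syria",
    text_target="Şeful ONU declară că nu există o soluţie militară în Siria",
    return_tensors="pt",
)
labels = batch["labels"]
# When only labels are passed, MBartForConditionalGeneration builds the
# decoder_input_ids by shifting the labels to the right:
decoder_input_ids = shift_tokens_right(labels, tokenizer.pad_token_id)
print(f"Decoder Input Ids: {decoder_input_ids}\nLabels: {labels}")
# The decoder input comes out as [</s>, ro_RO, tokens...] instead of the documented
# [ro_RO, tokens..., </s>]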

Expected behavior

I was looking into fine-tuning facebook/mbart-large-50 through this example in the documentation. As per the description, the expected input for the model is of the form [lang_id] tokens [eos] for both the encoder and the decoder.

While the MBart50Tokenizer produces outputs in the expected format, the decoder_input_ids get transformed into an incorrect one - [eos] [lang_id] tokens. Specifically, I believe the output should have been the following (do correct me if I am wrong here, though):

Decoder Input Ids: tensor([[   250020,  47711,   7844, 127666,      8,  18347,  18147,   1362,
            315,  42071,     36,  31563,   8454,  33796,    451,    346, 125577, 2]])
Labels: tensor([[47711,   7844, 127666,      8,  18347,  18147,   1362,    315,
          42071,     36,  31563,   8454,  33796,    451,    346, 125577,      2,  250020]])

This happens because the shift_tokens_right function does not seem to be adapted for mBART-50. As per the docstring of this function,

wrap the last non pad token (the [LID] token)

however, for mBART-50, the last non-pad token is an [eos].
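
To make the mismatch concrete, here is a toy sketch (the token ids are made up, except 2 = </s>, 250020 = ro_RO and 1 = pad) of what shift_tokens_right produces for each label layout:

import torch
from transformers.models.mbart.modeling_mbart import shift_tokens_right

pad_token_id = 1
mbart_labels = torch.tensor([[10, 11, 12, 2, 250020]])    # tokens </s> [LID]  (mBART layout)
mbart50_labels = torch.tensor([[250020, 10, 11, 12, 2]])  # [LID] tokens </s>  (mBART-50 layout)

# The function wraps the *last non-pad token* to the front, whatever it is:
print(shift_tokens_right(mbart_labels, pad_token_id))    # tensor([[250020, 10, 11, 12, 2]]) - the [LID], as intended
print(shift_tokens_right(mbart50_labels, pad_token_id))  # tensor([[2, 250020, 10, 11, 12]]) - the </s> instead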

Additional question: why should the [eos] token predict the [lang_id]? This happens in both mBART and mBART-50. If it should not, should the last token in the labels be -100? If so, there would be a subsequent issue, since the labels matrix from the tokenizer seems to use 1 as the padding token instead of -100. Do let me know if I should open a separate issue for that!

If this bug seems legitimate, I would be glad to provide a fix! I believe the labels key from MBart50Tokenizer would have to be updated to give the same output as the MBartTokenizer.

LysandreJik commented 2 years ago

@ArthurZucker, when you have bandwidth, would you like to take a look at this?

devaansh100 commented 1 year ago

Not stale, still looking forward to a response!

ArthurZucker commented 1 year ago

Hey! This is very similar to #18133. First, I was not really able to reproduce the bug as the output of

from transformers import MBart50TokenizerFast

tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50", src_lang="en_XX", tgt_lang="ro_RO")

src_text = " UN Chief Says There Is No Military Solution in Syria"
tgt_text = "Şeful ONU declară că nu există o soluţie militară în Siria"

model_inputs = tokenizer(src_text, text_target=tgt_text, return_tensors="pt").input_ids

gave:

tensor([[250004,   8274, 127873,  25916,      7,   8622,   2071,    438,  67485,
             53, 187895,     23,  51712,      2]])

But in any case, I think that you are right: the documentation for both models is not aligned, as the input is not shifted in the tokenizer but rather in the model. This was already mentioned, so we might as well address it!

devaansh100 commented 1 year ago

Hey! While the output of the tokenizer is correct (both input_ids and labels are in the same format), the labels are going to pass through shift_tokens_right to create the decoder_input_ids.

shift_tokens_right expects the LID token at the end; however, MBart50Tokenizer puts an EOS token there, so the input to the decoder ends up being wrong.

Regarding the reproduction of the issue - do you mean reproducing the referenced issue? If so, that one uses the MBartTokenizer, while the code mentioned here uses the MBart50Tokenizer.

ArthurZucker commented 1 year ago

I'll come back to this soon 😉

ArthurZucker commented 1 year ago

It seems that this has been confusing a lot of people (including me, see #20610).

Let's work with your example:

We are interested in supervised training, where you feed the model input_ids and labels. For most encoder-decoder models, the labels are shifted to the right to build the decoder_input_ids, so that the model learns to predict the next token.

This means that if the decoder_input_ids are

[ 2, 250020,  47711,   7844, 127666,      8,  18347,  18147,   1362, 315,  42071,     36,  31563,   8454,  33796,    451,    346, 125577]

Then the model (if it is a perfect model) predicts

[250020,  47711,   7844, 127666,      8,  18347,  18147,   1362, 315,  42071,     36,  31563,   8454,  33796,    451,    346, 125577, 2]

This prediction is then compared with the labels when computing the loss. That is also why, when you generate (inference), you force the beginning token to be </s>.
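
A small illustrative sketch of that alignment, using the first few ids from the tensors above (position i of decoder_input_ids is used to predict position i of labels):

decoder_input_ids = [2, 250020, 47711, 7844, 127666]
labels = [250020, 47711, 7844, 127666, 8]
for given, target in zip(decoder_input_ids, labels):
    print(f"given {given:>6}, the model is trained to predict {target}")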

LoicGrobol commented 1 year ago

Thanks for the clarification, @ArthurZucker! It still seems a bit wrong to expect the model to predict ro_RO given </s>. The comment "wrap the last non pad token (the <LID> token)" in the code is also somewhat confusing!

devaansh100 commented 1 year ago

I agree with @LoicGrobol.

I also want to clarify this example from the docs:

from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

article_hi = "संयुक्त राष्ट्र के प्रमुख का कहना है कि सीरिया में कोई सैन्य समाधान नहीं है"
article_ar = "الأمين العام للأمم المتحدة يقول إنه لا يوجد حل عسكري في سوريا."

model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")

# translate Hindi to French
tokenizer.src_lang = "hi_IN"
encoded_hi = tokenizer(article_hi, return_tensors="pt")
generated_tokens = model.generate(**encoded_hi, forced_bos_token_id=tokenizer.lang_code_to_id["fr_XX"])
tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
# => "Le chef de l 'ONU affirme qu 'il n 'y a pas de solution militaire en Syria."

# translate Arabic to English
tokenizer.src_lang = "ar_AR"
encoded_ar = tokenizer(article_ar, return_tensors="pt")
generated_tokens = model.generate(**encoded_ar, forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"])
tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
# => "The Secretary-General of the United Nations says there is no military solution in Syria."

That is also why, when you generate (inference), you force the beginning token to be </s>.

Is this the same forced_bos_token mentioned in the code? If yes, should we force it to be the </s> token, rather than ro_RO as done in the code?

ArthurZucker commented 1 year ago

Okay, indeed @LoicGrobol, we should not compute the loss on all the forced decoder ids. This seems to apply to a few models, so I will open a PR to fix these and add some documentation to properly explain all of this.

@devaansh100, no: we have the bos_token_id (</s>) and the forced_decoder_ids (<LID>), which should ensure that we start with those 2 tokens.
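
As a hedged sanity check of that point (assuming the facebook/mbart-large-50-many-to-many-mmt checkpoint from the docs example above), both starting tokens are fixed outside the model's learned distribution:

from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")

print(model.config.decoder_start_token_id)  # expected: 2, i.e. </s> - generation always starts here
print(tokenizer.lang_code_to_id["ro_RO"])   # expected: 250020 - passed as forced_bos_token_id to generate()
# so generated sequences begin [2, 250020, ...] regardless of what the model would have predicted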

Thanks both of you for your feedback.

LoicGrobol commented 1 year ago

Note that currently this training seems to be needed to work well with generate in e.g. mBART, which uses </s> LANGID as its forced prompt (the loss on the LANGID could indeed be skipped, though). I guess that would have to be changed too, but I don't know how to make that change work with existing pretrained models.

ArthurZucker commented 1 year ago

Not sure I understand why we would need to change generate? We should not.

ArthurZucker commented 1 year ago

The only logic that needs to be updated is computing the loss on the lang token. The rest of the training procedure etc. is still correct! It's just that we don't want the model to learn/update its distribution when predicting the second token, because it can vary during training but will always be fixed at inference.

devaansh100 commented 1 year ago

Got it! I guess the only change would then be in the "labels" from the tokenizer - where there was LANGID initially, we would have a 1/-100?
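
For instance, a minimal sketch of that change (hypothetical - the actual fix might live in the tokenizer or in a data collator, and -100 assumes the default CrossEntropyLoss ignore_index):

import torch

labels = torch.tensor([[250020, 47711, 7844, 127666, 2]])  # [LID] tokens </s>
lang_code_ids = torch.tensor([250020])  # in practice, the ids of all language codes

masked_labels = labels.clone()
masked_labels[torch.isin(masked_labels, lang_code_ids)] = -100  # ignored by the loss
print(masked_labels)  # tensor([[ -100, 47711, 7844, 127666, 2]])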

ArthurZucker commented 1 year ago

Yep! That should be it 😉

devaansh100 commented 1 year ago

Thank you for all the help!

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.