@ArthurZucker, when you have bandwidth, would you like to take a look at this?
Not stale, still looking forward to a response!
Hey! This is very similar to #18133. First, I was not really able to reproduce the bug, as the output of

```python
from transformers import MBart50TokenizerFast

tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50", src_lang="en_XX", tgt_lang="ro_RO")
src_text = " UN Chief Says There Is No Military Solution in Syria"
tgt_text = "Şeful ONU declară că nu există o soluţie militară în Siria"
model_inputs = tokenizer(src_text, text_target=tgt_text, return_tensors="pt").input_ids
```

gave:

```
tensor([[250004,   8274, 127873,  25916,      7,   8622,   2071,    438,  67485,
             53, 187895,     23,  51712,      2]])
```

But in any case, I think you are right: the documentation for the two models is not aligned, since the input is not shifted in the tokenizer but rather in the model. This was already mentioned, so we might as well address it!
Hey! While the output of the tokenizer is correct (both `input_ids` and `labels` are in the same format), the labels are passed through `shift_tokens_right` to create the `decoder_input_ids`. `shift_tokens_right` expects the `<LID>` token at the end; however, `MBart50Tokenizer` puts an EOS token there, so the input to the decoder ends up being wrong.
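To make the mismatch concrete, here is a minimal sketch (the token ids are illustrative, and it assumes `shift_tokens_right` can be imported from `modeling_mbart` in your `transformers` version):

```python
import torch
from transformers.models.mbart.modeling_mbart import shift_tokens_right

pad_token_id = 1
# mBART-50-style labels: [lang_id] tokens [eos]
labels = torch.tensor([[250020, 47711, 7844, 127666, 2]])
decoder_input_ids = shift_tokens_right(labels, pad_token_id)
print(decoder_input_ids)
# The last non-pad token (here the eos, id 2) is wrapped to the front,
# giving [eos] [lang_id] tokens instead of [lang_id] tokens [eos].
```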
Regarding the reproduction of the issue - do you mean reproducing the referenced issue? If so, they are using `MBartTokenizer`, while the code mentioned here uses `MBart50Tokenizer`.
I'll come back to this soon 😉
It seems that this has been confusing a lot of people (including me, see #20610).
Let's work with your example:

```
input_ids:          en_XX UN Chief Says There Is No Military Solution in Syria</s>
labels:             ro_RO Şeful ONU declară că nu există o soluţie militară în Siria</s>
decoder_input_ids:  </s>ro_RO Şeful ONU declară că nu există o soluţie militară în Siria
```

We are interested in supervised training, where you feed the model `input_ids` and `labels`. For most encoder-decoder models, the labels are shifted to the right to build the decoder inputs, so that the model learns to predict the next token at each position (teacher forcing).

This means that if the `decoder_input_ids` are

```
[2, 250020, 47711, 7844, 127666, 8, 18347, 18147, 1362, 315, 42071, 36, 31563, 8454, 33796, 451, 346, 125577]
```

then a perfect model would predict

```
[250020, 47711, 7844, 127666, 8, 18347, 18147, 1362, 315, 42071, 36, 31563, 8454, 33796, 451, 346, 125577, 2]
```

which is then compared with the labels to compute the loss.

That is also why, when you generate (inference), you force the beginning token to be `</s>`.
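As a toy illustration of how the positions line up (purely illustrative - a random tensor stands in for the model's logits, and the vocabulary size is only indicative):

```python
import torch
import torch.nn.functional as F

vocab_size = 250054                                           # roughly the mBART-50 vocabulary size
labels = torch.tensor([[250020, 47711, 7844, 2]])             # ro_RO tokens </s>
decoder_input_ids = torch.tensor([[2, 250020, 47711, 7844]])  # </s> ro_RO tokens

# The decoder consumes decoder_input_ids and outputs one distribution per position;
# position i is compared against labels[i], i.e. the token following decoder_input_ids[i].
logits = torch.randn(1, decoder_input_ids.size(1), vocab_size)
loss = F.cross_entropy(logits.view(-1, vocab_size), labels.view(-1))
print(loss)
```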
Thanks for the clarification, @ArthurZucker! It still seems a bit wrong to expect the model to predict `ro_RO` given `</s>`. The comment `wrap the last non pad token (the <LID> token)` in the code is also somewhat confusing!
I agree with @LoicGrobol.
I also want to clarify this example from the docs:

```python
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

article_hi = "संयुक्त राष्ट्र के प्रमुख का कहना है कि सीरिया में कोई सैन्य समाधान नहीं है"
article_ar = "الأمين العام للأمم المتحدة يقول إنه لا يوجد حل عسكري في سوريا."

model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")

# translate Hindi to French
tokenizer.src_lang = "hi_IN"
encoded_hi = tokenizer(article_hi, return_tensors="pt")
generated_tokens = model.generate(**encoded_hi, forced_bos_token_id=tokenizer.lang_code_to_id["fr_XX"])
tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
# => "Le chef de l 'ONU affirme qu 'il n 'y a pas de solution militaire en Syria."

# translate Arabic to English
tokenizer.src_lang = "ar_AR"
encoded_ar = tokenizer(article_ar, return_tensors="pt")
generated_tokens = model.generate(**encoded_ar, forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"])
tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
# => "The Secretary-General of the United Nations says there is no military solution in Syria."
```
> That is also why when you generate (inference) you force the beginning token with `</s>`.

Is this the same `forced_bos_token` mentioned in the code? If yes, should we then force it to be the `</s>` token, rather than `ro_RO` as done in the code?
Okay, indeed @LoicGrobol, we should not compute the loss on all the forced decoder ids. This seems to apply to a few models, so I will open a PR to fix these and add some documentation to properly explain all of this.
@devaansh100, no, we have the `bos_token_id` (`</s>`) and the `forced_decoder_ids` (`<LID>`), which should ensure that we start with those 2 tokens.
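For instance, a quick check on the many-to-many checkpoint (a sketch - the exact ids printed depend on the checkpoint's config):

```python
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")

# Generation always starts from the configured decoder start token (</s> for mBART-50)...
print(model.config.decoder_start_token_id, tokenizer.eos_token_id)
# ...and the target language code is then forced as the second token via forced_bos_token_id.
print(tokenizer.lang_code_to_id["ro_RO"])
```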
Thanks to both of you for your feedback.
Note that currently this training scheme seems to be needed to work well with `generate` in e.g. mBART, which uses `</s> LANGID` as its forced prompt (the loss on the LANGID could indeed be skipped though). I guess that would have to be changed too, but I don't know how to make that change work with existing pretrained models.
Not sure I understand why we would need to change `generate` - we should not. The only logic that needs to be updated is computing the loss on the lang token. The rest of the training procedure is still correct! It's just that we don't want the model to learn/update its distribution when predicting the second token, because that token can vary during training but will always be fixed at inference.
Got it! I guess the only change would then be in the `labels` from the tokenizer - where there was LANGID initially, we would have a `1`/`-100`?
Yep! That should be it 😉
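For anyone landing here later, a minimal sketch of that idea (the helper name and the order of operations are mine, not an official API):

```python
import torch
from transformers.models.mbart.modeling_mbart import shift_tokens_right

def mask_lang_code_in_labels(labels: torch.Tensor, lang_code_id: int, pad_token_id: int = 1):
    """Build decoder_input_ids from the unmasked labels first, then replace the
    target language code in the labels with -100 so no loss is computed on it."""
    decoder_input_ids = shift_tokens_right(labels, pad_token_id)
    labels = labels.clone()
    labels[labels == lang_code_id] = -100
    return decoder_input_ids, labels
```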
Thank you for all the help!
System Info
transformers version: 4.23.1

Who can help?
@patil-suraj @SaulLu
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
The bug has been reproduced in the outputs of this colab notebook. The following are the steps to be followed:

In the mBART modeling file (`/usr/local/bin/python3.7/dist-packages/transformers/models/mbart/modeling_mbart.py`), on line 1352 (above `outputs = self.model(...)`, after the `if labels is not None` block), add `print(f'Decoder Input Ids: {decoder_input_ids}\nLabels: {labels}')`.
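Alternatively, the same thing can be observed without editing the library, by calling `shift_tokens_right` directly on the tokenizer output (a sketch, assuming the import path below is available in your `transformers` version):

```python
from transformers import MBart50TokenizerFast
from transformers.models.mbart.modeling_mbart import shift_tokens_right

tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50", src_lang="en_XX", tgt_lang="ro_RO")
tgt_text = "Şeful ONU declară că nu există o soluţie militară în Siria"
labels = tokenizer(text_target=tgt_text, return_tensors="pt").input_ids

# This mirrors what the model does internally when only `labels` are passed.
decoder_input_ids = shift_tokens_right(labels, tokenizer.pad_token_id)
print(f"Decoder Input Ids: {decoder_input_ids}\nLabels: {labels}")
```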
Expected behavior
I was looking into fine-tuning `facebook/mbart-large-50` through this example in the documentation. As per the description, the expected input to the model is of the form `[lang_id] tokens [eos]` for both the encoder and the decoder.

While `MBart50Tokenizer` produces outputs in the expected format, the `decoder_input_ids` get transformed into an incorrect one - `[eos] [lang_id] tokens`. Specifically, I believe the output should have been the following (do correct me if I am wrong here though):

This is caused because the `shift_tokens_right` function does not seem to be adapted for mBART-50. As per the docstring of this function, the last non-pad token is expected to be the `<LID>` token; however, for mBART-50, the last non-pad token would be an `eos`.

Additional question: Why should the `[eos]` token predict the `[lang_id]`? This happens in both mBART and mBART-50. If it should not, should the last token in the labels be `-100`? If yes, there would be a subsequent issue, since the labels matrix from the tokenizer seems to use `1` as the padding token instead of `-100`. Do let me know if I would be required to open a separate issue for that!

If this bug seems legitimate, I would be glad to provide a fix for it! I believe the `labels` key from `MBart50Tokenizer` would have to be updated to give the same output as `MBartTokenizer`.
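For comparison, a small sketch of the two tokenizers' target formats (the checkpoint names and the expected prefix/suffix in the comments are my assumptions, not verified output):

```python
from transformers import MBartTokenizer, MBart50TokenizerFast

tgt_text = "Şeful ONU declară că nu există o soluţie militară în Siria"

mbart_tok = MBartTokenizer.from_pretrained("facebook/mbart-large-cc25", src_lang="en_XX", tgt_lang="ro_RO")
mbart50_tok = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50", src_lang="en_XX", tgt_lang="ro_RO")

# mBART-25 labels end with `</s> ro_RO`, so shift_tokens_right wraps the language code to the front.
print(mbart_tok(text_target=tgt_text).input_ids[-2:])
# mBART-50 labels start with `ro_RO` and end with `</s>`, so the eos gets wrapped to the front instead.
print(mbart50_tok(text_target=tgt_text).input_ids[:1], mbart50_tok(text_target=tgt_text).input_ids[-1:])
```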