tonywu71 opened this issue 1 year ago
> If the `self.processor.tokenizer.bos_token_id` is correctly set (it should not be used, in the sense that if `forced_decoder_ids` is set it will be taken instead of this one), then I don't really see a problem.
If you look at the issue I mentioned, it seems that you stated the opposite. Here's what you wrote for context:
> I'll update the documentation to make it less confusing. The token used to store the `"<|startoftranscript|>"` token is `decoder_start_token_id`. The `bos_token` is pretty much unused, which is why it was set to the same as `eos_token`.
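The token settings described there can be checked directly; a minimal sketch, assuming the `openai/whisper-small` checkpoint (other Whisper checkpoints are configured the same way):

```python
from transformers import WhisperConfig, WhisperTokenizer

config = WhisperConfig.from_pretrained("openai/whisper-small")
tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-small")

# decoder_start_token_id holds <|startoftranscript|> ...
print(config.decoder_start_token_id)                             # 50258
print(tokenizer.convert_tokens_to_ids("<|startoftranscript|>"))  # 50258

# ... while bos_token_id is simply set to the same id as eos_token_id
print(tokenizer.bos_token_id)  # 50257 (<|endoftext|>)
print(tokenizer.eos_token_id)  # 50257 (<|endoftext|>)
```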
I might be wrong, but `forced_decoder_ids` is only used with `generate` and not with `forward`. And I checked the `forward` source code: with the current collator, `batch["labels"]` would always start with `<|startoftranscript|>`. However, a redundant `<|startoftranscript|>` would also be prepended when calling `forward` with the `labels` argument while `decoder_input_ids` is `None` (see source code). Let me know if you think I'm wrong.
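A small sketch of that behavior, assuming the `shift_tokens_right` helper from `transformers.models.whisper.modeling_whisper` (the function `forward` uses to build `decoder_input_ids` from `labels`); the label ids after the start token are arbitrary placeholders:

```python
import torch
from transformers import WhisperTokenizer
from transformers.models.whisper.modeling_whisper import shift_tokens_right

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-small")
sot = tokenizer.convert_tokens_to_ids("<|startoftranscript|>")  # 50258

# Labels as produced by the current collator: they already start
# with <|startoftranscript|>.
labels = torch.tensor([[sot, 50259, 50359, 50363, 1770, 13, 50257]])

# When labels are given and decoder_input_ids is None, forward() shifts
# the labels right and prepends decoder_start_token_id, so the sequence
# now starts with two <|startoftranscript|> tokens.
decoder_input_ids = shift_tokens_right(
    labels,
    pad_token_id=tokenizer.pad_token_id,
    decoder_start_token_id=sot,
)
print(decoder_input_ids[0, :2])  # tensor([50258, 50258]) -> redundant token
```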
cc @sanchit-gandhi
In unit 5 of the audio course, the following code is used:
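A reconstruction of the collator in question is shown below; the exact code in unit 5 may differ slightly, and the line under discussion is the `bos_token_id` check:

```python
import torch
from dataclasses import dataclass
from typing import Any, Dict, List, Union


@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any

    def __call__(
        self, features: List[Dict[str, Union[List[int], torch.Tensor]]]
    ) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have different lengths
        # and need different padding methods
        input_features = [{"input_features": f["input_features"]} for f in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        label_features = [{"input_ids": f["labels"]} for f in features]
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        # replace padding with -100 so these positions are ignored by the loss
        labels = labels_batch["input_ids"].masked_fill(
            labels_batch.attention_mask.ne(1), -100
        )

        # the line under discussion: if a "bos" token was prepended during
        # tokenization, cut it here, since it is added again in forward()
        if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels
        return batch
```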
However, according to the following issue, `bos_token_id` shouldn't be used (@ArthurZucker). In my opinion, this should be replaced with `self.processor.tokenizer.convert_tokens_to_ids("<|startoftranscript|>")` or with `model.config.decoder_start_token_id`. What do you think?

Note: if this is true, then there would be a similar error in @sanchit-gandhi's fine-tuning tutorial too.
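Concretely, inside `__call__` of the collator above, the proposed fix could look like this (a sketch; `sot_id` is a hypothetical name introduced here for illustration):

```python
# Resolve <|startoftranscript|> explicitly instead of relying on
# bos_token_id (which Whisper sets equal to eos_token_id).
sot_id = self.processor.tokenizer.convert_tokens_to_ids("<|startoftranscript|>")
# ...or equivalently, read it from the model config:
# sot_id = model.config.decoder_start_token_id

if (labels[:, 0] == sot_id).all().cpu().item():
    labels = labels[:, 1:]
```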
Thanks for your attention.
Regards, Tony