brando90 opened this issue 1 year ago
cc: @pacman100
Did you find a solution to this? I ran into this issue while trying to use trlx for RLHF with the Falcon model.
I ran into the same issue with Llama-2. I was wondering: what if you use the BOS token as the pad token and pad on the left?
The issue is actually with the collator for language modeling (DataCollatorForLanguageModeling), which masks out all padding tokens. Because pad is set to EOS, this includes every EOS token, so no loss is computed for EOS and, consequently, the model never learns to emit one.
To fix this, use the Seq2Seq collator and set up the masking yourself. The advantage is that you can also choose to compute the loss (and consequently back-propagate) only on generated tokens rather than on prompt/input tokens, which is also useful for efficient instruction tuning.
If you need the code for this I can provide it later, but just have a look at how to set up the Seq2Seq collator and it should be obvious.
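For reference, a minimal sketch of this approach (the model name and the `prompt`/`response` dataset fields are placeholders; `DataCollatorForSeq2Seq` pads labels with `-100`, so padded positions never contribute to the loss while the explicit EOS in the labels still does):

```python
from transformers import AutoTokenizer, DataCollatorForSeq2Seq

tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")  # placeholder model
tokenizer.pad_token = tokenizer.eos_token  # safe here: padded label positions become -100 anyway

def tokenize(example):
    # "prompt" and "response" are assumed dataset fields
    prompt_ids = tokenizer(example["prompt"], add_special_tokens=False)["input_ids"]
    response_ids = tokenizer(example["response"], add_special_tokens=False)["input_ids"]
    input_ids = prompt_ids + response_ids + [tokenizer.eos_token_id]
    # mask the prompt with -100 so loss is only computed on the response + EOS
    labels = [-100] * len(prompt_ids) + response_ids + [tokenizer.eos_token_id]
    return {"input_ids": input_ids, "attention_mask": [1] * len(input_ids), "labels": labels}

# DataCollatorForSeq2Seq pads `labels` with label_pad_token_id (-100 by default),
# so padding is ignored by the loss while the EOS appended above still trains.
collator = DataCollatorForSeq2Seq(tokenizer, label_pad_token_id=-100, return_tensors="pt")
```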
@imarquart
Could you please elaborate on your point regarding the difference between the Language Modeling collator and the Seq2Seq collator? When we have an LM like Llama, can we still use a seq2seq data collator? I have been trying to fine-tune Llama and Mistral with the LM data collator, and it seems that the EOS token never gets generated, so generation does not stop (following common practice, I'm setting tokenizer.pad_token = tokenizer.eos_token).
Somehow, manually adding an EOS token ("</s>") to the samples fixed it for me. This was despite already having add_eos_token=True in the AutoTokenizer. I'm using DataCollatorForCompletionOnlyLM as the collator, which takes the tokenizer as an input.
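For reference, appending it by hand can look like this (a minimal sketch; the model name and the "text" field are placeholders):

```python
from datasets import Dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")  # placeholder model
dataset = Dataset.from_dict({"text": ["an example sample"]})  # stand-in for your dataset

# append the EOS string to each raw sample before tokenization
dataset = dataset.map(lambda ex: {"text": ex["text"] + tokenizer.eos_token})
```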
Hmm, this is not working for me. Any extra EOS tokens appended are still ignored by DataCollatorForCompletionOnlyLM (presumably because pad is still set to eos, so the underlying LM collator masks them, as described above).
I saw the falcon blog: https://github.com/huggingface/blog/blob/main/falcon.md (also published at https://huggingface.co/blog/falcon).
I tried using it, but I noticed that setting eos = pad leads to a fine-tuned model that never generates EOS, which is a problem. What is the proper way to fix this?
Who can help? @lvwerra @younesbelkada @smangrul @pacman100 @lewtun @OlivierDehaene @pcuenca @philschmid @osanseviero
Details:
The HF falcon tutorial has the following line:
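```python
tokenizer.pad_token = tokenizer.eos_token
```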
This looks strange to me. It might make sense for pad and eos to be the same, but then why make a distinction between them in the first place?
Note it's wrong to set pad = eos: it means that during fine-tuning the model will (most likely) never be trained to output eos, since eos is treated as a pad token and never back-propagated:
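A quick way to see this (a minimal demo; gpt2 is just a stand-in tokenizer, and the masking behavior is that of DataCollatorForLanguageModeling):

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tok = AutoTokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token
collator = DataCollatorForLanguageModeling(tok, mlm=False)

batch = collator([tok("hello world" + tok.eos_token)])
print(batch["labels"])
# the EOS position comes out as -100, i.e. it is excluded from the loss,
# so no gradient ever pushes the model toward emitting EOS
```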
I saw this (here https://github.com/huggingface/transformers/issues/22794):
But this assumes the model has a pad_token. I think an additional check has to be done that the model actually has an embedding for the pad_token, so that there are no runtime errors (roughly, indexing errors when looking the token up in the embedding "table"/matrix).
But if one does that, some care might be needed to initialize the new token's embedding so that it doesn't distort generation: https://nlp.stanford.edu/~johnhew/vocab-expansion.html
code:
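A minimal sketch of that fix (add a dedicated pad token, resize the embeddings, and mean-initialize the new row as the link above suggests; the model name is a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "tiiuae/falcon-7b"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({"pad_token": "[PAD]"})  # adds exactly one new token
    model.resize_token_embeddings(len(tokenizer))
    # mean-initialize the new embedding row instead of leaving it random,
    # per https://nlp.stanford.edu/~johnhew/vocab-expansion.html
    with torch.no_grad():
        input_emb = model.get_input_embeddings().weight
        input_emb[-1] = input_emb[:-1].mean(dim=0)
        output_emb = model.get_output_embeddings()
        if output_emb is not None:  # no-op when embeddings are tied
            output_emb.weight[-1] = output_emb.weight[:-1].mean(dim=0)

model.config.pad_token_id = tokenizer.pad_token_id
```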
Modifying the model gives issues
Darn, this still doesn't work:
code:
It doesn't like the modifications to the model:
How to fix?
Errors:
cross: