Hi @seanexp, thanks for the issue and for your message. I think this might indicate a bug in your dataset. Also, are you doing full fine-tuning or are you using QLoRA?
Hi @younesbelkada!
I was using Anthropic/hh-rlhf from the Hugging Face Hub, and full fine-tuning was applied.
I also observed a similar pattern with the CarperAI/openai_summarize_comparisons dataset.
Same issues: 'loss': 0.6914, 'eval_loss': 0.69140625, 'eval_accuracy': 1.0, and the output logits of the chosen and rejected samples both grow larger and larger (up to 300+) during training.
@seanexp buddy, have you solved the problem?
@zhengyanzhao1997
Still, I haven't. I tried to dig into compute_loss and prediction_step but couldn't find the cause.
Me too... I tried adding an L2-norm loss on the output logits, but it didn't work 😢
Well, my rough guess is that the problem has nothing to do with regularization.
But sadly, I have no other hypothesis :(
When all inputs are padded to the same length, the loss decreases normally. [My experimental results]
But I don't understand why this is happening either.
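(Not from the thread, but for concreteness: a minimal sketch of padding every chosen/rejected pair to one fixed length, using the column names RewardTrainer expects; the tokenize_pair helper and max_length=512 are illustrative assumptions.)

```python
# Hypothetical preprocessing sketch: pad every pair to the same fixed length.
def tokenize_pair(example, tokenizer, max_length=512):
    chosen = tokenizer(example["chosen"], padding="max_length",
                       truncation=True, max_length=max_length)
    rejected = tokenizer(example["rejected"], padding="max_length",
                         truncation=True, max_length=max_length)
    return {
        "input_ids_chosen": chosen["input_ids"],
        "attention_mask_chosen": chosen["attention_mask"],
        "input_ids_rejected": rejected["input_ids"],
        "attention_mask_rejected": rejected["attention_mask"],
    }
```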
Hi @zhengyanzhao1997
I think I found one potential cause of weird RM behavior.
Currently, RewardTrainer expects tokenizer.padding_side == "right", as can be seen below. However, some models' default tokenizer.padding_side is "left" (Mistral-7B, for example). So we should set padding_side = "right" before train/eval.
@younesbelkada
If you think the modification is necessary, I'll send a PR so that RewardTrainer raises an error if tokenizer.padding_side == "left".
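(A minimal sketch of the suggested workaround; the Mistral checkpoint is just an example of a tokenizer that is reported in this thread to pad on the left by default.)

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
print(tokenizer.padding_side)      # reported as "left" for this model's default tokenizer
tokenizer.padding_side = "right"   # RewardTrainer expects right padding
```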
I also found that examples/scripts/reward_modeling.py works as expected, but simply replacing facebook/opt-350m with EleutherAI/pythia-1.4b shows weird training loss and accuracy.
Now I suspect the weird behavior is due to the tokenizer or model config.
After you made these modifications, was your training successful?
No, it wasn't. Simply changing the facebook/opt-350m model to something else still leads to weird behavior.
I have the same issue, have you solved the problem? I also use Mistral-7B. The training loss is also 0.6934, the chosen and rejected scores are the same (and large, around 195), and the eval accuracy is 1.0.
So we should set padding_side = "right" before train/eval.
@supermancmk Did you change it to padding_side = "right"? Mistral-7B's default padding_side is "left". If the training doesn't go well afterwards, please leave a comment here.
Yes, I changed tokenizer.padding_side = "right", but it's still the same problem.
This is very weird... @younesbelkada do you have anything in mind?
@seanexp Have you solved the problem with padding_side = "right", or in some other way? And have you tried the LLaMA model?
I tried Mistral-7b, pythia-1.4b, and bloom-560m, and only bloom-560m seems to work. I couldn't solve the problem with padding_side = "right".
Thanks~
@zhengyanzhao1997 @supermancmk
I think I found a solution and it seems to work (at least for the early phase of training)
I added nn.init.zeros_(model.score.weight) after model loading, and here are the logged metrics:
- pythia-1.4b model without zero init
- bloom-560m model without zero init (it worked from the beginning)
- pythia-1.4b model with zero init

The experiments were done using the CarperAI/openai_summarize_comparisons dataset.
Please let me know if this solution works in your case.
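(For anyone who wants to reproduce the zero-init workaround, a minimal sketch, assuming a transformers *ForSequenceClassification model whose reward head is the linear layer named score, as for Pythia/Llama:)

```python
import torch.nn as nn
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "EleutherAI/pythia-1.4b", num_labels=1
)
# Zero-initialize the reward head so chosen/rejected logits start at 0.
nn.init.zeros_(model.score.weight)
```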
@seanexp We successfully trained our 7B reward model with ZeRO-1 (just switching deepspeed from zero3 to zero1).
Did you try deepspeed zero3? Does it work?
NO😭
@supermancmk
I tried deepspeed zero2. Well, I can't see any connection between deepspeed zero stage and this weird behavior...
I used zero3 with the deepspeed-chat step training scripts: same issue, loss = 0.69140625.
After nn.init.zeros_(model.score.weight), the issue remains.
The direct cause is that the input of the sigmoid is zero, so the loss equals 0.69140625 when the dtype is bfloat16; it is not related to the zero stage settings. In my case the issue came from data processing: the chosen and rejected fields were the same. With the data fixed, the loss decreases as expected.
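(A quick check of the bfloat16 claim above: -log σ(0) = log 2 ≈ 0.6931 in float32, which rounds to exactly 0.69140625 in bfloat16.)

```python
import torch
import torch.nn.functional as F

loss = -F.logsigmoid(torch.tensor(0.0))   # log(2) ≈ 0.6931 in float32
print(loss.item())
print(loss.to(torch.bfloat16).item())     # 0.69140625
```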
Hello! What do you mean by "the chosen and rejected fields are the same"?
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
I have the same issue, with loss = 0.6914. Has anyone solved the problem?
Training loss converges to 0.691 which means that the reward model cannot tell the difference between chosen and rejected.
How is this inferred?
I added nn.init.zeros_(model.score.weight) after model loading and here are the logged metrics
OpenAI's model init is the following if you want to give it a try.
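(The referenced snippet did not survive in this thread. Purely as a hedged reconstruction, and not the code that was originally posted: OpenAI's lm-human-preferences-style reward head is commonly described as drawing the head weights from a normal with std = 1/sqrt(hidden_size + 1); treat that constant as an assumption.)

```python
# Hedged reconstruction, NOT the snippet referenced above. Assumes a reward head
# named `score`, as in transformers *ForSequenceClassification models.
import torch

def init_reward_head(model):
    hidden_size = model.config.hidden_size
    # Assumed OpenAI-style init: N(0, 1/(hidden_size + 1)) for the weights, zero bias.
    torch.nn.init.normal_(model.score.weight, std=1.0 / (hidden_size + 1) ** 0.5)
    if getattr(model.score, "bias", None) is not None:
        torch.nn.init.zeros_(model.score.bias)
```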
Meanwhile, the accuracy on test dataset is close to 1, which means the model can almost perfectly tell the difference between chosen and rejected.
This doesn't sound very likely. Can you try evaluating the models manually on, say, 100 samples to see how it actually goes?
Hi, I only achieve about 60% accuracy on the HH eval set using the example reward modeling script (I have tried facebook/opt-350m and mistral-7b as the reward models). Does anyone know why?
I hit the same issue here with deepspeed stage 3.
I have the same issue. I trained Pythia models (410m, 1.4b) with RewardTrainer, but the loss is just stuck at 0.69. I printed the rewards of chosen and rejected, and they are really close, which may indicate that the model did not learn to distinguish between chosen and rejected completions. I've checked that the chosen and rejected completions are not the same. Does anybody know exactly why this error occurs?
@younesbelkada do you have any thought about this issue?
Hello, I've fixed the issue by specifying model.config.pad_token_id = tokenizer.pad_token_id. (Make sure your model.config.pad_token_id is not None.)
If model.config.pad_token_id is set to None, the last (padded) token is selected for the projection. Please take a look at the internal code of LlamaForSequenceClassification: https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/modeling_llama.py#L1414
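(A minimal sketch of this fix, assuming a *ForSequenceClassification reward model; the checkpoint name and the eos-as-pad choice are illustrative.)

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "mistralai/Mistral-7B-v0.1"   # illustrative
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token   # this model ships without a pad token

model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)
# Keep the model config in sync so the last *non-pad* token is pooled for the score.
model.config.pad_token_id = tokenizer.pad_token_id
```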
Some updates on @DSKSD's comments!
If the model.config.pad_token_id is set to None, it selects the last padded token for the projection. Please take a look at the internal code of LlamaForSequenceClassification. https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/modeling_llama.py#L1414
Use this link (I fixed the commit id): https://github.com/huggingface/transformers/blob/0a7af19f4dc868bafc82f35eb7e8d13bac87a594/src/transformers/models/llama/modeling_llama.py#L1390
Also, I verified that:
- EleutherAI/pythia-1.4b has no pad token (ref: link)
- mistralai/Mistral-7B-v0.1 has no pad token (ref: link)
- bigscience/bloom-560m has a pad token
- facebook/opt-350m has a pad token

This could be the explanation behind https://github.com/huggingface/trl/issues/937#issuecomment-1803332120 👀
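(A quick way to check this yourself; the checkpoints mirror the ones listed above.)

```python
from transformers import AutoTokenizer

for name in ["EleutherAI/pythia-1.4b", "mistralai/Mistral-7B-v0.1",
             "bigscience/bloom-560m", "facebook/opt-350m"]:
    tok = AutoTokenizer.from_pretrained(name)
    print(name, "->", tok.pad_token)   # None means no pad token is defined
```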
Hi, thanks for maintaining this awesome project.
I slightly modified examples/scripts/reward_modeling.py and found that the tracked training loss and accuracy are weird. Here is my modified script, and below are the training loss and accuracy.
Training loss converges to 0.691, which means that the reward model cannot tell the difference between chosen and rejected. Meanwhile, the accuracy on the test dataset is close to 1, which means the model can almost perfectly tell the difference between chosen and rejected.
Is this expected behavior?
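(For context on why a loss stuck near 0.69 suggests "no separation": the standard pairwise reward-model loss is -log σ(r_chosen − r_rejected), which equals log 2 ≈ 0.693 whenever the two rewards are identical. A minimal sketch with illustrative values:)

```python
import torch
import torch.nn.functional as F

rewards_chosen = torch.tensor([1.5, 0.2, -0.3])
rewards_rejected = torch.tensor([1.5, 0.2, -0.3])   # identical -> no separation

loss = -F.logsigmoid(rewards_chosen - rewards_rejected).mean()
print(loss.item())   # 0.6931...
```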