microsoft / DeepSpeedExamples

Example models using DeepSpeed

[Problem discussion] Critic loss cannot decrease #556

Open watermelon-lee opened 1 year ago

watermelon-lee commented 1 year ago

Here is my situation:

  1. I finished step 2 with the cohere/zhihu_query dataset. The final reward score is 5.07, the rejected score is 0.8, and the accuracy is 0.79, so step 2 seems successful.
  2. When I attempted step 3, I hit a loss scale problem, which I solved by changing the learning rates (actor & critic). Then I met another problem: the critic loss does not decrease. In many experiments it either went from 4 up to 7 or stayed around 5.

Here are my questions:

  1. I tested the actor model and found that its performance is better than the SFT model's. Is that normal?
  2. The actor loss = -advantage * clip(ratio). From my log, the actor loss changed from -0.1 to -2, and clip(ratio) is around 0.8-1.2, which means the advantage is bigger than 0 and increased during training. The advantage measures whether the action taken by the actor model is better or worse than the average (baseline). So is a bigger advantage better, and therefore a smaller (more negative) actor loss better, since a bigger advantage makes the actor loss smaller? (See the sketch after this list.)
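For reference, here is a minimal sketch of the clipped PPO policy loss being described. It is illustrative only and not necessarily the exact DeepSpeed-Chat actor_loss_fn; the function name and arguments are assumptions.

    import torch

    def actor_loss_fn(logprobs, old_logprobs, advantages, mask, cliprange=0.2):
        # PPO clipped surrogate: minimize -min(ratio * A, clip(ratio, 1-eps, 1+eps) * A).
        ratio = torch.exp(logprobs - old_logprobs)
        loss1 = -advantages * ratio
        loss2 = -advantages * torch.clamp(ratio, 1.0 - cliprange, 1.0 + cliprange)
        # Masked mean over generated tokens: a positive advantage with ratio near 1
        # pushes this loss negative, which matches the numbers discussed above.
        return torch.sum(torch.max(loss1, loss2) * mask) / mask.sum()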

Looking forward to your reply, thanks.

TheEighthDay commented 1 year ago

How do you evaluate the statement "the actor model's performance is better than the SFT model"? Have you turned off LoRA?

And do you have any modifications for this issue?

Line 76 in ppo_trainer.py, in _generate_sequence: min_length=max_min_length

"The min_length setting force the model generate to max length, which produce repeated or nonsense result."

watermelon-lee commented 1 year ago

> How do you evaluate the statement "the actor model's performance is better than the SFT model"? Have you turned off LoRA? And do you have any modifications for this issue? Line 76 in ppo_trainer.py, in _generate_sequence: min_length=max_min_length. "The min_length setting forces the model to generate to the max length, which produces repeated or nonsense results."

  1. I use the actor model to answer the queries of the test dataset and compare with the SFT model. The mean reward score of the actor model is around 6 and the score of the SFT model is 4.5. Besides, I judged some cases myself, and more cases got better than got worse.

  2. I do not use LoRA.

  3. max_length = min_length = max_min_length is wrong; it makes the model produce biased answers, so I changed the code.
     3.1 First I deleted the line "min_length = max_min_length", but then the run gets stuck at the first generation. Maybe the sequence lengths being different on different GPUs causes this?
     3.2 Then I set eos_token_id = "xxx", where xxx is any unused id, so the model still generates up to max_min_length but the sequence contains the real eos_token_id. I remade the attention_mask to mask out the words after the eos_token (see the sketch after this list). So the sequence and attention mask look like: seq = [0, 0, 0, 0, prompt, answer, eos_token_id, other_word], mask = [0, 0, 0, 0, 1(prompt), 1(answer), 1(eos_token_id), 0(other_word)]
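A minimal sketch of the remasking step described in 3.2. The helper name, the prompt_length argument, and the tensor layout are assumptions for illustration, not the exact code used:

    import torch

    def mask_after_eos(seq, attention_mask, eos_token_id, prompt_length):
        # seq, attention_mask: [batch, prompt_length + max_answer_length]
        # Zero the attention mask for everything generated after the first real
        # EOS token in the answer, keeping the EOS token itself visible.
        answer = seq[:, prompt_length:]
        for i in range(seq.size(0)):
            eos_pos = (answer[i] == eos_token_id).nonzero(as_tuple=True)[0]
            if eos_pos.numel() > 0:
                attention_mask[i, prompt_length + eos_pos[0] + 1:] = 0
        return attention_mask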

watermelon-lee commented 1 year ago

Another question, about the critic loss:

    def critic_loss_fn(self, values, old_values, returns, mask):
        ## value loss
        values_clipped = torch.clamp(
            values,
            old_values - self.cliprange_value,
            old_values + self.cliprange_value,
        )
        vf_loss1 = (values - returns) ** 2
        vf_loss2 = (values_clipped - returns) ** 2
        vf_loss = 0.5 * torch.sum(
            torch.max(vf_loss1, vf_loss2) * mask) / mask.sum()
        return vf_loss

The log of your example: /DeepSpeedExamples/blob/master/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/training_log_output/actor_opt-1.3b_critic_opt-350m_globalBatchSize64.log

step 467: cri_loss: -0.0131072998046875

Why is the critic_loss < 0? The critic loss is the MSE of (values, returns).

Looking forward to your reply, thanks.

TheEighthDay commented 1 year ago

> 3.1 first I deleted the line "min_length = max_min_length", but then the run gets stuck at the first generation ... 3.2 then I set eos_token_id to an unused id and remade the attention_mask to mask the words after the real eos_token.

Thank you! I have encountered a similar problem and handled it in a similar way. However, I have not found the cause of getting stuck at the first generation either.

TheEighthDay commented 1 year ago

> the log of your example ... step 467: cri_loss: -0.0131072998046875 ... why is the critic_loss < 0? The critic loss is the MSE of (values, returns).

In my experiment, the critic loss will not be less than 0. (screenshot attached)

watermelon-lee commented 1 year ago

> In my experiment, the critic loss will not be less than 0.

thank you!

The log I mentioned is the official example log (/DeepSpeedExamples/blob/master/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/training_log_output/actor_opt-1.3b_critic_opt-350m_globalBatchSize64.log). A critic loss below 0 seems weird.

In my case: in step 3, the cohere/zhihu dataset has around 1300 queries, so with batch size 64 the total number of training steps is around 200. The critic loss increased from 4 to 7; in the best experiment it decreased from 7+ to 5.6.

Judging by your loss curve, maybe I need more data in this step.

TheEighthDay commented 1 year ago

Maybe you could adjust the learning rate; a learning rate around 5e-7 gave a significant change in the loss for me. My settings: Actor_Lr=9.65e-6, Critic_Lr=2e-6.

Although my loss seems to be decreasing, my performance has not improved. The actor tends to generate very long sentences, and the results are not good. I haven't found the reason yet.

watermelon-lee commented 1 year ago

> Maybe you could adjust the learning rate; a learning rate around 5e-7 gave a significant change in the loss for me. My settings: Actor_Lr=9.65e-6, Critic_Lr=2e-6.

ok I will try it. thank you!

> Although my loss seems to be decreasing, my performance has not improved. The actor tends to generate very long sentences, and the results are not good. I haven't found the reason yet.

What about your actor loss and reward score? After all, the actor model is all we need. In my log, the actor loss decreased from -0.5 to -3 and the reward increased from 3+ to 4+.

TheEighthDay commented 1 year ago

> What about your actor loss and reward score? After all, the actor model is all we need.

The reward increased from -2 to 2, and the actor loss decreased from 2 to 0.

Hi, I used different datasets and added an L2-norm constraint on the score when training the reward model, so it may not be a meaningful reference for you, and I am still debugging my PPO 😭.

watermelon-lee commented 1 year ago

Hi, I finally solved this problem. The critic model is used to estimate the value (return) of the actor. In my log the return started around 4 to 5 and increased to 18-20 at the end. The range of the return is too large, and the critic model has a hard time estimating it, so the critic loss, which is the MSE of value and return, gets bigger and bigger.
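For reference, a sketch of the advantage/return computation involved: standard GAE, in the spirit of DeepSpeed-Chat's get_advantages_and_returns but simplified (the start offset and exact signature are omitted). With gamma < 1, the discounted return of a long answer stays in a much smaller range than with gamma = 1.

    import torch

    def get_advantages_and_returns(values, rewards, gamma=0.99, lam=0.95):
        # values, rewards: [batch, answer_length]; standard GAE over the answer tokens.
        lastgaelam = 0.0
        advantages_reversed = []
        length = rewards.size(-1)
        for t in reversed(range(length)):
            nextvalues = values[:, t + 1] if t < length - 1 else 0.0
            delta = rewards[:, t] + gamma * nextvalues - values[:, t]
            lastgaelam = delta + gamma * lam * lastgaelam
            advantages_reversed.append(lastgaelam)
        advantages = torch.stack(advantages_reversed[::-1], dim=1)
        # Returns are the regression targets for the critic; with gamma = 1 they
        # accumulate the reward over the whole remaining answer.
        returns = advantages + values
        return advantages.detach(), returns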

So I used reward scaling (running mean/std; trlx has an implementation of it) and changed gamma from 1 to 0.99. These keep the return in a smaller range. Finally, the model gets good performance.
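A minimal sketch of that kind of running-std reward scaling, modeled loosely on trlx's RunningMoments; the class and method names here are illustrative, not the exact trlx or DeepSpeed-Chat code:

    import torch

    class RunningMoments:
        # Running mean/std over all rewards seen so far (Chan et al. parallel update).
        def __init__(self, epsilon: float = 1e-8):
            self.count = epsilon  # avoids division by zero before the first update
            self.mean = 0.0
            self.m2 = 0.0         # running sum of squared deviations from the mean
            self.std = 1.0

        def update(self, xs: torch.Tensor):
            xs = xs.detach().float().flatten()
            batch_count = xs.numel()
            batch_mean = xs.mean().item()
            batch_m2 = ((xs - batch_mean) ** 2).sum().item()

            delta = batch_mean - self.mean
            total = self.count + batch_count
            self.mean += delta * batch_count / total
            self.m2 += batch_m2 + delta ** 2 * self.count * batch_count / total
            self.count = total
            self.std = max((self.m2 / self.count) ** 0.5, 1e-8)
            return self.mean, self.std

    # Hypothetical usage inside the PPO step, before computing advantages:
    #   self.running.update(reward_score)
    #   reward_score = reward_score / self.running.std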

TheEighthDay commented 1 year ago

Thanks!

If possible, I would like to know the range of your reward, act_loss, and advantage. I have already added the advantage whitening from trlx, and I will also add the reward scaling trick from trlx.

Additionally, I tried using ShareGPT data for the ptx loss, but it doesn't seem to have any effect on the process. Maybe you can try it too.

Thanks again.

watermelon-lee commented 1 year ago

reward score: roughly -10 to 10+ (clip_reward_value = 10)
act_loss: [-2, 2]
advantage: [-3, 3], usually [-1, 1]

I don't use the ptx loss for now, but I will try it next week to see its effect.

happy weekend :)

TheEighthDay commented 1 year ago

I attempted to add the reward scaling from trlx to DeepSpeed-Chat, but during training the running std became infinite. Can you tell me how you added this part of the code?

watermelon-lee commented 1 year ago

> I attempted to add the reward scaling from trlx to DeepSpeed-Chat, but during training the running std became infinite. Can you tell me how you added this part of the code?

  1. I met the problem too; it causes the reward score to become 0 after it happens. But the model after training was still very good.
  2. I changed the code to drop the data with inf std (see the sketch below).
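A hypothetical guard for point 2, assuming a trlx-style self.running object; this is a guess at the shape of the change, not the actual code:

    import math

    # Only rescale when the running std is finite and positive; otherwise skip
    # the scaling for this batch and use the raw (clipped) reward.
    self.running.update(reward_score)
    if math.isfinite(float(self.running.std)) and self.running.std > 0:
        reward_score = reward_score / self.running.std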

Here are my params:

    self.kl_ctl = 0.2
    self.clip_reward_value = 10
    self.cliprange = 0.2
    self.cliprange_value = 0.2
    self.gamma = 0.99
    self.lam = 0.95

plus reward scaling, and with advantage whitening disabled (it was not helpful in my case).

And I found a good blog post which introduces many useful PPO tricks: https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/

TheEighthDay commented 1 year ago

Thank you, bro! Actually, I added clipping before the update (trlx also clips before scaling), but it seems to have no effect. I'm confused and investigating the reason.

    reward_score = torch.clamp(reward_score, -self.clip_reward_value, self.clip_reward_value)
    _, _ = self.running.update(reward_score)
    reward_score /= self.running.std

watermelon-lee commented 1 year ago

I add clip before scaling too; I haven't measured its effect. But the blog does scale before clip, so maybe you can try it. The code in the blog:

    rews = np.clip(rews / np.sqrt(self.ret_rms.var + self.epsilon), -self.cliprew, self.cliprew)

TheEighthDay commented 1 year ago

> Actually, I added clipping before the update (trlx also clips before scaling), but it seems to have no effect.

Numeric overflow caused the inf error. You can try the following change to avoid it.

        # Inside the running-statistics update (xs is the batch of reward scores):
        # casting the per-batch statistics to Python floats before they are
        # accumulated avoids the numeric overflow that produced the inf std.
        if dist.is_initialized():
            xs_mean, xs_var, xs_count = get_global_statistics(xs)
            xs_mean, xs_var, xs_count = float(xs_mean), float(xs_var), float(xs_count)
        else:
            xs_count = xs.numel()
            xs_var, xs_mean = torch.var_mean(xs, unbiased=False)
            xs_mean, xs_var, xs_count = float(xs_mean), float(xs_var), float(xs_count)