Open · scarydemon2 opened this issue 1 year ago
According to the readme, "We have found that it is very unstable to use different generation training batch sizes (--per_device_train_batch_size) and PPO training batch sizes (--per_device_mini_batch_size), more than one PPO training epoch (--ppo_epochs), or more than one generation batch size (--generation_batch_numbers)." Did you set --per_device_train_batch_size and --per_device_mini_batch_size to the same value?
I'm not using the official settings unchanged, but those two are the same: per_device_train_batch_size=3 and per_device_mini_batch_size=3. My dataset is custom and its average length is larger than the 256 used in the official script, so I set --max_answer_seq_len=500 and --max_prompt_seq_len=500 to avoid OOM. The rest of the settings are the same as the official ones.
Hello, did you solve it? My average reward is still not increasing during training.
I found a problem. It may be in model.make_experience: I printed the sequences the actor generates during training, and the actor's samples were very bad. I think this may be why the average reward is very low and doesn't increase.
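In case it helps anyone reproduce this check, here is a rough sketch of how the sampled sequences can be printed during experience generation (the helper name and arguments are illustrative, not code from the repo; it assumes a Hugging Face tokenizer and that `seq` holds the prompt tokens followed by the generated tokens):

```python
def log_actor_samples(tokenizer, seq, prompt_len, max_samples=2):
    """Decode and print a few sampled sequences each step so that collapse
    (e.g. endless '\n' repetition) shows up early in training."""
    for i, ids in enumerate(seq[:max_samples]):
        prompt = tokenizer.decode(ids[:prompt_len], skip_special_tokens=True)
        answer = tokenizer.decode(ids[prompt_len:], skip_special_tokens=True)
        print(f"--- sample {i} ---")
        print(f"prompt: {prompt!r}")
        print(f"answer: {answer!r}")
```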
Yes, you're right. And the prompts are also not very harmful (I think more difficult prompts would be more effective). I used LLaMA-7B as my base model, but it seems the model can't sample many good responses. Could you share your base model size? BTW, do you think the reward model has an impact on this issue? I'm using the hh-rlhf dataset now, and I found the reward model's accuracy is only 66%. If the reward model is not strong, will the average reward still increase even with correct PPO settings?
Hi, sorry for the late reply. I have actually solved this problem. The cause in my case was that the KL-divergence coefficient beta (kl_ctl) was too small; it is 0.02 in the official repo. After several training steps my model deviated a lot from the reference model and collapsed, so the output was just the same meaningless token repeated (like \n\n\n\n\n\n\n\n\n\n ...). These meaningless outputs made my reward model return the same score every time. To fix this I increased kl_ctl from 0.02 to 0.2-0.4, which brought the reward score back to normal. By the way, I also tried the adaptive kl_ctl adjustment mentioned in the PPO paper, and it performed worse than a static beta=0.2.
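To clarify what kl_ctl does: it scales the per-token KL penalty that is folded into the reward before PPO, so a larger value pulls the actor back toward the reference model. A rough sketch of that mechanism (illustrative code, not the repo's exact implementation; it assumes all tensors cover only the generated response tokens and that action_mask marks valid tokens):

```python
import torch

def compute_kl_penalized_rewards(log_probs, ref_log_probs, reward_score,
                                 action_mask, kl_ctl=0.2, clip_reward=5.0):
    """Sketch of folding a per-token KL penalty into the PPO reward.

    log_probs / ref_log_probs: (batch, num_response_tokens) token log-probs
    from the actor and the frozen reference model; reward_score: (batch,)
    scalar score from the reward model; action_mask: 1.0 for valid
    (left-aligned) response tokens, 0.0 for padding.
    """
    # Per-token penalty: the further the actor drifts from the reference
    # model, the more negative the reward. A small kl_ctl (e.g. 0.02)
    # barely constrains the actor; 0.2-0.4 constrains it much more.
    rewards = -kl_ctl * (log_probs - ref_log_probs) * action_mask

    # Add the (clipped) scalar reward-model score at the last valid token.
    last_idx = (action_mask.sum(dim=1).long() - 1).clamp(min=0)
    batch_idx = torch.arange(rewards.size(0), device=rewards.device)
    rewards[batch_idx, last_idx] += torch.clamp(reward_score, -clip_reward, clip_reward)
    return rewards
```

With kl_ctl=0.02 the penalty is tiny, so the actor can drift far enough to collapse into repeated tokens; at 0.2-0.4 the penalty kicks in much sooner and keeps the outputs close to the reference model.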
@scarydemon2 - hi, I noticed that the step 3 PPO in this repo uses not only the KL penalty but also the clipped surrogate objective for the actor. Is that redundant? Any reply would be appreciated.
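By "clipped surrogate objective" I mean the standard PPO actor loss sketched below (an illustrative sketch, not the repo's exact code). My understanding is that the KL penalty (scaled by kl_ctl) keeps the actor close to the reference model through the reward, while clipping limits how far a single update can move the policy away from the old policy that generated the rollouts, so maybe both are intentional; I would just like to confirm.

```python
import torch

def actor_clipped_loss(log_probs, old_log_probs, advantages, mask, clip_eps=0.2):
    """Standard PPO clipped surrogate loss for the actor (sketch).

    log_probs: token log-probs under the current policy;
    old_log_probs: token log-probs recorded when the experience was generated;
    advantages: per-token advantage estimates; mask: 1.0 for generated tokens.
    """
    ratio = torch.exp(log_probs - old_log_probs)
    unclipped = -advantages * ratio
    clipped = -advantages * torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # Take the element-wise maximum (the pessimistic bound of the two losses),
    # then average over the generated tokens only.
    loss = torch.max(unclipped, clipped)
    return (loss * mask).sum() / mask.sum()
```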
```
epoch: 0|step: 259|ppo_ep: 1|act_loss: 0.0253753662109375|cri_loss: 0.2144775390625|unsuper_loss: 0.0 average reward score: 0.20556640625
epoch: 0|step: 260|ppo_ep: 1|act_loss: 0.1915283203125|cri_loss: 0.326171875|unsuper_loss: 0.0 average reward score: 0.205810546875
epoch: 0|step: 261|ppo_ep: 1|act_loss: -0.1837158203125|cri_loss: 0.2259521484375|unsuper_loss: 0.0 average reward score: 0.2064208984375
epoch: 0|step: 262|ppo_ep: 1|act_loss: -0.099609375|cri_loss: 0.1646728515625|unsuper_loss: 0.0 average reward score: 0.2059326171875
epoch: 0|step: 263|ppo_ep: 1|act_loss: -0.07781982421875|cri_loss: 0.28271484375|unsuper_loss: 0.0 average reward score: 0.20654296875
epoch: 0|step: 264|ppo_ep: 1|act_loss: 0.10009765625|cri_loss: 0.303955078125|unsuper_loss: 0.0 average reward score: 0.2060546875
epoch: 0|step: 265|ppo_ep: 1|act_loss: 0.10357666015625|cri_loss: 0.332275390625|unsuper_loss: 0.0 average reward score: 0.2078857421875
epoch: 0|step: 266|ppo_ep: 1|act_loss: -0.062744140625|cri_loss: 0.23828125|unsuper_loss: 0.0 average reward score: 0.2061767578125
epoch: 0|step: 267|ppo_ep: 1|act_loss: 0.1456298828125|cri_loss: 0.33837890625|unsuper_loss: 0.0 average reward score: 0.2064208984375
epoch: 0|step: 268|ppo_ep: 1|act_loss: 0.0635986328125|cri_loss: 0.20068359375|unsuper_loss: 0.0 average reward score: 0.207275390625
[2023-06-09 00:06:07,687] [INFO] [logging.py:96:log_dist] [Rank 0] step=270, skipped=10, lr=[1.1237076437413556e-05, 1.1237076437413556e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-09 00:06:07,820] [INFO] [timer.py:208:stop] epoch=0/micro_step=270/global_step=270, RunningAvgSamplesPerSec=2.183121824503433, CurrSamplesPerSec=11.893856302438598, MemAllocated=49.03GB, MaxMemAllocated=57.62GB
[2023-06-09 00:06:08,154] [INFO] [logging.py:96:log_dist] [Rank 0] step=270, skipped=3, lr=[4.6543648237896e-06, 4.6543648237896e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
epoch: 0|step: 269|ppo_ep: 1|act_loss: -0.03240966796875|cri_loss: 0.1427001953125|unsuper_loss: 0.0 average reward score: 0.205078125
epoch: 0|step: 270|ppo_ep: 1|act_loss: 0.342041015625|cri_loss: 0.377685546875|unsuper_loss: 0.0 average reward score: 0.2064208984375
epoch: 0|step: 271|ppo_ep: 1|act_loss: 0.138427734375|cri_loss: 0.2430419921875|unsuper_loss: 0.0 average reward score: 0.205322265625
epoch: 0|step: 272|ppo_ep: 1|act_loss: 0.1181640625|cri_loss: 0.21337890625|unsuper_loss: 0.0 average reward score: 0.20703125
epoch: 0|step: 273|ppo_ep: 1|act_loss: 0.06524658203125|cri_loss: 0.1839599609375|unsuper_loss: 0.0 average reward score: 0.206298828125
epoch: 0|step: 274|ppo_ep: 1|act_loss: 0.07135009765625|cri_loss: 0.1356201171875|unsuper_loss: 0.0 average reward score: 0.2081298828125
epoch: 0|step: 275|ppo_ep: 1|act_loss: 0.066650390625|cri_loss: 0.2161865234375|unsuper_loss: 0.0 average reward score: 0.2071533203125
epoch: 0|step: 276|ppo_ep: 1|act_loss: 0.05303955078125|cri_loss: 0.2177734375|unsuper_loss: 0.0 average reward score: 0.2059326171875
epoch: 0|step: 277|ppo_ep: 1|act_loss: 0.015899658203125|cri_loss: 0.1387939453125|unsuper_loss: 0.0 average reward score: 0.2060546875
epoch: 0|step: 278|ppo_ep: 1|act_loss: -0.0144195556640625|cri_loss: 0.26025390625|unsuper_loss: 0.0 average reward score: 0.20556640625
[2023-06-09 00:20:13,519] [INFO] [logging.py:96:log_dist] [Rank 0] step=280, skipped=10, lr=[1.1141143057005536e-05, 1.1141143057005536e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-09 00:20:13,652] [INFO] [timer.py:208:stop] epoch=0/micro_step=280/global_step=280, RunningAvgSamplesPerSec=2.2499286746168656, CurrSamplesPerSec=13.297137840544718, MemAllocated=49.03GB, MaxMemAllocated=57.62GB
[2023-06-09 00:20:13,986] [INFO] [logging.py:96:log_dist] [Rank 0] step=280, skipped=3, lr=[4.612866045608177e-06, 4.612866045608177e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
epoch: 0|step: 279|ppo_ep: 1|act_loss: -0.038238525390625|cri_loss: 0.2322998046875|unsuper_loss: 0.0 average reward score: 0.2037353515625
epoch: 0|step: 280|ppo_ep: 1|act_loss: -0.03887939453125|cri_loss: 0.264404296875|unsuper_loss: 0.0 average reward score: 0.2056884765625
epoch: 0|step: 281|ppo_ep: 1|act_loss: -0.0809326171875|cri_loss: 0.325927734375|unsuper_loss: 0.0 average reward score: 0.205078125
epoch: 0|step: 282|ppo_ep: 1|act_loss: -0.0087890625|cri_loss: 0.281982421875|unsuper_loss: 0.0 average reward score: 0.205322265625
epoch: 0|step: 283|ppo_ep: 1|act_loss: -0.1871337890625|cri_loss: 0.302734375|unsuper_loss: 0.0 average reward score: 0.205078125
epoch: 0|step: 284|ppo_ep: 1|act_loss: -0.126220703125|cri_loss: 0.2880859375|unsuper_loss: 0.0 average reward score: 0.2052001953125
epoch: 0|step: 285|ppo_ep: 1|act_loss: -0.07843017578125|cri_loss: 0.2890625|unsuper_loss: 0.0 average reward score: 0.207275390625
epoch: 0|step: 286|ppo_ep: 1|act_loss: -0.0885009765625|cri_loss: 0.240478515625|unsuper_loss: 0.0 average reward score: 0.2061767578125
epoch: 0|step: 287|ppo_ep: 1|act_loss: -0.035888671875|cri_loss: 0.24755859375|unsuper_loss: 0.0 average reward score: 0.2069091796875
epoch: 0|step: 288|ppo_ep: 1|act_loss: 0.01471710205078125|cri_loss: 0.2418212890625|unsuper_loss: 0.0 average reward score: 0.20458984375
[2023-06-09 00:34:19,262] [INFO] [logging.py:96:log_dist] [Rank 0] step=290, skipped=10, l
```
The loss was oscillating at the beginning, but collapsed after about 200 steps.
I tested the model both during the period of loss oscillation and after the collapse; in both cases its performance was far worse than the original model, and it could not produce normal outputs.
During Step 3 training, the reward score of my language model collapsed to a fixed value and the model's output became completely chaotic. Has anyone else encountered this phenomenon?