Open · scarydemon2 opened this issue 1 year ago
According to the readme, "We have found that it is very unstable to use different generation training batch sizes (--per_device_train_batch_size) and PPO training batch sizes (--per_device_mini_batch_size), more than one PPO training epoch (--ppo_epochs), or more than one generation batch size (--generation_batch_numbers)." Did you set --per_device_train_batch_size and --per_device_mini_batch_size to the same value?
I'm not using the official settings unchanged, but those two are the same: per_device_train_batch_size=3 and per_device_mini_batch_size=3. My dataset is custom and its average length is larger than the 256 used in the official script, so I set --max_answer_seq_len=500 and --max_prompt_seq_len=500 to avoid OOM. The rest of the settings are the same as the official ones.
Hello, did you solve it? My average reward is still not increasing during training.
I found a problem. It may be in model.make_experience: I printed the sequences the actor generates during training, and the actor's samples were very bad. I think this may be why the average reward is very low and doesn't increase.
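In case it helps anyone reproduce this check, here is a rough sketch of how the sampled sequences can be printed during experience generation (the helper name and arguments are illustrative, not code from the repo; it assumes a Hugging Face tokenizer and that `seq` holds the prompt tokens followed by the generated tokens):

```python
def log_actor_samples(tokenizer, seq, prompt_len, max_samples=2):
    """Decode and print a few sampled sequences each step so that collapse
    (e.g. endless '\n' repetition) shows up early in training."""
    for i, ids in enumerate(seq[:max_samples]):
        prompt = tokenizer.decode(ids[:prompt_len], skip_special_tokens=True)
        answer = tokenizer.decode(ids[prompt_len:], skip_special_tokens=True)
        print(f"--- sample {i} ---")
        print(f"prompt: {prompt!r}")
        print(f"answer: {answer!r}")
```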
Yes, you're right. And the prompts are also not very harmful (I think more difficult prompts would be more effective). I used LLaMA-7B as my base model, but it seems the model can't sample many good responses. Could you share your base model size? BTW, do you think the reward model has an impact on this issue? I'm using the hh-rlhf dataset now, and I found the reward model's accuracy is only 66%. If the reward model is not strong, will the average reward still increase even with correct PPO settings?
Hi, sorry for the late reply. I have actually solved this problem. The cause in my case was that the KL-divergence coefficient beta (kl_ctl) was too small; it is 0.02 in the official repo. After several training steps my model deviated a lot from the reference model and collapsed, so the output was just the same meaningless token repeated (like \n\n\n\n\n\n\n\n\n\n ...). These meaningless outputs made my reward model return the same score every time. To fix this I increased kl_ctl from 0.02 to 0.2-0.4, which brought the reward score back to normal. By the way, I also tried the adaptive kl_ctl adjustment mentioned in the PPO paper, and it performed worse than a static beta=0.2.
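To clarify what kl_ctl does: it scales the per-token KL penalty that is folded into the reward before PPO, so a larger value pulls the actor back toward the reference model. A rough sketch of that mechanism (illustrative code, not the repo's exact implementation; it assumes all tensors cover only the generated response tokens and that action_mask marks valid tokens):

```python
import torch

def compute_kl_penalized_rewards(log_probs, ref_log_probs, reward_score,
                                 action_mask, kl_ctl=0.2, clip_reward=5.0):
    """Sketch of folding a per-token KL penalty into the PPO reward.

    log_probs / ref_log_probs: (batch, num_response_tokens) token log-probs
    from the actor and the frozen reference model; reward_score: (batch,)
    scalar score from the reward model; action_mask: 1.0 for valid
    (left-aligned) response tokens, 0.0 for padding.
    """
    # Per-token penalty: the further the actor drifts from the reference
    # model, the more negative the reward. A small kl_ctl (e.g. 0.02)
    # barely constrains the actor; 0.2-0.4 constrains it much more.
    rewards = -kl_ctl * (log_probs - ref_log_probs) * action_mask

    # Add the (clipped) scalar reward-model score at the last valid token.
    last_idx = (action_mask.sum(dim=1).long() - 1).clamp(min=0)
    batch_idx = torch.arange(rewards.size(0), device=rewards.device)
    rewards[batch_idx, last_idx] += torch.clamp(reward_score, -clip_reward, clip_reward)
    return rewards
```

With kl_ctl=0.02 the penalty is tiny, so the actor can drift far enough to collapse into repeated tokens; at 0.2-0.4 the penalty kicks in much sooner and keeps the outputs close to the reference model.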
@scarydemon2 - hi, I noticed that the step 3 PPO in this repo uses not only the KL penalty but also the clipped surrogate objective for the actor. Is that redundant? Any reply would be appreciated.
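By "clipped surrogate objective" I mean the standard PPO actor loss sketched below (an illustrative sketch, not the repo's exact code). My understanding is that the KL penalty (scaled by kl_ctl) keeps the actor close to the reference model through the reward, while clipping limits how far a single update can move the policy away from the old policy that generated the rollouts, so maybe both are intentional; I would just like to confirm.

```python
import torch

def actor_clipped_loss(log_probs, old_log_probs, advantages, mask, clip_eps=0.2):
    """Standard PPO clipped surrogate loss for the actor (sketch).

    log_probs: token log-probs under the current policy;
    old_log_probs: token log-probs recorded when the experience was generated;
    advantages: per-token advantage estimates; mask: 1.0 for generated tokens.
    """
    ratio = torch.exp(log_probs - old_log_probs)
    unclipped = -advantages * ratio
    clipped = -advantages * torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # Take the element-wise maximum (the pessimistic bound of the two losses),
    # then average over the generated tokens only.
    loss = torch.max(unclipped, clipped)
    return (loss * mask).sum() / mask.sum()
```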
```
epoch: 0|step: 259|ppo_ep: 1|act_loss: 0.0253753662109375|cri_loss: 0.2144775390625|unsuper_loss: 0.0 average reward score: 0.20556640625
epoch: 0|step: 260|ppo_ep: 1|act_loss: 0.1915283203125|cri_loss: 0.326171875|unsuper_loss: 0.0 average reward score: 0.205810546875
epoch: 0|step: 261|ppo_ep: 1|act_loss: -0.1837158203125|cri_loss: 0.2259521484375|unsuper_loss: 0.0 average reward score: 0.2064208984375
epoch: 0|step: 262|ppo_ep: 1|act_loss: -0.099609375|cri_loss: 0.1646728515625|unsuper_loss: 0.0 average reward score: 0.2059326171875
epoch: 0|step: 263|ppo_ep: 1|act_loss: -0.07781982421875|cri_loss: 0.28271484375|unsuper_loss: 0.0 average reward score: 0.20654296875
epoch: 0|step: 264|ppo_ep: 1|act_loss: 0.10009765625|cri_loss: 0.303955078125|unsuper_loss: 0.0 average reward score: 0.2060546875
epoch: 0|step: 265|ppo_ep: 1|act_loss: 0.10357666015625|cri_loss: 0.332275390625|unsuper_loss: 0.0 average reward score: 0.2078857421875
epoch: 0|step: 266|ppo_ep: 1|act_loss: -0.062744140625|cri_loss: 0.23828125|unsuper_loss: 0.0 average reward score: 0.2061767578125
epoch: 0|step: 267|ppo_ep: 1|act_loss: 0.1456298828125|cri_loss: 0.33837890625|unsuper_loss: 0.0 average reward score: 0.2064208984375
epoch: 0|step: 268|ppo_ep: 1|act_loss: 0.0635986328125|cri_loss: 0.20068359375|unsuper_loss: 0.0 average reward score: 0.207275390625
[2023-06-09 00:06:07,687] [INFO] [logging.py:96:log_dist] [Rank 0] step=270, skipped=10, lr=[1.1237076437413556e-05, 1.1237076437413556e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-09 00:06:07,820] [INFO] [timer.py:208:stop] epoch=0/micro_step=270/global_step=270, RunningAvgSamplesPerSec=2.183121824503433, CurrSamplesPerSec=11.893856302438598, MemAllocated=49.03GB, MaxMemAllocated=57.62GB
[2023-06-09 00:06:08,154] [INFO] [logging.py:96:log_dist] [Rank 0] step=270, skipped=3, lr=[4.6543648237896e-06, 4.6543648237896e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
epoch: 0|step: 269|ppo_ep: 1|act_loss: -0.03240966796875|cri_loss: 0.1427001953125|unsuper_loss: 0.0 average reward score: 0.205078125
epoch: 0|step: 270|ppo_ep: 1|act_loss: 0.342041015625|cri_loss: 0.377685546875|unsuper_loss: 0.0 average reward score: 0.2064208984375
epoch: 0|step: 271|ppo_ep: 1|act_loss: 0.138427734375|cri_loss: 0.2430419921875|unsuper_loss: 0.0 average reward score: 0.205322265625
epoch: 0|step: 272|ppo_ep: 1|act_loss: 0.1181640625|cri_loss: 0.21337890625|unsuper_loss: 0.0 average reward score: 0.20703125
epoch: 0|step: 273|ppo_ep: 1|act_loss: 0.06524658203125|cri_loss: 0.1839599609375|unsuper_loss: 0.0 average reward score: 0.206298828125
epoch: 0|step: 274|ppo_ep: 1|act_loss: 0.07135009765625|cri_loss: 0.1356201171875|unsuper_loss: 0.0 average reward score: 0.2081298828125
epoch: 0|step: 275|ppo_ep: 1|act_loss: 0.066650390625|cri_loss: 0.2161865234375|unsuper_loss: 0.0 average reward score: 0.2071533203125
epoch: 0|step: 276|ppo_ep: 1|act_loss: 0.05303955078125|cri_loss: 0.2177734375|unsuper_loss: 0.0 average reward score: 0.2059326171875
epoch: 0|step: 277|ppo_ep: 1|act_loss: 0.015899658203125|cri_loss: 0.1387939453125|unsuper_loss: 0.0 average reward score: 0.2060546875
epoch: 0|step: 278|ppo_ep: 1|act_loss: -0.0144195556640625|cri_loss: 0.26025390625|unsuper_loss: 0.0 average reward score: 0.20556640625
[2023-06-09 00:20:13,519] [INFO] [logging.py:96:log_dist] [Rank 0] step=280, skipped=10, lr=[1.1141143057005536e-05, 1.1141143057005536e-05], mom=[(0.9, 0.95), (0.9, 0.95)]
[2023-06-09 00:20:13,652] [INFO] [timer.py:208:stop] epoch=0/micro_step=280/global_step=280, RunningAvgSamplesPerSec=2.2499286746168656, CurrSamplesPerSec=13.297137840544718, MemAllocated=49.03GB, MaxMemAllocated=57.62GB
[2023-06-09 00:20:13,986] [INFO] [logging.py:96:log_dist] [Rank 0] step=280, skipped=3, lr=[4.612866045608177e-06, 4.612866045608177e-06], mom=[(0.9, 0.95), (0.9, 0.95)]
epoch: 0|step: 279|ppo_ep: 1|act_loss: -0.038238525390625|cri_loss: 0.2322998046875|unsuper_loss: 0.0 average reward score: 0.2037353515625
epoch: 0|step: 280|ppo_ep: 1|act_loss: -0.03887939453125|cri_loss: 0.264404296875|unsuper_loss: 0.0 average reward score: 0.2056884765625
epoch: 0|step: 281|ppo_ep: 1|act_loss: -0.0809326171875|cri_loss: 0.325927734375|unsuper_loss: 0.0 average reward score: 0.205078125
epoch: 0|step: 282|ppo_ep: 1|act_loss: -0.0087890625|cri_loss: 0.281982421875|unsuper_loss: 0.0 average reward score: 0.205322265625
epoch: 0|step: 283|ppo_ep: 1|act_loss: -0.1871337890625|cri_loss: 0.302734375|unsuper_loss: 0.0 average reward score: 0.205078125
epoch: 0|step: 284|ppo_ep: 1|act_loss: -0.126220703125|cri_loss: 0.2880859375|unsuper_loss: 0.0 average reward score: 0.2052001953125
epoch: 0|step: 285|ppo_ep: 1|act_loss: -0.07843017578125|cri_loss: 0.2890625|unsuper_loss: 0.0 average reward score: 0.207275390625
epoch: 0|step: 286|ppo_ep: 1|act_loss: -0.0885009765625|cri_loss: 0.240478515625|unsuper_loss: 0.0 average reward score: 0.2061767578125
epoch: 0|step: 287|ppo_ep: 1|act_loss: -0.035888671875|cri_loss: 0.24755859375|unsuper_loss: 0.0 average reward score: 0.2069091796875
epoch: 0|step: 288|ppo_ep: 1|act_loss: 0.01471710205078125|cri_loss: 0.2418212890625|unsuper_loss: 0.0 average reward score: 0.20458984375
[2023-06-09 00:34:19,262] [INFO] [logging.py:96:log_dist] [Rank 0] step=290, skipped=10, l
```
The loss was oscillating at the beginning, but collapsed after about 200 steps.
I tested the model both during the period of loss oscillation and after the collapse; in both cases its performance was far worse than the original model, and it could not produce normal outputs.
During Step 3 training, the reward score of my language model collapsed to a fixed value and the model's output became completely chaotic. Has anyone else encountered this phenomenon?