Open guijuzhejiang opened 1 year ago
hi, @guijuzhejiang , since stage 3 of RLHF uses reinforcement learning (here we use the PPO algorithm), the long training time and instability may be caused by the dataset size and the dynamic training process. Debugging RL algorithms is not easy, and we are also still validating the PPO training. You may use simple environments for testing, or visualize some statistics (such as the running mean, std, min, and max of episode returns, the KL divergence of the policy update, etc.) to check whether training is proceeding correctly.
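As a minimal, framework-agnostic sketch of the kind of stats tracking suggested above (the class and function names here are illustrative, not part of the project's API): a Welford-style running-statistics tracker for episode returns, plus the common first-order sample estimate of the policy-update KL from old/new log-probabilities.

```python
import math

class RunningStats:
    """Track running mean/std/min/max of a stream of episode returns (Welford's method)."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0
        self.min = float("inf")
        self.max = float("-inf")

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)
        self.min = min(self.min, x)
        self.max = max(self.max, x)

    @property
    def std(self):
        # Population std of the values seen so far.
        return math.sqrt(self._m2 / self.n) if self.n > 1 else 0.0

def approx_kl(old_logprobs, new_logprobs):
    """First-order sample estimate of KL(old || new): mean of (logp_old - logp_new)."""
    return sum(o - n for o, n in zip(old_logprobs, new_logprobs)) / len(old_logprobs)

# Illustrative values only; in real training you would feed in each
# episode's return and the per-token log-probs from the PPO update.
stats = RunningStats()
for ret in [-300.0, -150.0, -42.0, -10.0]:
    stats.update(ret)
kl = approx_kl([-1.2, -0.8, -2.0], [-1.1, -0.9, -1.9])
print(f"return mean={stats.mean:.1f} std={stats.std:.1f} "
      f"min={stats.min:.1f} max={stats.max:.1f} kl={kl:.4f}")
```

Printing (or sending to TensorBoard/W&B) these numbers each update makes it easy to spot divergence: a KL that keeps growing, or returns whose running mean never improves, are both red flags.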
@Camille7777 Hi. Are there options or callbacks with which I can check reward_mean, KL, or other metrics during training?
I ran the third step (PPO training); it was time-consuming and unstable. The reward observed during training ranged from about -300 to -10, as shown below. Is this normal? What does a good PPO training run look like? Is there a log I can check?
Episode [1/11]: 100%|██████████| 200/200 [1:41:33<00:00, 30.47s/it]
Episode [2/11]: 100%|██████████| 200/200 [1:45:15<00:00, 31.58s/it]
Episode [3/11]: 100%|██████████| 200/200 [1:45:56<00:00, 31.78s/it]
Episode [4/11]: 100%|██████████| 200/200 [1:45:38<00:00, 31.69s/it]
Train epoch [1/2]: 100%|██████████| 1000/1000 [1:23:58<00:00, 5.04s/it, reward=-7.67]
Train epoch [2/2]: 100%|██████████| 1000/1000 [1:23:51<00:00, 5.03s/it, reward=-7.74]
Episode [5/11]: 100%|██████████| 200/200 [4:33:10<00:00, 81.95s/it]
Episode [6/11]: 100%|██████████| 200/200 [1:44:04<00:00, 31.22s/it]
Episode [7/11]: 100%|██████████| 200/200 [1:44:18<00:00, 31.29s/it]
Episode [8/11]: 100%|██████████| 200/200 [1:44:10<00:00, 31.25s/it]
Episode [9/11]: 100%|██████████| 200/200 [1:44:24<00:00, 31.32s/it]
Episode [10/11]: 100%|█████████▉| 199/200 [1:41:53<00:30, 30.14s/it]
Train epoch [1/2]: 26%|██▌ | 261/1000 [21:50<1:02:27, 5.07s/it, reward=-188]