Open guijuzhejiang opened 1 year ago
hi, @guijuzhejiang , since stage 3 of RLHF uses reinforcement learning (here we use the PPO algorithm), the long training time and instability may be caused by the dataset size and the dynamic training process. Debugging RL algorithms is not easy, and we are also still validating the PPO training. You may use simple environments for testing, or visualize some statistics (such as the running mean, std, min, and max of episode returns, the KL divergence of the policy update, etc.) to check whether training is proceeding correctly.
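As a minimal, framework-agnostic sketch of the kind of stats tracking suggested above (the class and function names here are illustrative, not part of the project's API): a Welford-style running-statistics tracker for episode returns, plus the common first-order sample estimate of the policy-update KL from old/new log-probabilities.

```python
import math

class RunningStats:
    """Track running mean/std/min/max of a stream of episode returns (Welford's method)."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0
        self.min = float("inf")
        self.max = float("-inf")

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)
        self.min = min(self.min, x)
        self.max = max(self.max, x)

    @property
    def std(self):
        # Population std of the values seen so far.
        return math.sqrt(self._m2 / self.n) if self.n > 1 else 0.0

def approx_kl(old_logprobs, new_logprobs):
    """First-order sample estimate of KL(old || new): mean of (logp_old - logp_new)."""
    return sum(o - n for o, n in zip(old_logprobs, new_logprobs)) / len(old_logprobs)

# Illustrative values only; in real training you would feed in each
# episode's return and the per-token log-probs from the PPO update.
stats = RunningStats()
for ret in [-300.0, -150.0, -42.0, -10.0]:
    stats.update(ret)
kl = approx_kl([-1.2, -0.8, -2.0], [-1.1, -0.9, -1.9])
print(f"return mean={stats.mean:.1f} std={stats.std:.1f} "
      f"min={stats.min:.1f} max={stats.max:.1f} kl={kl:.4f}")
```

Printing (or sending to TensorBoard/W&B) these numbers each update makes it easy to spot divergence: a KL that keeps growing, or returns whose running mean never improves, are both red flags.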
@Camille7777 Hi. Are there options or callbacks with which I can check reward_mean, KL, or other metrics during training?
I ran the third step (PPO training); it was time-consuming and unstable. The reward observed during training ranged from about -300 to -10, as shown below. Is this normal? What does a good PPO training run look like? Is there a log I can check?
Episode [1/11]: 100%|██████████| 200/200 [1:41:33<00:00, 30.47s/it]
Episode [2/11]: 100%|██████████| 200/200 [1:45:15<00:00, 31.58s/it]
Episode [3/11]: 100%|██████████| 200/200 [1:45:56<00:00, 31.78s/it]
Episode [4/11]: 100%|██████████| 200/200 [1:45:38<00:00, 31.69s/it]
Train epoch [1/2]: 100%|██████████| 1000/1000 [1:23:58<00:00, 5.04s/it, reward=-7.67]
Train epoch [2/2]: 100%|██████████| 1000/1000 [1:23:51<00:00, 5.03s/it, reward=-7.74]
Episode [5/11]: 100%|██████████| 200/200 [4:33:10<00:00, 81.95s/it]
Episode [6/11]: 100%|██████████| 200/200 [1:44:04<00:00, 31.22s/it]
Episode [7/11]: 100%|██████████| 200/200 [1:44:18<00:00, 31.29s/it]
Episode [8/11]: 100%|██████████| 200/200 [1:44:10<00:00, 31.25s/it]
Episode [9/11]: 100%|██████████| 200/200 [1:44:24<00:00, 31.32s/it]
Episode [10/11]: 100%|█████████▉| 199/200 [1:41:53<00:30, 30.14s/it]
Train epoch [1/2]: 26%|██▌ | 261/1000 [21:50<1:02:27, 5.07s/it, reward=-188]