microsoft / LMOps

[minillm] hard to reproduce the result #118

Closed wutaiqiang closed 11 months ago

wutaiqiang commented 11 months ago

I am trying to distill gpt2-1.5B into gpt2-120M. Since I am using 4 A100 GPUs, I changed GPUS_PER_NODE to ${3-4}.

The batch size remains the same.
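
For context on why the GPU count can matter even when the per-GPU batch size is untouched: under plain data-parallel training, the effective (global) batch size per optimizer step is usually the per-GPU batch size times the number of GPUs times the gradient-accumulation steps. The sketch below only illustrates that arithmetic. The numbers (a per-GPU batch of 4 and an original 16-GPU setup) are hypothetical placeholders, not values taken from this issue or from the MiniLLM scripts, and both helper functions are made up for this example.

```python
def global_batch_size(per_gpu_batch: int, num_gpus: int, grad_accum_steps: int = 1) -> int:
    """Effective batch size per optimizer step under plain data parallelism."""
    return per_gpu_batch * num_gpus * grad_accum_steps


def accum_steps_to_match(target_global_batch: int, per_gpu_batch: int, num_gpus: int) -> int:
    """Gradient-accumulation steps needed to reach target_global_batch."""
    per_step = per_gpu_batch * num_gpus
    if target_global_batch % per_step != 0:
        raise ValueError("target global batch is not a multiple of per_gpu_batch * num_gpus")
    return target_global_batch // per_step


# Hypothetical values, chosen only for illustration (assumptions, not from the issue).
PER_GPU_BATCH = 4
ORIGINAL_GPUS = 16
MY_GPUS = 4

original = global_batch_size(PER_GPU_BATCH, ORIGINAL_GPUS)
mine = global_batch_size(PER_GPU_BATCH, MY_GPUS)
steps = accum_steps_to_match(original, PER_GPU_BATCH, MY_GPUS)

print(f"original global batch: {original}")
print(f"global batch on {MY_GPUS} GPUs: {mine}")
print(f"accumulation steps to match: {steps}")
```

If the reference setup did use more GPUs, raising gradient accumulation (or the per-GPU batch size) by the same factor would be one way to keep the global batch comparable; whether that explains the gap here is not established by the log.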

wutaiqiang commented 11 months ago

Then I get the log:

eval | rougeL: 21.200 | exact_match: 2.600 | rev_kl: 2.443 | lens: 58.775 | pt_loss: 3.014 | lm_loss: 3.430 | kd_loss: 2.598
train | epoch 0 | inner iter: 0/ 4 | ppo epoch: 0/ 4 | global iter: 1/ 160| tot_loss: 6.6411 | rl_loss: 3.4057 | pt_loss: 3.2354 | pg_loss: 1.6204 | reg_loss: 1.7853 | reward: -1.3770 | rev_kl: 1.8445 | stu_lens: 36.8125 | mixed_lens: 56.6875 | lr: 5.0000e-08 | scale: 2048.00 | time: 0.501 | step time: 0.501
train | epoch 0 | inner iter: 1/ 4 | ppo epoch: 0/ 4 | global iter: 2/ 160| tot_loss: 7.3838 | rl_loss: 4.4417 | pt_loss: 2.9421 | pg_loss: 1.1692 | reg_loss: 3.2725 | reward: -2.1406 | rev_kl: 2.9407 | stu_lens: 36.2500 | mixed_lens: 43.1250 | lr: 5.0000e-08 | scale: 2048.00 | time: 0.493 | step time: 0.493
train | epoch 0 | inner iter: 2/ 4 | ppo epoch: 0/ 4 | global iter: 3/ 160| tot_loss: 6.9746 | rl_loss: 3.9665 | pt_loss: 3.0081 | pg_loss: 0.8613 | reg_loss: 3.1053 | reward: -2.0117 | rev_kl: 2.8732 | stu_lens: 90.2500 | mixed_lens: 59.3125 | lr: 1.0000e-07 | scale: 2048.00 | time: 0.366 | step time: 0.366
train | epoch 0 | inner iter: 3/ 4 | ppo epoch: 0/ 4 | global iter: 4/ 160| tot_loss: 6.4336 | rl_loss: 3.7142 | pt_loss: 2.7194 | pg_loss: 0.9492 | reg_loss: 2.7650 | reward: -2.2148 | rev_kl: 2.9595 | stu_lens: 44.6875 | mixed_lens: 58.9375 | lr: 1.5000e-07 | scale: 2048.00 | time: 0.364 | step time: 0.364
train | epoch 0 | inner iter: 0/ 4 | ppo epoch: 1/ 4 | global iter: 5/ 160| tot_loss: 4.7745 | rl_loss: 1.7151 | pt_loss: 3.0595 | pg_loss: 0.6897 | reg_loss: 1.0254 | reward: -2.3086 | rev_kl: 2.4704 | stu_lens: 41.1875 | mixed_lens: 50.1875 | lr: 2.0000e-07 | scale: 2048.00 | time: 0.365 | step time: 0.365
train | epoch 0 | inner iter: 1/ 4 | ppo epoch: 1/ 4 | global iter: 6/ 160| tot_loss: 8.4113 | rl_loss: 5.2663 | pt_loss: 3.1450 | pg_loss: 2.2961 | reg_loss: 2.9702 | reward: -1.4766 | rev_kl: 2.3114 | stu_lens: 67.1250 | mixed_lens: 56.1250 | lr: 2.0000e-07 | scale: 2048.00 | time: 0.360 | step time: 0.360
train | epoch 0 | inner iter: 2/ 4 | ppo epoch: 1/ 4 | global iter: 7/ 160| tot_loss: 7.3991 | rl_loss: 4.3857 | pt_loss: 3.0135 | pg_loss: 1.0776 | reg_loss: 3.3081 | reward: -1.8721 | rev_kl: 2.6794 | stu_lens: 47.4375 | mixed_lens: 63.8750 | lr: 2.5000e-07 | scale: 2048.00 | time: 0.365 | step time: 0.365
train | epoch 0 | inner iter: 3/ 4 | ppo epoch: 1/ 4 | global iter: 8/ 160| tot_loss: 6.8751 | rl_loss: 3.5853 | pt_loss: 3.2898 | pg_loss: 0.7668 | reg_loss: 2.8186 | reward: -2.0840 | rev_kl: 3.1566 | stu_lens: 52.2500 | mixed_lens: 47.8750 | lr: 3.0000e-07 | scale: 2048.00 | time: 0.364 | step time: 0.364
train | epoch 0 | inner iter: 0/ 4 | ppo epoch: 2/ 4 | global iter: 9/ 160| tot_loss: 5.8591 | rl_loss: 2.6858 | pt_loss: 3.1733 | pg_loss: 0.8090 | reg_loss: 1.8767 | reward: -1.3281 | rev_kl: 2.4784 | stu_lens: 55.5625 | mixed_lens: 49.1250 | lr: 3.5000e-07 | scale: 2048.00 | time: 0.363 | step time: 0.363
train | epoch 0 | inner iter: 1/ 4 | ppo epoch: 2/ 4 | global iter: 10/ 160| tot_loss: 7.6280 | rl_loss: 4.5530 | pt_loss: 3.0750 | pg_loss: 1.2175 | reg_loss: 3.3355 | reward: -2.3711 | rev_kl: 2.9943 | stu_lens: 52.1875 | mixed_lens: 60.1250 | lr: 4.0000e-07 | scale: 2048.00 | time: 0.366 | step time: 0.366
train | epoch 0 | inner iter: 2/ 4 | ppo epoch: 2/ 4 | global iter: 11/ 160| tot_loss: 8.0311 | rl_loss: 4.8438 | pt_loss: 3.1873 | pg_loss: 1.5827 | reg_loss: 3.2611 | reward: -1.7314 | rev_kl: 2.2585 | stu_lens: 34.9375 | mixed_lens: 44.1875 | lr: 4.5000e-07 | scale: 2048.00 | time: 0.363 | step time: 0.363
train | epoch 0 | inner iter: 3/ 4 | ppo epoch: 2/ 4 | global iter: 12/ 160| tot_loss: 7.4314 | rl_loss: 4.3438 | pt_loss: 3.0876 | pg_loss: 1.1035 | reg_loss: 3.2402 | reward: -2.3125 | rev_kl: 2.8866 | stu_lens: 65.3125 | mixed_lens: 64.6250 | lr: 4.5000e-07 | scale: 2048.00 | time: 0.357 | step time: 0.357
train | epoch 0 | inner iter: 0/ 4 | ppo epoch: 3/ 4 | global iter: 13/ 160| tot_loss: 6.0105 | rl_loss: 2.9010 | pt_loss: 3.1095 | pg_loss: 0.7505 | reg_loss: 2.1505 | reward: -1.8770 | rev_kl: 2.5659 | stu_lens: 73.8750 | mixed_lens: 66.3750 | lr: 5.0000e-07 | scale: 2048.00 | time: 0.367 | step time: 0.367
train | epoch 0 | inner iter: 1/ 4 | ppo epoch: 3/ 4 | global iter: 14/ 160| tot_loss: 7.8888 | rl_loss: 4.7281 | pt_loss: 3.1607 | pg_loss: 1.4379 | reg_loss: 3.2901 | reward: -1.6855 | rev_kl: 2.4188 | stu_lens: 33.4375 | mixed_lens: 50.4375 | lr: 5.5000e-07 | scale: 2048.00 | time: 0.366 | step time: 0.366
train | epoch 0 | inner iter: 2/ 4 | ppo epoch: 3/ 4 | global iter: 15/ 160| tot_loss: 6.9703 | rl_loss: 3.8623 | pt_loss: 3.1080 | pg_loss: 1.1477 | reg_loss: 2.7146 | reward: -2.4434 | rev_kl: 2.8736 | stu_lens: 60.0625 | mixed_lens: 53.5000 | lr: 6.0000e-07 | scale: 2048.00 | time: 0.367 | step time: 0.367
train | epoch 0 | inner iter: 3/ 4 | ppo epoch: 3/ 4 | global iter: 16/ 160| tot_loss: 7.3526 | rl_loss: 4.0999 | pt_loss: 3.2527 | pg_loss: 1.0451 | reg_loss: 3.0547 | reward: -1.7373 | rev_kl: 2.7595 | stu_lens: 40.6250 | mixed_lens: 47.7500 | lr: 6.5000e-07 | scale: 2048.00 | time: 0.366 | step time: 0.366
train | epoch 1 | inner iter: 0/ 4 | ppo epoch: 0/ 4 | global iter: 17/ 160| tot_loss: 6.5685 | rl_loss: 3.4502 | pt_loss: 3.1183 | pg_loss: 0.7241 | reg_loss: 2.7262 | reward: -2.3555 | rev_kl: 2.2575 | stu_lens: 43.0000 | mixed_lens: 66.9375 | lr: 7.0000e-07 | scale: 2048.00 | time: 2.041 | step time: 2.041
train | epoch 1 | inner iter: 1/ 4 | ppo epoch: 0/ 4 | global iter: 18/ 160| tot_loss: 6.6946 | rl_loss: 3.5981 | pt_loss: 3.0966 | pg_loss: 1.5656 | reg_loss: 2.0325 | reward: -2.7012 | rev_kl: 2.4096 | stu_lens: 47.0000 | mixed_lens: 43.9375 | lr: 7.5000e-07 | scale: 2048.00 | time: 0.363 | step time: 0.363
train | epoch 1 | inner iter: 2/ 4 | ppo epoch: 0/ 4 | global iter: 19/ 160| tot_loss: 6.4938 | rl_loss: 3.5470 | pt_loss: 2.9468 | pg_loss: 1.2713 | reg_loss: 2.2757 | reward: -1.4473 | rev_kl: 3.6484 | stu_lens: 92.0625 | mixed_lens: 81.1250 | lr: 8.0000e-07 | scale: 2048.00 | time: 0.365 | step time: 0.365
eval | rougeL: 21.269 | exact_match: 3.100 | rev_kl: 2.411 | lens: 58.546 | pt_loss: 3.014 | lm_loss: 3.430 | kd_loss: 2.597
train | epoch 1 | inner iter: 3/ 4 | ppo epoch: 0/ 4 | global iter: 20/ 160| tot_loss: 5.8548 | rl_loss: 2.6765 | pt_loss: 3.1783 | pg_loss: 0.4331 | reg_loss: 2.2434 | reward: -1.2764 | rev_kl: 2.6105 | stu_lens: 74.5625 | mixed_lens: 65.2500 | lr: 8.5000e-07 | scale: 2048.00 | time: 0.367 | step time: 0.367
train | epoch 1 | inner iter: 0/ 4 | ppo epoch: 1/ 4 | global iter: 21/ 160| tot_loss: 6.9614 | rl_loss: 3.6870 | pt_loss: 3.2743 | pg_loss: 0.8843 | reg_loss: 2.8028 | reward: -2.1367 | rev_kl: 2.5854 | stu_lens: 53.1875 | mixed_lens: 60.6250 | lr: 9.0000e-07 | scale: 2048.00 | time: 0.364 | step time: 0.364
train | epoch 1 | inner iter: 1/ 4 | ppo epoch: 1/ 4 | global iter: 22/ 160| tot_loss: 6.1811 | rl_loss: 3.2499 | pt_loss: 2.9312 | pg_loss: 0.8077 | reg_loss: 2.4423 | reward: -0.9253 | rev_kl: 2.1572 | stu_lens: 66.3750 | mixed_lens: 81.0000 | lr: 9.5000e-07 | scale: 2048.00 | time: 0.366 | step time: 0.366
train | epoch 1 | inner iter: 2/ 4 | ppo epoch: 1/ 4 | global iter: 23/ 160| tot_loss: 5.1112 | rl_loss: 2.0210 | pt_loss: 3.0902 | pg_loss: 0.3897 | reg_loss: 1.6313 | reward: -1.9258 | rev_kl: 2.7548 | stu_lens: 78.8125 | mixed_lens: 62.5625 | lr: 1.0000e-06 | scale: 2048.00 | time: 0.365 | step time: 0.365
train | epoch 1 | inner iter: 3/ 4 | ppo epoch: 1/ 4 | global iter: 24/ 160| tot_loss: 6.7964 | rl_loss: 3.7504 | pt_loss: 3.0460 | pg_loss: 1.0821 | reg_loss: 2.6683 | reward: -2.7910 | rev_kl: 3.4285 | stu_lens: 58.2500 | mixed_lens: 53.0625 | lr: 1.0500e-06 | scale: 2048.00 | time: 0.364 | step time: 0.364
train | epoch 1 | inner iter: 0/ 4 | ppo epoch: 2/ 4 | global iter: 25/ 160| tot_loss: 6.5395 | rl_loss: 3.4161 | pt_loss: 3.1233 | pg_loss: 0.7052 | reg_loss: 2.7109 | reward: -1.4795 | rev_kl: 2.3824 | stu_lens: 52.5000 | mixed_lens: 82.5000 | lr: 1.1000e-06 | scale: 2048.00 | time: 0.364 | step time: 0.364
train | epoch 1 | inner iter: 1/ 4 | ppo epoch: 2/ 4 | global iter: 26/ 160| tot_loss: 6.3982 | rl_loss: 3.0625 | pt_loss: 3.3358 | pg_loss: 0.6713 | reg_loss: 2.3911 | reward: -1.3428 | rev_kl: 2.6665 | stu_lens: 94.7500 | mixed_lens: 75.9375 | lr: 1.1500e-06 | scale: 2048.00 | time: 0.364 | step time: 0.364
train | epoch 1 | inner iter: 2/ 4 | ppo epoch: 2/ 4 | global iter: 27/ 160| tot_loss: 5.5353 | rl_loss: 2.4059 | pt_loss: 3.1294 | pg_loss: 0.4697 | reg_loss: 1.9362 | reward: -2.4395 | rev_kl: 3.1394 | stu_lens: 69.0000 | mixed_lens: 57.8125 | lr: 1.2000e-06 | scale: 2048.00 | time: 0.371 | step time: 0.371
train | epoch 1 | inner iter: 3/ 4 | ppo epoch: 2/ 4 | global iter: 28/ 160| tot_loss: 8.0861 | rl_loss: 4.7915 | pt_loss: 3.2946 | pg_loss: 2.2176 | reg_loss: 2.5739 | reward: -2.5176 | rev_kl: 2.7376 | stu_lens: 40.3750 | mixed_lens: 41.0000 | lr: 1.2500e-06 | scale: 2048.00 | time: 0.365 | step time: 0.365
train | epoch 1 | inner iter: 0/ 4 | ppo epoch: 3/ 4 | global iter: 29/ 160| tot_loss: 6.5176 | rl_loss: 3.3472 | pt_loss: 3.1705 | pg_loss: 0.7683 | reg_loss: 2.5789 | reward: -1.5840 | rev_kl: 2.6401 | stu_lens: 75.7500 | mixed_lens: 59.9375 | lr: 1.3000e-06 | scale: 2048.00 | time: 0.366 | step time: 0.366
train | epoch 1 | inner iter: 1/ 4 | ppo epoch: 3/ 4 | global iter: 30/ 160| tot_loss: 6.6756 | rl_loss: 3.7310 | pt_loss: 2.9446 | pg_loss: 0.9884 | reg_loss: 2.7426 | reward: -2.5059 | rev_kl: 2.5335 | stu_lens: 58.5625 | mixed_lens: 63.5000 | lr: 1.3500e-06 | scale: 2048.00 | time: 0.367 | step time: 0.367
train | epoch 1 | inner iter: 2/ 4 | ppo epoch: 3/ 4 | global iter: 31/ 160| tot_loss: 6.0135 | rl_loss: 2.8948 | pt_loss: 3.1187 | pg_loss: 0.6172 | reg_loss: 2.2776 | reward: -1.7734 | rev_kl: 2.6676 | stu_lens: 63.7500 | mixed_lens: 67.0000 | lr: 1.4000e-06 | scale: 2048.00 | time: 0.364 | step time: 0.364
train | epoch 1 | inner iter: 3/ 4 | ppo epoch: 3/ 4 | global iter: 32/ 160| tot_loss: 5.7564 | rl_loss: 2.4033 | pt_loss: 3.3532 | pg_loss: 0.7497 | reg_loss: 1.6536 | reward: -1.9160 | rev_kl: 3.0847 | stu_lens: 58.5625 | mixed_lens: 66.8125 | lr: 1.4500e-06 | scale: 2048.00 | time: 0.367 | step time: 0.367
train | epoch 2 | inner iter: 0/ 4 | ppo epoch: 0/ 4 | global iter: 33/ 160| tot_loss: 7.4187 | rl_loss: 4.3375 | pt_loss: 3.0811 | pg_loss: 1.6900 | reg_loss: 2.6475 | reward: -1.6006 | rev_kl: 1.8722 | stu_lens: 51.9375 | mixed_lens: 56.6875 | lr: 1.5000e-06 | scale: 2048.00 | time: 0.596 | step time: 0.596
train | epoch 2 | inner iter: 1/ 4 | ppo epoch: 0/ 4 | global iter: 34/ 160| tot_loss: 6.8117 | rl_loss: 3.7359 | pt_loss: 3.0758 | pg_loss: 0.8222 | reg_loss: 2.9137 | reward: -1.2988 | rev_kl: 2.2187 | stu_lens: 68.2500 | mixed_lens: 71.1875 | lr: 1.5500e-06 | scale: 2048.00 | time: 0.365 | step time: 0.365
train | epoch 2 | inner iter: 2/ 4 | ppo epoch: 0/ 4 | global iter: 35/ 160| tot_loss: 6.7694 | rl_loss: 3.6155 | pt_loss: 3.1540 | pg_loss: 0.7880 | reg_loss: 2.8275 | reward: -1.1836 | rev_kl: 2.4549 | stu_lens: 62.8125 | mixed_lens: 78.6875 | lr: 1.6000e-06 | scale: 2048.00 | time: 0.369 | step time: 0.369
train | epoch 2 | inner iter: 3/ 4 | ppo epoch: 0/ 4 | global iter: 36/ 160| tot_loss: 7.3283 | rl_loss: 4.2418 | pt_loss: 3.0865 | pg_loss: 1.0094 | reg_loss: 3.2323 | reward: -1.7773 | rev_kl: 2.3933 | stu_lens: 98.4375 | mixed_lens: 64.7500 | lr: 1.6500e-06 | scale: 2048.00 | time: 0.369 | step time: 0.369
train | epoch 2 | inner iter: 0/ 4 | ppo epoch: 1/ 4 | global iter: 37/ 160| tot_loss: 7.0674 | rl_loss: 3.9013 | pt_loss: 3.1661 | pg_loss: 1.2715 | reg_loss: 2.6297 | reward: -1.2119 | rev_kl: 2.0438 | stu_lens: 77.8750 | mixed_lens: 48.2500 | lr: 1.7000e-06 | scale: 2048.00 | time: 0.367 | step time: 0.367
train | epoch 2 | inner iter: 1/ 4 | ppo epoch: 1/ 4 | global iter: 38/ 160| tot_loss: 8.3326 | rl_loss: 5.2152 | pt_loss: 3.1175 | pg_loss: 2.4236 | reg_loss: 2.7915 | reward: -1.5059 | rev_kl: 2.1339 | stu_lens: 50.8750 | mixed_lens: 53.8125 | lr: 1.7500e-06 | scale: 2048.00 | time: 0.363 | step time: 0.363
train | epoch 2 | inner iter: 2/ 4 | ppo epoch: 1/ 4 | global iter: 39/ 160| tot_loss: 6.6875 | rl_loss: 3.9191 | pt_loss: 2.7684 | pg_loss: 1.0256 | reg_loss: 2.8936 | reward: -1.3027 | rev_kl: 2.5136 | stu_lens: 101.9375 | mixed_lens: 81.1250 | lr: 1.8000e-06 | scale: 2048.00 | time: 0.364 | step time: 0.364
eval | rougeL: 21.167 | exact_match: 2.800 | rev_kl: 2.446 | lens: 65.074 | pt_loss: 3.013 | lm_loss: 3.432 | kd_loss: 2.593
train | epoch 2 | inner iter: 3/ 4 | ppo epoch: 1/ 4 | global iter: 40/ 160| tot_loss: 6.4726 | rl_loss: 3.4598 | pt_loss: 3.0127 | pg_loss: 0.4929 | reg_loss: 2.9669 | reward: -1.8398 | rev_kl: 2.2477 | stu_lens: 50.7500 | mixed_lens: 88.1250 | lr: 1.8500e-06 | scale: 2048.00 | time: 0.366 | step time: 0.366
train | epoch 2 | inner iter: 0/ 4 | ppo epoch: 2/ 4 | global iter: 41/ 160| tot_loss: 6.7486 | rl_loss: 3.9185 | pt_loss: 2.8302 | pg_loss: 1.1422 | reg_loss: 2.7763 | reward: -1.9473 | rev_kl: 2.3467 | stu_lens: 60.8750 | mixed_lens: 60.3750 | lr: 1.9000e-06 | scale: 2048.00 | time: 0.364 | step time: 0.364
train | epoch 2 | inner iter: 1/ 4 | ppo epoch: 2/ 4 | global iter: 42/ 160| tot_loss: 7.0553 | rl_loss: 4.0452 | pt_loss: 3.0101 | pg_loss: 0.9356 | reg_loss: 3.1096 | reward: -1.9658 | rev_kl: 2.9078 | stu_lens: 93.1875 | mixed_lens: 70.7500 | lr: 1.9500e-06 | scale: 2048.00 | time: 0.365 | step time: 0.365
train | epoch 2 | inner iter: 2/ 4 | ppo epoch: 2/ 4 | global iter: 43/ 160| tot_loss: 6.1622 | rl_loss: 3.1271 | pt_loss: 3.0351 | pg_loss: 0.5073 | reg_loss: 2.6197 | reward: -0.7446 | rev_kl: 1.7225 | stu_lens: 54.8750 | mixed_lens: 78.5625 | lr: 2.0000e-06 | scale: 2048.00 | time: 0.366 | step time: 0.366
train | epoch 2 | inner iter: 3/ 4 | ppo epoch: 2/ 4 | global iter: 44/ 160| tot_loss: 7.6874 | rl_loss: 4.6414 | pt_loss: 3.0460 | pg_loss: 1.5165 | reg_loss: 3.1249 | reward: -1.2021 | rev_kl: 1.9620 | stu_lens: 72.5000 | mixed_lens: 61.6250 | lr: 2.0500e-06 | scale: 2048.00 | time: 0.365 | step time: 0.365
train | epoch 2 | inner iter: 0/ 4 | ppo epoch: 3/ 4 | global iter: 45/ 160| tot_loss: 7.1404 | rl_loss: 3.9895 | pt_loss: 3.1509 | pg_loss: 1.1287 | reg_loss: 2.8608 | reward: -1.3838 | rev_kl: 2.0007 | stu_lens: 45.5625 | mixed_lens: 51.0625 | lr: 2.1000e-06 | scale: 2048.00 | time: 0.370 | step time: 0.370
train | epoch 2 | inner iter: 1/ 4 | ppo epoch: 3/ 4 | global iter: 46/ 160| tot_loss: 6.4918 | rl_loss: 3.2966 | pt_loss: 3.1952 | pg_loss: 0.4387 | reg_loss: 2.8580 | reward: -1.7373 | rev_kl: 2.2316 | stu_lens: 88.8750 | mixed_lens: 69.1875 | lr: 2.1500e-06 | scale: 2048.00 | time: 0.367 | step time: 0.367
train | epoch 2 | inner iter: 2/ 4 | ppo epoch: 3/ 4 | global iter: 47/ 160| tot_loss: 7.1947 | rl_loss: 4.3234 | pt_loss: 2.8713 | pg_loss: 1.4248 | reg_loss: 2.8986 | reward: -1.1377 | rev_kl: 2.2990 | stu_lens: 89.5625 | mixed_lens: 80.9375 | lr: 2.2000e-06 | scale: 2048.00 | time: 0.368 | step time: 0.368
train | epoch 2 | inner iter: 3/ 4 | ppo epoch: 3/ 4 | global iter: 48/ 160| tot_loss: 7.4945 | rl_loss: 4.0246 | pt_loss: 3.4700 | pg_loss: 1.1872 | reg_loss: 2.8374 | reward: -1.6016 | rev_kl: 2.4077 | stu_lens: 57.4375 | mixed_lens: 70.1250 | lr: 2.2500e-06 | scale: 2048.00 | time: 0.365 | step time: 0.365
train | epoch 3 | inner iter: 0/ 4 | ppo epoch: 0/ 4 | global iter: 49/ 160| tot_loss: 6.2650 | rl_loss: 3.1572 | pt_loss: 3.1078 | pg_loss: 0.5124 | reg_loss: 2.6448 | reward: -2.0586 | rev_kl: 2.3399 | stu_lens: 76.0625 | mixed_lens: 57.7500 | lr: 2.3000e-06 | scale: 2048.00 | time: 0.645 | step time: 0.645
train | epoch 3 | inner iter: 1/ 4 | ppo epoch: 0/ 4 | global iter: 50/ 160| tot_loss: 6.8729 | rl_loss: 3.9255 | pt_loss: 2.9474 | pg_loss: 1.4219 | reg_loss: 2.5036 | reward: -1.5967 | rev_kl: 3.5179 | stu_lens: 74.5000 | mixed_lens: 72.9375 | lr: 2.3500e-06 | scale: 2048.00 | time: 0.364 | step time: 0.364
train | epoch 3 | inner iter: 2/ 4 | ppo epoch: 0/ 4 | global iter: 51/ 160| tot_loss: 5.9044 | rl_loss: 2.8643 | pt_loss: 3.0401 | pg_loss: 0.5225 | reg_loss: 2.3418 | reward: -2.4766 | rev_kl: 2.5543 | stu_lens: 42.3750 | mixed_lens: 55.2500 | lr: 2.4000e-06 | scale: 2048.00 | time: 0.364 | step time: 0.364
train | epoch 3 | inner iter: 3/ 4 | ppo epoch: 0/ 4 | global iter: 52/ 160| tot_loss: 6.9690 | rl_loss: 3.9482 | pt_loss: 3.0209 | pg_loss: 1.5998 | reg_loss: 2.3483 | reward: -1.6885 | rev_kl: 3.1211 | stu_lens: 44.8750 | mixed_lens: 47.7500 | lr: 2.4500e-06 | scale: 2048.00 | time: 0.365 | step time: 0.365
train | epoch 3 | inner iter: 0/ 4 | ppo epoch: 1/ 4 | global iter: 53/ 160| tot_loss: 5.0386 | rl_loss: 2.0322 | pt_loss: 3.0064 | pg_loss: 0.7039 | reg_loss: 1.3283 | reward: -1.6348 | rev_kl: 2.8166 | stu_lens: 58.8750 | mixed_lens: 62.1875 | lr: 2.5000e-06 | scale: 2048.00 | time: 0.364 | step time: 0.364
train | epoch 3 | inner iter: 1/ 4 | ppo epoch: 1/ 4 | global iter: 54/ 160| tot_loss: 5.6647 | rl_loss: 2.4420 | pt_loss: 3.2227 | pg_loss: 0.6343 | reg_loss: 1.8076 | reward: -2.1270 | rev_kl: 3.2066 | stu_lens: 62.1250 | mixed_lens: 47.9375 | lr: 2.5500e-06 | scale: 2048.00 | time: 0.367 | step time: 0.367
train | epoch 3 | inner iter: 2/ 4 | ppo epoch: 1/ 4 | global iter: 55/ 160| tot_loss: 6.9695 | rl_loss: 3.9436 | pt_loss: 3.0259 | pg_loss: 1.2622 | reg_loss: 2.6814 | reward: -2.0723 | rev_kl: 2.5787 | stu_lens: 50.4375 | mixed_lens: 51.5625 | lr: 2.6000e-06 | scale: 2048.00 | time: 0.363 | step time: 0.363
train | epoch 3 | inner iter: 3/ 4 | ppo epoch: 1/ 4 | global iter: 56/ 160| tot_loss: 6.6155 | rl_loss: 3.4524 | pt_loss: 3.1631 | pg_loss: 0.6906 | reg_loss: 2.7618 | reward: -1.9854 | rev_kl: 2.9313 | stu_lens: 66.3750 | mixed_lens: 72.0000 | lr: 2.6500e-06 | scale: 2048.00 | time: 0.365 | step time: 0.365
train | epoch 3 | inner iter: 0/ 4 | ppo epoch: 2/ 4 | global iter: 57/ 160| tot_loss: 5.0205 | rl_loss: 1.8792 | pt_loss: 3.1413 | pg_loss: 0.3246 | reg_loss: 1.5546 | reward: -2.3691 | rev_kl: 2.1796 | stu_lens: 61.6875 | mixed_lens: 40.7500 | lr: 2.7000e-06 | scale: 2048.00 | time: 0.367 | step time: 0.367
train | epoch 3 | inner iter: 1/ 4 | ppo epoch: 2/ 4 | global iter: 58/ 160| tot_loss: 7.2452 | rl_loss: 4.2972 | pt_loss: 2.9479 | pg_loss: 1.8306 | reg_loss: 2.4666 | reward: -1.8516 | rev_kl: 3.0155 | stu_lens: 73.3125 | mixed_lens: 65.4375 | lr: 2.7500e-06 | scale: 2048.00 | time: 0.365 | step time: 0.365
train | epoch 3 | inner iter: 2/ 4 | ppo epoch: 2/ 4 | global iter: 59/ 160| tot_loss: 5.7360 | rl_loss: 2.8458 | pt_loss: 2.8902 | pg_loss: 0.6121 | reg_loss: 2.2337 | reward: -1.3867 | rev_kl: 2.4098 | stu_lens: 68.0000 | mixed_lens: 56.3750 | lr: 2.8000e-06 | scale: 2048.00 | time: 0.368 | step time: 0.368
eval | rougeL: 21.786 | exact_match: 2.800 | rev_kl: 2.304 | lens: 65.498 | pt_loss: 3.012 | lm_loss: 3.439 | kd_loss: 2.584
train | epoch 3 | inner iter: 3/ 4 | ppo epoch: 2/ 4 | global iter: 60/ 160| tot_loss: 7.2182 | rl_loss: 3.8136 | pt_loss: 3.4046 | pg_loss: 0.9123 | reg_loss: 2.9013 | reward: -2.2109 | rev_kl: 3.9282 | stu_lens: 34.8125 | mixed_lens: 71.1250 | lr: 2.8500e-06 | scale: 2048.00 | time: 0.371 | step time: 0.371
train | epoch 3 | inner iter: 0/ 4 | ppo epoch: 3/ 4 | global iter: 61/ 160| tot_loss: 6.2466 | rl_loss: 3.0557 | pt_loss: 3.1909 | pg_loss: 1.1088 | reg_loss: 1.9469 | reward: -2.2109 | rev_kl: 3.3022 | stu_lens: 48.6875 | mixed_lens: 60.5625 | lr: 2.9000e-06 | scale: 2048.00 | time: 0.364 | step time: 0.364
train | epoch 3 | inner iter: 1/ 4 | ppo epoch: 3/ 4 | global iter: 62/ 160| tot_loss: 6.3855 | rl_loss: 3.3054 | pt_loss: 3.0801 | pg_loss: 0.8967 | reg_loss: 2.4086 | reward: -1.7705 | rev_kl: 2.4576 | stu_lens: 55.5000 | mixed_lens: 60.6875 | lr: 2.9500e-06 | scale: 2048.00 | time: 0.366 | step time: 0.366
train | epoch 3 | inner iter: 2/ 4 | ppo epoch: 3/ 4 | global iter: 63/ 160| tot_loss: 6.6113 | rl_loss: 3.3963 | pt_loss: 3.2150 | pg_loss: 0.6002 | reg_loss: 2.7961 | reward: -1.7734 | rev_kl: 2.8392 | stu_lens: 60.0000 | mixed_lens: 55.5625 | lr: 3.0000e-06 | scale: 2048.00 | time: 0.367 | step time: 0.367
train | epoch 3 | inner iter: 3/ 4 | ppo epoch: 3/ 4 | global iter: 64/ 160| tot_loss: 6.0692 | rl_loss: 2.8235 | pt_loss: 3.2457 | pg_loss: 0.9977 | reg_loss: 1.8258 | reward: -2.0645 | rev_kl: 2.9342 | stu_lens: 73.6250 | mixed_lens: 56.8750 | lr: 3.0500e-06 | scale: 2048.00 | time: 0.367 | step time: 0.367
train | epoch 4 | inner iter: 0/ 4 | ppo epoch: 0/ 4 | global iter: 65/ 160| tot_loss: 5.5424 | rl_loss: 2.2079 | pt_loss: 3.3345 | pg_loss: 1.2006 | reg_loss: 1.0073 | reward: -1.7334 | rev_kl: 2.3148 | stu_lens: 81.7500 | mixed_lens: 51.9375 | lr: 3.1000e-06 | scale: 2048.00 | time: 0.411 | step time: 0.411
train | epoch 4 | inner iter: 1/ 4 | ppo epoch: 0/ 4 | global iter: 66/ 160| tot_loss: 6.4153 | rl_loss: 3.2441 | pt_loss: 3.1712 | pg_loss: 0.6562 | reg_loss: 2.5879 | reward: -1.0107 | rev_kl: 2.7347 | stu_lens: 69.0000 | mixed_lens: 75.8750 | lr: 3.1500e-06 | scale: 2048.00 | time: 0.364 | step time: 0.364
train | epoch 4 | inner iter: 2/ 4 | ppo epoch: 0/ 4 | global iter: 67/ 160| tot_loss: 6.8531 | rl_loss: 3.6034 | pt_loss: 3.2498 | pg_loss: 0.7934 | reg_loss: 2.8099 | reward: -2.5996 | rev_kl: 2.8171 | stu_lens: 90.5625 | mixed_lens: 93.6250 | lr: 3.2000e-06 | scale: 2048.00 | time: 0.364 | step time: 0.364
train | epoch 4 | inner iter: 3/ 4 | ppo epoch: 0/ 4 | global iter: 68/ 160| tot_loss: 6.7168 | rl_loss: 3.3578 | pt_loss: 3.3590 | pg_loss: 0.6544 | reg_loss: 2.7034 | reward: -2.1895 | rev_kl: 2.2260 | stu_lens: 61.1250 | mixed_lens: 73.5625 | lr: 3.2500e-06 | scale: 2048.00 | time: 0.368 | step time: 0.368
train | epoch 4 | inner iter: 0/ 4 | ppo epoch: 1/ 4 | global iter: 69/ 160| tot_loss: 6.2529 | rl_loss: 3.2390 | pt_loss: 3.0139 | pg_loss: 0.4003 | reg_loss: 2.8388 | reward: -1.2725 | rev_kl: 2.6633 | stu_lens: 78.4375 | mixed_lens: 86.6875 | lr: 3.3000e-06 | scale: 2048.00 | time: 0.366 | step time: 0.366
train | epoch 4 | inner iter: 1/ 4 | ppo epoch: 1/ 4 | global iter: 70/ 160| tot_loss: 6.9206 | rl_loss: 3.6456 | pt_loss: 3.2750 | pg_loss: 0.8808 | reg_loss: 2.7648 | reward: -2.6641 | rev_kl: 2.9559 | stu_lens: 78.8125 | mixed_lens: 87.7500 | lr: 3.3500e-06 | scale: 2048.00 | time: 0.365 | step time: 0.365
train | epoch 4 | inner iter: 2/ 4 | ppo epoch: 1/ 4 | global iter: 71/ 160| tot_loss: 6.2344 | rl_loss: 3.5199 | pt_loss: 2.7145 | pg_loss: 1.8389 | reg_loss: 1.6810 | reward: -1.7363 | rev_kl: 1.9849 | stu_lens: 59.8750 | mixed_lens: 52.6250 | lr: 3.3500e-06 | scale: 1024.00 | time: 0.358 | step time: 0.358
train | epoch 4 | inner iter: 3/ 4 | ppo epoch: 1/ 4 | global iter: 72/ 160| tot_loss: 5.2262 | rl_loss: 2.2246 | pt_loss: 3.0016 | pg_loss: 0.7458 | reg_loss: 1.4788 | reward: -1.8594 | rev_kl: 2.4886 | stu_lens: 85.3125 | mixed_lens: 67.9375 | lr: 3.4000e-06 | scale: 1024.00 | time: 0.364 | step time: 0.364
train | epoch 4 | inner iter: 0/ 4 | ppo epoch: 2/ 4 | global iter: 73/ 160| tot_loss: 6.0056 | rl_loss: 2.9609 | pt_loss: 3.0447 | pg_loss: 0.6945 | reg_loss: 2.2664 | reward: -1.7812 | rev_kl: 2.6694 | stu_lens: 57.3125 | mixed_lens: 61.8125 | lr: 3.4500e-06 | scale: 1024.00 | time: 0.364 | step time: 0.364
train | epoch 4 | inner iter: 1/ 4 | ppo epoch: 2/ 4 | global iter: 74/ 160| tot_loss: 6.3009 | rl_loss: 3.4539 | pt_loss: 2.8470 | pg_loss: 0.6392 | reg_loss: 2.8147 | reward: -0.7598 | rev_kl: 2.4806 | stu_lens: 133.7500 | mixed_lens: 128.8750 | lr: 3.5000e-06 | scale: 1024.00 | time: 0.364 | step time: 0.364
train | epoch 4 | inner iter: 2/ 4 | ppo epoch: 2/ 4 | global iter: 75/ 160| tot_loss: 5.6579 | rl_loss: 2.7076 | pt_loss: 2.9502 | pg_loss: 0.5680 | reg_loss: 2.1396 | reward: -1.4854 | rev_kl: 2.4697 | stu_lens: 56.9375 | mixed_lens: 62.0000 | lr: 3.5500e-06 | scale: 1024.00 | time: 0.368 | step time: 0.368
train | epoch 4 | inner iter: 3/ 4 | ppo epoch: 2/ 4 | global iter: 76/ 160| tot_loss: 7.6500 | rl_loss: 4.4674 | pt_loss: 3.1826 | pg_loss: 2.0777 | reg_loss: 2.3897 | reward: -3.5078 | rev_kl: 2.4729 | stu_lens: 54.4375 | mixed_lens: 42.3125 | lr: 3.6000e-06 | scale: 1024.00 | time: 0.368 | step time: 0.368
train | epoch 4 | inner iter: 0/ 4 | ppo epoch: 3/ 4 | global iter: 77/ 160| tot_loss: 5.7953 | rl_loss: 2.7314 | pt_loss: 3.0639 | pg_loss: 0.3760 | reg_loss: 2.3554 | reward: -2.2715 | rev_kl: 2.8485 | stu_lens: 52.9375 | mixed_lens: 60.3125 | lr: 3.6500e-06 | scale: 1024.00 | time: 0.366 | step time: 0.366
train | epoch 4 | inner iter: 1/ 4 | ppo epoch: 3/ 4 | global iter: 78/ 160| tot_loss: 5.6848 | rl_loss: 2.4691 | pt_loss: 3.2157 | pg_loss: 0.3626 | reg_loss: 2.1065 | reward: -0.7622 | rev_kl: 2.3950 | stu_lens: 96.0625 | mixed_lens: 104.3125 | lr: 3.7000e-06 | scale: 1024.00 | time: 0.364 | step time: 0.364
train | epoch 4 | inner iter: 2/ 4 | ppo epoch: 3/ 4 | global iter: 79/ 160| tot_loss: 7.3166 | rl_loss: 4.4231 | pt_loss: 2.8935 | pg_loss: 1.5328 | reg_loss: 2.8902 | reward: -2.8418 | rev_kl: 2.4538 | stu_lens: 58.6875 | mixed_lens: 46.1875 | lr: 3.7500e-06 | scale: 1024.00 | time: 0.366 | step time: 0.366
eval | rougeL: 23.098 | exact_match: 3.300 | rev_kl: 2.128 | lens: 74.373 | pt_loss: 3.014 | lm_loss: 3.450 | kd_loss: 2.578
train | epoch 4 | inner iter: 3/ 4 | ppo epoch: 3/ 4 | global iter: 80/ 160| tot_loss: 6.5506 | rl_loss: 3.5699 | pt_loss: 2.9807 | pg_loss: 1.1578 | reg_loss: 2.4121 | reward: -1.6592 | rev_kl: 2.3953 | stu_lens: 94.7500 | mixed_lens: 84.1875 | lr: 3.8000e-06 | scale: 1024.00 | time: 0.368 | step time: 0.368
train | epoch 5 | inner iter: 0/ 4 | ppo epoch: 0/ 4 | global iter: 81/ 160| tot_loss: 7.0416 | rl_loss: 3.6395 | pt_loss: 3.4021 | pg_loss: 0.8753 | reg_loss: 2.7642 | reward: -1.7236 | rev_kl: 2.4504 | stu_lens: 32.3125 | mixed_lens: 62.1875 | lr: 3.8500e-06 | scale: 1024.00 | time: 0.365 | step time: 0.365
train | epoch 5 | inner iter: 1/ 4 | ppo epoch: 0/ 4 | global iter: 82/ 160| tot_loss: 5.4813 | rl_loss: 2.2122 | pt_loss: 3.2690 | pg_loss: 0.8378 | reg_loss: 1.3745 | reward: -1.8965 | rev_kl: 2.2782 | stu_lens: 61.3125 | mixed_lens: 55.1875 | lr: 3.9000e-06 | scale: 1024.00 | time: 0.364 | step time: 0.364
train | epoch 5 | inner iter: 2/ 4 | ppo epoch: 0/ 4 | global iter: 83/ 160| tot_loss: 6.3257 | rl_loss: 3.0270 | pt_loss: 3.2987 | pg_loss: 0.4682 | reg_loss: 2.5588 | reward: -0.9775 | rev_kl: 2.3965 | stu_lens: 108.2500 | mixed_lens: 90.5000 | lr: 3.9500e-06 | scale: 1024.00 | time: 0.367 | step time: 0.367
train | epoch 5 | inner iter: 3/ 4 | ppo epoch: 0/ 4 | global iter: 84/ 160| tot_loss: 5.7535 | rl_loss: 3.1597 | pt_loss: 2.5938 | pg_loss: 0.7851 | reg_loss: 2.3746 | reward: -2.6172 | rev_kl: 1.7460 | stu_lens: 70.6250 | mixed_lens: 49.1875 | lr: 4.0000e-06 | scale: 1024.00 | time: 0.365 | step time: 0.365
train | epoch 5 | inner iter: 0/ 4 | ppo epoch: 1/ 4 | global iter: 85/ 160| tot_loss: 4.5958 | rl_loss: 1.4772 | pt_loss: 3.1186 | pg_loss: 0.6324 | reg_loss: 0.8448 | reward: -2.4883 | rev_kl: 1.7702 | stu_lens: 32.5000 | mixed_lens: 46.4375 | lr: 4.0500e-06 | scale: 1024.00 | time: 0.367 | step time: 0.367
train | epoch 5 | inner iter: 1/ 4 | ppo epoch: 1/ 4 | global iter: 86/ 160| tot_loss: 6.3875 | rl_loss: 3.2992 | pt_loss: 3.0883 | pg_loss: 0.6216 | reg_loss: 2.6776 | reward: -1.4160 | rev_kl: 2.6033 | stu_lens: 98.3125 | mixed_lens: 82.4375 | lr: 4.1000e-06 | scale: 1024.00 | time: 0.366 | step time: 0.366
train | epoch 5 | inner iter: 2/ 4 | ppo epoch: 1/ 4 | global iter: 87/ 160| tot_loss: 6.8292 | rl_loss: 3.8961 | pt_loss: 2.9330 | pg_loss: 1.0610 | reg_loss: 2.8351 | reward: -0.9526 | rev_kl: 2.0824 | stu_lens: 78.8750 | mixed_lens: 55.8750 | lr: 4.1500e-06 | scale: 1024.00 | time: 0.366 | step time: 0.366
train | epoch 5 | inner iter: 3/ 4 | ppo epoch: 1/ 4 | global iter: 88/ 160| tot_loss: 6.0350 | rl_loss: 2.7158 | pt_loss: 3.3192 | pg_loss: 0.8140 | reg_loss: 1.9018 | reward: -2.3574 | rev_kl: 2.4151 | stu_lens: 62.8125 | mixed_lens: 72.3125 | lr: 4.2000e-06 | scale: 1024.00 | time: 0.367 | step time: 0.367
train | epoch 5 | inner iter: 0/ 4 | ppo epoch: 2/ 4 | global iter: 89/ 160| tot_loss: 5.8060 | rl_loss: 2.8232 | pt_loss: 2.9828 | pg_loss: 0.4847 | reg_loss: 2.3384 | reward: -1.4824 | rev_kl: 2.2524 | stu_lens: 80.6875 | mixed_lens: 63.3125 | lr: 4.2500e-06 | scale: 1024.00 | time: 0.367 | step time: 0.367
train | epoch 5 | inner iter: 1/ 4 | ppo epoch: 2/ 4 | global iter: 90/ 160| tot_loss: 6.7087 | rl_loss: 3.6483 | pt_loss: 3.0604 | pg_loss: 0.9358 | reg_loss: 2.7125 | reward: -0.9570 | rev_kl: 2.4727 | stu_lens: 67.1875 | mixed_lens: 70.8750 | lr: 4.3000e-06 | scale: 1024.00 | time: 0.366 | step time: 0.366
train | epoch 5 | inner iter: 2/ 4 | ppo epoch: 2/ 4 | global iter: 91/ 160| tot_loss: 6.1429 | rl_loss: 2.8992 | pt_loss: 3.2437 | pg_loss: 0.6564 | reg_loss: 2.2427 | reward: -1.2529 | rev_kl: 2.2809 | stu_lens: 71.3125 | mixed_lens: 84.2500 | lr: 4.3500e-06 | scale: 1024.00 | time: 0.364 | step time: 0.364
train | epoch 5 | inner iter: 3/ 4 | ppo epoch: 2/ 4 | global iter: 92/ 160| tot_loss: 6.6836 | rl_loss: 3.0124 | pt_loss: 3.6713 | pg_loss: 0.8500 | reg_loss: 2.1624 | reward: -3.5215 | rev_kl: 1.8651 | stu_lens: 53.3125 | mixed_lens: 38.6250 | lr: 4.4000e-06 | scale: 1024.00 | time: 0.365 | step time: 0.365
train | epoch 5 | inner iter: 0/ 4 | ppo epoch: 3/ 4 | global iter: 93/ 160| tot_loss: 6.1867 | rl_loss: 3.1526 | pt_loss: 3.0341 | pg_loss: 0.8937 | reg_loss: 2.2589 | reward: -3.5273 | rev_kl: 1.8690 | stu_lens: 50.6875 | mixed_lens: 49.7500 | lr: 4.4500e-06 | scale: 1024.00 | time: 0.365 | step time: 0.365
train | epoch 5 | inner iter: 1/ 4 | ppo epoch: 3/ 4 | global iter: 94/ 160| tot_loss: 5.3309 | rl_loss: 1.8881 | pt_loss: 3.4429 | pg_loss: 0.0555 | reg_loss: 1.8325 | reward: -1.5225 | rev_kl: 2.3576 | stu_lens: 83.3750 | mixed_lens: 60.8750 | lr: 4.5000e-06 | scale: 1024.00 | time: 0.365 | step time: 0.365
train | epoch 5 | inner iter: 2/ 4 | ppo epoch: 3/ 4 | global iter: 95/ 160| tot_loss: 6.3594 | rl_loss: 3.3826 | pt_loss: 2.9768 | pg_loss: 0.6333 | reg_loss: 2.7492 | reward: -0.9434 | rev_kl: 2.6928 | stu_lens: 93.5000 | mixed_lens: 97.8125 | lr: 4.5500e-06 | scale: 1024.00 | time: 0.366 | step time: 0.366
train | epoch 5 | inner iter: 3/ 4 | ppo epoch: 3/ 4 | global iter: 96/ 160| tot_loss: 6.4160 | rl_loss: 3.3136 | pt_loss: 3.1024 | pg_loss: 1.8859 | reg_loss: 1.4277 | reward: -1.2217 | rev_kl: 1.9516 | stu_lens: 44.9375 | mixed_lens: 48.6250 | lr: 4.6000e-06 | scale: 1024.00 | time: 0.366 | step time: 0.366
train | epoch 6 | inner iter: 0/ 4 | ppo epoch: 0/ 4 | global iter: 97/ 160| tot_loss: 7.1303 | rl_loss: 3.8633 | pt_loss: 3.2670 | pg_loss: 1.4134 | reg_loss: 2.4500 | reward: -0.8579 | rev_kl: 2.2739 | stu_lens: 78.7500 | mixed_lens: 57.1250 | lr: 4.6500e-06 | scale: 1024.00 | time: 5.013 | step time: 5.013
train | epoch 6 | inner iter: 1/ 4 | ppo epoch: 0/ 4 | global iter: 98/ 160| tot_loss: 6.3917 | rl_loss: 3.3670 | pt_loss: 3.0246 | pg_loss: 0.9592 | reg_loss: 2.4078 | reward: -1.0117 | rev_kl: 1.8512 | stu_lens: 64.0625 | mixed_lens: 67.3750 | lr: 4.7000e-06 | scale: 1024.00 | time: 0.365 | step time: 0.365
train | epoch 6 | inner iter: 2/ 4 | ppo epoch: 0/ 4 | global iter: 99/ 160| tot_loss: 6.9055 | rl_loss: 3.3964 | pt_loss: 3.5091 | pg_loss: 0.9960 | reg_loss: 2.4004 | reward: -2.0723 | rev_kl: 3.4908 | stu_lens: 45.9375 | mixed_lens: 56.3750 | lr: 4.7500e-06 | scale: 1024.00 | time: 0.365 | step time: 0.365
eval | rougeL: 24.365 | exact_match: 3.500 | rev_kl: 2.109 | lens: 72.274 | pt_loss: 3.020 | lm_loss: 3.466 | kd_loss: 2.574
train | epoch 6 | inner iter: 3/ 4 | ppo epoch: 0/ 4 | global iter: 100/ 160| tot_loss: 7.2655 | rl_loss: 4.2092 | pt_loss: 3.0564 | pg_loss: 1.8737 | reg_loss: 2.3355 | reward: -1.5566 | rev_kl: 2.4011 | stu_lens: 99.9375 | mixed_lens: 55.2500 | lr: 4.8000e-06 | scale: 1024.00 | time: 0.367 | step time: 0.367
train | epoch 6 | inner iter: 0/ 4 | ppo epoch: 1/ 4 | global iter: 101/ 160| tot_loss: 6.7582 | rl_loss: 3.7091 | pt_loss: 3.0492 | pg_loss: 1.0853 | reg_loss: 2.6238 | reward: -1.2754 | rev_kl: 2.2437 | stu_lens: 92.8750 | mixed_lens: 72.5625 | lr: 4.8500e-06 | scale: 1024.00 | time: 0.662 | step time: 0.662
train | epoch 6 | inner iter: 1/ 4 | ppo epoch: 1/ 4 | global iter: 102/ 160| tot_loss: 5.7520 | rl_loss: 2.4874 | pt_loss: 3.2646 | pg_loss: 0.8479 | reg_loss: 1.6394 | reward: -1.2607 | rev_kl: 3.1277 | stu_lens: 70.1875 | mixed_lens: 73.5000 | lr: 4.9000e-06 | scale: 1024.00 | time: 0.369 | step time: 0.369
train | epoch 6 | inner iter: 2/ 4 | ppo epoch: 1/ 4 | global iter: 103/ 160| tot_loss: 7.1018 | rl_loss: 4.1950 | pt_loss: 2.9069 | pg_loss: 1.8200 | reg_loss: 2.3749 | reward: -1.8018 | rev_kl: 2.7699 | stu_lens: 59.0000 | mixed_lens: 36.1250 | lr: 4.9500e-06 | scale: 1024.00 | time: 3.445 | step time: 3.445
train | epoch 6 | inner iter: 3/ 4 | ppo epoch: 1/ 4 | global iter: 104/ 160| tot_loss: 6.6216 | rl_loss: 3.6208 | pt_loss: 3.0008 | pg_loss: 1.2250 | reg_loss: 2.3958 | reward: -1.1611 | rev_kl: 1.8757 | stu_lens: 66.6250 | mixed_lens: 53.9375 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.374 | step time: 0.374
train | epoch 6 | inner iter: 0/ 4 | ppo epoch: 2/ 4 | global iter: 105/ 160| tot_loss: 6.6670 | rl_loss: 3.4995 | pt_loss: 3.1676 | pg_loss: 1.0114 | reg_loss: 2.4881 | reward: -1.5938 | rev_kl: 2.7500 | stu_lens: 79.0000 | mixed_lens: 80.5625 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.368 | step time: 0.368
train | epoch 6 | inner iter: 1/ 4 | ppo epoch: 2/ 4 | global iter: 106/ 160| tot_loss: 6.2633 | rl_loss: 3.5184 | pt_loss: 2.7449 | pg_loss: 1.1150 | reg_loss: 2.4034 | reward: -1.8125 | rev_kl: 2.3870 | stu_lens: 55.0625 | mixed_lens: 50.9375 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.368 | step time: 0.368
train | epoch 6 | inner iter: 2/ 4 | ppo epoch: 2/ 4 | global iter: 107/ 160| tot_loss: 7.7823 | rl_loss: 4.6325 | pt_loss: 3.1498 | pg_loss: 2.2491 | reg_loss: 2.3835 | reward: -1.1035 | rev_kl: 1.7522 | stu_lens: 55.8125 | 
mixed_lens: 36.7500 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.365 | step time: 0.365 train | epoch 6 | inner iter: 3/ 4 | ppo epoch: 2/ 4 | global iter: 108/ 160| tot_loss: 5.4438 | rl_loss: 2.3230 | pt_loss: 3.1208 | pg_loss: 0.7814 | reg_loss: 1.5415 | reward: -0.9888 | rev_kl: 3.1278 | stu_lens: 98.8125 | mixed_lens: 67.8750 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.369 | step time: 0.369 train | epoch 6 | inner iter: 0/ 4 | ppo epoch: 3/ 4 | global iter: 109/ 160| tot_loss: 7.3565 | rl_loss: 4.1684 | pt_loss: 3.1881 | pg_loss: 1.5070 | reg_loss: 2.6614 | reward: -1.2510 | rev_kl: 3.0104 | stu_lens: 47.4375 | mixed_lens: 62.5000 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.364 | step time: 0.364 train | epoch 6 | inner iter: 1/ 4 | ppo epoch: 3/ 4 | global iter: 110/ 160| tot_loss: 6.9757 | rl_loss: 4.0425 | pt_loss: 2.9332 | pg_loss: 1.5346 | reg_loss: 2.5078 | reward: -1.9541 | rev_kl: 2.2952 | stu_lens: 88.3750 | mixed_lens: 50.9375 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.367 | step time: 0.367 train | epoch 6 | inner iter: 2/ 4 | ppo epoch: 3/ 4 | global iter: 111/ 160| tot_loss: 5.9752 | rl_loss: 2.8322 | pt_loss: 3.1430 | pg_loss: 1.1665 | reg_loss: 1.6657 | reward: -1.1338 | rev_kl: 2.7708 | stu_lens: 84.1875 | mixed_lens: 65.2500 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.365 | step time: 0.365 train | epoch 6 | inner iter: 3/ 4 | ppo epoch: 3/ 4 | global iter: 112/ 160| tot_loss: 6.2955 | rl_loss: 3.1527 | pt_loss: 3.1428 | pg_loss: 0.8947 | reg_loss: 2.2580 | reward: -1.1602 | rev_kl: 1.9407 | stu_lens: 68.6875 | mixed_lens: 57.4375 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.654 | step time: 0.654 train | epoch 7 | inner iter: 0/ 4 | ppo epoch: 0/ 4 | global iter: 113/ 160| tot_loss: 6.4900 | rl_loss: 3.4650 | pt_loss: 3.0250 | pg_loss: 0.7889 | reg_loss: 2.6761 | reward: -1.4492 | rev_kl: 2.3875 | stu_lens: 54.6250 | mixed_lens: 76.1250 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.438 | step time: 0.438 train | epoch 7 | inner iter: 1/ 4 | ppo 
epoch: 0/ 4 | global iter: 114/ 160| tot_loss: 5.8606 | rl_loss: 2.7508 | pt_loss: 3.1097 | pg_loss: 0.4064 | reg_loss: 2.3445 | reward: -0.9619 | rev_kl: 2.1390 | stu_lens: 95.6250 | mixed_lens: 109.1250 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.365 | step time: 0.365 train | epoch 7 | inner iter: 2/ 4 | ppo epoch: 0/ 4 | global iter: 115/ 160| tot_loss: 7.3387 | rl_loss: 4.1205 | pt_loss: 3.2182 | pg_loss: 1.3736 | reg_loss: 2.7469 | reward: -1.4561 | rev_kl: 2.0730 | stu_lens: 69.4375 | mixed_lens: 62.6875 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.367 | step time: 0.367 train | epoch 7 | inner iter: 3/ 4 | ppo epoch: 0/ 4 | global iter: 116/ 160| tot_loss: 6.7367 | rl_loss: 3.5925 | pt_loss: 3.1442 | pg_loss: 1.0952 | reg_loss: 2.4973 | reward: -0.4773 | rev_kl: 2.5111 | stu_lens: 62.1875 | mixed_lens: 64.3125 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.367 | step time: 0.367 train | epoch 7 | inner iter: 0/ 4 | ppo epoch: 1/ 4 | global iter: 117/ 160| tot_loss: 5.7252 | rl_loss: 2.5792 | pt_loss: 3.1460 | pg_loss: 0.3592 | reg_loss: 2.2200 | reward: -2.0254 | rev_kl: 2.4545 | stu_lens: 85.4375 | mixed_lens: 58.5000 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.368 | step time: 0.368 train | epoch 7 | inner iter: 1/ 4 | ppo epoch: 1/ 4 | global iter: 118/ 160| tot_loss: 5.7815 | rl_loss: 2.7799 | pt_loss: 3.0017 | pg_loss: 0.7179 | reg_loss: 2.0620 | reward: -0.6519 | rev_kl: 1.8954 | stu_lens: 51.0625 | mixed_lens: 103.4375 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.365 | step time: 0.365 train | epoch 7 | inner iter: 2/ 4 | ppo epoch: 1/ 4 | global iter: 119/ 160| tot_loss: 6.6016 | rl_loss: 3.6464 | pt_loss: 2.9552 | pg_loss: 0.9027 | reg_loss: 2.7437 | reward: -0.4995 | rev_kl: 2.6690 | stu_lens: 84.5000 | mixed_lens: 80.6875 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.364 | step time: 0.364 eval | rougeL: 23.271 | exact_match: 3.100 | rev_kl: 1.938 | lens: 72.509 | pt_loss: 3.022 | lm_loss: 3.471 | kd_loss: 2.572 train | epoch 7 | inner iter: 3/ 4 | ppo 
epoch: 1/ 4 | global iter: 120/ 160| tot_loss: 7.6094 | rl_loss: 4.4114 | pt_loss: 3.1981 | pg_loss: 1.3944 | reg_loss: 3.0169 | reward: -1.1670 | rev_kl: 2.0917 | stu_lens: 60.8750 | mixed_lens: 69.6250 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.364 | step time: 0.364 train | epoch 7 | inner iter: 0/ 4 | ppo epoch: 2/ 4 | global iter: 121/ 160| tot_loss: 6.7497 | rl_loss: 3.5124 | pt_loss: 3.2373 | pg_loss: 0.8017 | reg_loss: 2.7107 | reward: -1.1982 | rev_kl: 2.9580 | stu_lens: 87.1250 | mixed_lens: 86.0625 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.364 | step time: 0.364 train | epoch 7 | inner iter: 1/ 4 | ppo epoch: 2/ 4 | global iter: 122/ 160| tot_loss: 7.0310 | rl_loss: 3.8811 | pt_loss: 3.1499 | pg_loss: 1.7440 | reg_loss: 2.1371 | reward: -0.7183 | rev_kl: 1.8985 | stu_lens: 51.6875 | mixed_lens: 64.1875 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.364 | step time: 0.364 train | epoch 7 | inner iter: 2/ 4 | ppo epoch: 2/ 4 | global iter: 123/ 160| tot_loss: 6.0254 | rl_loss: 2.6467 | pt_loss: 3.3787 | pg_loss: 0.6445 | reg_loss: 2.0022 | reward: -1.2080 | rev_kl: 2.2109 | stu_lens: 58.8750 | mixed_lens: 77.5625 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.363 | step time: 0.363 train | epoch 7 | inner iter: 3/ 4 | ppo epoch: 2/ 4 | global iter: 124/ 160| tot_loss: 6.5754 | rl_loss: 3.3731 | pt_loss: 3.2023 | pg_loss: 0.5857 | reg_loss: 2.7874 | reward: -1.2207 | rev_kl: 2.0432 | stu_lens: 84.1875 | mixed_lens: 84.4375 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.364 | step time: 0.364 train | epoch 7 | inner iter: 0/ 4 | ppo epoch: 3/ 4 | global iter: 125/ 160| tot_loss: 6.8055 | rl_loss: 3.5715 | pt_loss: 3.2340 | pg_loss: 1.0180 | reg_loss: 2.5534 | reward: -1.1279 | rev_kl: 1.9385 | stu_lens: 69.0000 | mixed_lens: 85.5000 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.365 | step time: 0.365 train | epoch 7 | inner iter: 1/ 4 | ppo epoch: 3/ 4 | global iter: 126/ 160| tot_loss: 5.9534 | rl_loss: 2.5188 | pt_loss: 3.4346 | pg_loss: 0.2802 | reg_loss: 2.2386 
| reward: -1.1680 | rev_kl: 2.4618 | stu_lens: 107.8750 | mixed_lens: 100.6875 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.364 | step time: 0.364 train | epoch 7 | inner iter: 2/ 4 | ppo epoch: 3/ 4 | global iter: 127/ 160| tot_loss: 7.2144 | rl_loss: 3.9797 | pt_loss: 3.2347 | pg_loss: 1.2139 | reg_loss: 2.7658 | reward: -0.5410 | rev_kl: 2.7889 | stu_lens: 54.1875 | mixed_lens: 78.4375 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.365 | step time: 0.365 train | epoch 7 | inner iter: 3/ 4 | ppo epoch: 3/ 4 | global iter: 128/ 160| tot_loss: 6.2720 | rl_loss: 3.2531 | pt_loss: 3.0189 | pg_loss: 0.9293 | reg_loss: 2.3237 | reward: -1.5078 | rev_kl: 1.9215 | stu_lens: 50.8125 | mixed_lens: 47.6250 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.364 | step time: 0.364 train | epoch 8 | inner iter: 0/ 4 | ppo epoch: 0/ 4 | global iter: 129/ 160| tot_loss: 6.8848 | rl_loss: 3.7039 | pt_loss: 3.1809 | pg_loss: 1.1745 | reg_loss: 2.5294 | reward: -1.6123 | rev_kl: 1.9468 | stu_lens: 60.6250 | mixed_lens: 52.2500 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.469 | step time: 0.469 train | epoch 8 | inner iter: 1/ 4 | ppo epoch: 0/ 4 | global iter: 130/ 160| tot_loss: 6.3558 | rl_loss: 3.3180 | pt_loss: 3.0378 | pg_loss: 0.6808 | reg_loss: 2.6372 | reward: -1.6797 | rev_kl: 3.0009 | stu_lens: 87.9375 | mixed_lens: 93.1250 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.366 | step time: 0.366 train | epoch 8 | inner iter: 2/ 4 | ppo epoch: 0/ 4 | global iter: 131/ 160| tot_loss: 7.8841 | rl_loss: 5.1608 | pt_loss: 2.7233 | pg_loss: 2.1873 | reg_loss: 2.9735 | reward: -1.1123 | rev_kl: 1.7497 | stu_lens: 92.0625 | mixed_lens: 62.5000 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.367 | step time: 0.367 train | epoch 8 | inner iter: 3/ 4 | ppo epoch: 0/ 4 | global iter: 132/ 160| tot_loss: 5.6821 | rl_loss: 2.5853 | pt_loss: 3.0968 | pg_loss: 0.9442 | reg_loss: 1.6410 | reward: -1.1426 | rev_kl: 2.6260 | stu_lens: 68.4375 | mixed_lens: 59.0000 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.365 
| step time: 0.365 train | epoch 8 | inner iter: 0/ 4 | ppo epoch: 1/ 4 | global iter: 133/ 160| tot_loss: 6.3715 | rl_loss: 3.2097 | pt_loss: 3.1617 | pg_loss: 0.7776 | reg_loss: 2.4321 | reward: -1.5186 | rev_kl: 1.6848 | stu_lens: 92.0625 | mixed_lens: 72.5000 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.366 | step time: 0.366 train | epoch 8 | inner iter: 1/ 4 | ppo epoch: 1/ 4 | global iter: 134/ 160| tot_loss: 6.7943 | rl_loss: 3.4563 | pt_loss: 3.3380 | pg_loss: 1.2914 | reg_loss: 2.1649 | reward: -1.4033 | rev_kl: 2.7789 | stu_lens: 61.7500 | mixed_lens: 59.4375 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.365 | step time: 0.365 train | epoch 8 | inner iter: 2/ 4 | ppo epoch: 1/ 4 | global iter: 135/ 160| tot_loss: 7.4445 | rl_loss: 4.1550 | pt_loss: 3.2895 | pg_loss: 1.2189 | reg_loss: 2.9361 | reward: -1.2891 | rev_kl: 2.4576 | stu_lens: 79.0625 | mixed_lens: 69.6875 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.365 | step time: 0.365 train | epoch 8 | inner iter: 3/ 4 | ppo epoch: 1/ 4 | global iter: 136/ 160| tot_loss: 7.1016 | rl_loss: 3.9315 | pt_loss: 3.1701 | pg_loss: 1.3390 | reg_loss: 2.5925 | reward: -1.3369 | rev_kl: 2.4021 | stu_lens: 76.1875 | mixed_lens: 65.2500 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.367 | step time: 0.367 train | epoch 8 | inner iter: 0/ 4 | ppo epoch: 2/ 4 | global iter: 137/ 160| tot_loss: 8.2712 | rl_loss: 4.9352 | pt_loss: 3.3359 | pg_loss: 1.9644 | reg_loss: 2.9708 | reward: -1.5801 | rev_kl: 2.3652 | stu_lens: 58.3125 | mixed_lens: 75.1875 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.366 | step time: 0.366 train | epoch 8 | inner iter: 1/ 4 | ppo epoch: 2/ 4 | global iter: 138/ 160| tot_loss: 5.7539 | rl_loss: 2.8656 | pt_loss: 2.8883 | pg_loss: 0.7364 | reg_loss: 2.1291 | reward: -0.5889 | rev_kl: 2.5494 | stu_lens: 90.8750 | mixed_lens: 67.1250 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.364 | step time: 0.364 train | epoch 8 | inner iter: 2/ 4 | ppo epoch: 2/ 4 | global iter: 139/ 160| tot_loss: 6.5031 | rl_loss: 
3.8434 | pt_loss: 2.6597 | pg_loss: 1.2734 | reg_loss: 2.5700 | reward: -1.6953 | rev_kl: 1.6881 | stu_lens: 99.3125 | mixed_lens: 56.5625 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.364 | step time: 0.364 eval | rougeL: 24.310 | exact_match: 3.100 | rev_kl: 1.879 | lens: 70.543 | pt_loss: 3.021 | lm_loss: 3.473 | kd_loss: 2.569 train | epoch 8 | inner iter: 3/ 4 | ppo epoch: 2/ 4 | global iter: 140/ 160| tot_loss: 6.6860 | rl_loss: 3.4814 | pt_loss: 3.2046 | pg_loss: 0.8524 | reg_loss: 2.6290 | reward: -1.6836 | rev_kl: 2.7207 | stu_lens: 60.5625 | mixed_lens: 68.0000 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.363 | step time: 0.363 train | epoch 8 | inner iter: 0/ 4 | ppo epoch: 3/ 4 | global iter: 141/ 160| tot_loss: 6.2668 | rl_loss: 3.1277 | pt_loss: 3.1390 | pg_loss: 0.6834 | reg_loss: 2.4444 | reward: -1.5801 | rev_kl: 1.7998 | stu_lens: 124.5625 | mixed_lens: 60.3750 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.365 | step time: 0.365 train | epoch 8 | inner iter: 1/ 4 | ppo epoch: 3/ 4 | global iter: 142/ 160| tot_loss: 6.6961 | rl_loss: 3.5944 | pt_loss: 3.1017 | pg_loss: 1.6158 | reg_loss: 1.9787 | reward: -1.8818 | rev_kl: 2.7080 | stu_lens: 40.5625 | mixed_lens: 45.5000 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.366 | step time: 0.366 train | epoch 8 | inner iter: 2/ 4 | ppo epoch: 3/ 4 | global iter: 143/ 160| tot_loss: 6.8511 | rl_loss: 3.7522 | pt_loss: 3.0990 | pg_loss: 1.1136 | reg_loss: 2.6385 | reward: -0.7070 | rev_kl: 2.4555 | stu_lens: 66.6250 | mixed_lens: 76.8125 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.365 | step time: 0.365 train | epoch 8 | inner iter: 3/ 4 | ppo epoch: 3/ 4 | global iter: 144/ 160| tot_loss: 6.9484 | rl_loss: 3.9231 | pt_loss: 3.0253 | pg_loss: 1.3164 | reg_loss: 2.6066 | reward: -1.3789 | rev_kl: 2.3601 | stu_lens: 77.3125 | mixed_lens: 84.1875 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.365 | step time: 0.365 train | epoch 9 | inner iter: 0/ 4 | ppo epoch: 0/ 4 | global iter: 145/ 160| tot_loss: 5.6491 | rl_loss: 
2.6947 | pt_loss: 2.9544 | pg_loss: 0.7071 | reg_loss: 1.9876 | reward: -1.4395 | rev_kl: 2.0307 | stu_lens: 56.5000 | mixed_lens: 48.5625 | lr: 5.0000e-06 | scale: 1024.00 | time: 3.214 | step time: 3.214 train | epoch 9 | inner iter: 1/ 4 | ppo epoch: 0/ 4 | global iter: 146/ 160| tot_loss: 7.5079 | rl_loss: 4.1240 | pt_loss: 3.3838 | pg_loss: 1.5348 | reg_loss: 2.5892 | reward: -1.1895 | rev_kl: 2.0244 | stu_lens: 124.4375 | mixed_lens: 77.6875 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.365 | step time: 0.365 train | epoch 9 | inner iter: 2/ 4 | ppo epoch: 0/ 4 | global iter: 147/ 160| tot_loss: 4.8806 | rl_loss: 1.7228 | pt_loss: 3.1578 | pg_loss: 0.6120 | reg_loss: 1.1108 | reward: -0.9238 | rev_kl: 1.7947 | stu_lens: 77.2500 | mixed_lens: 82.1875 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.365 | step time: 0.365 train | epoch 9 | inner iter: 3/ 4 | ppo epoch: 0/ 4 | global iter: 148/ 160| tot_loss: 8.5711 | rl_loss: 5.4525 | pt_loss: 3.1186 | pg_loss: 3.0341 | reg_loss: 2.4184 | reward: -2.3066 | rev_kl: 3.0048 | stu_lens: 68.3125 | mixed_lens: 45.9375 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.365 | step time: 0.365 train | epoch 9 | inner iter: 0/ 4 | ppo epoch: 1/ 4 | global iter: 149/ 160| tot_loss: 5.6774 | rl_loss: 2.5344 | pt_loss: 3.1429 | pg_loss: 0.6879 | reg_loss: 1.8465 | reward: -1.2559 | rev_kl: 1.8338 | stu_lens: 78.3125 | mixed_lens: 72.5000 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.365 | step time: 0.365 train | epoch 9 | inner iter: 1/ 4 | ppo epoch: 1/ 4 | global iter: 150/ 160| tot_loss: 4.7362 | rl_loss: 1.9195 | pt_loss: 2.8167 | pg_loss: 0.7387 | reg_loss: 1.1808 | reward: -1.2725 | rev_kl: 2.1225 | stu_lens: 124.2500 | mixed_lens: 71.3125 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.365 | step time: 0.365 train | epoch 9 | inner iter: 2/ 4 | ppo epoch: 1/ 4 | global iter: 151/ 160| tot_loss: 7.6812 | rl_loss: 4.5739 | pt_loss: 3.1073 | pg_loss: 2.2049 | reg_loss: 2.3690 | reward: -2.2363 | rev_kl: 2.2422 | stu_lens: 54.5000 | 
mixed_lens: 33.8750 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.364 | step time: 0.364 train | epoch 9 | inner iter: 3/ 4 | ppo epoch: 1/ 4 | global iter: 152/ 160| tot_loss: 6.8422 | rl_loss: 3.4832 | pt_loss: 3.3590 | pg_loss: 1.2115 | reg_loss: 2.2717 | reward: -1.0957 | rev_kl: 2.6562 | stu_lens: 69.4375 | mixed_lens: 76.6875 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.365 | step time: 0.365 train | epoch 9 | inner iter: 0/ 4 | ppo epoch: 2/ 4 | global iter: 153/ 160| tot_loss: 4.6541 | rl_loss: 1.4562 | pt_loss: 3.1979 | pg_loss: 0.4877 | reg_loss: 0.9685 | reward: -1.6797 | rev_kl: 3.0440 | stu_lens: 47.3750 | mixed_lens: 62.5625 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.368 | step time: 0.368 train | epoch 9 | inner iter: 1/ 4 | ppo epoch: 2/ 4 | global iter: 154/ 160| tot_loss: 6.3939 | rl_loss: 3.1886 | pt_loss: 3.2053 | pg_loss: 1.2633 | reg_loss: 1.9253 | reward: -1.2373 | rev_kl: 1.8693 | stu_lens: 96.5625 | mixed_lens: 56.4375 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.365 | step time: 0.365 train | epoch 9 | inner iter: 2/ 4 | ppo epoch: 2/ 4 | global iter: 155/ 160| tot_loss: 6.1353 | rl_loss: 3.2053 | pt_loss: 2.9300 | pg_loss: 1.1344 | reg_loss: 2.0709 | reward: -1.0049 | rev_kl: 1.9559 | stu_lens: 111.3125 | mixed_lens: 89.5625 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.365 | step time: 0.365 train | epoch 9 | inner iter: 3/ 4 | ppo epoch: 2/ 4 | global iter: 156/ 160| tot_loss: 7.0041 | rl_loss: 3.8199 | pt_loss: 3.1842 | pg_loss: 1.5461 | reg_loss: 2.2738 | reward: -1.9375 | rev_kl: 1.9854 | stu_lens: 71.2500 | mixed_lens: 45.8125 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.365 | step time: 0.365 train | epoch 9 | inner iter: 0/ 4 | ppo epoch: 3/ 4 | global iter: 157/ 160| tot_loss: 7.0166 | rl_loss: 4.0733 | pt_loss: 2.9433 | pg_loss: 1.5655 | reg_loss: 2.5078 | reward: -1.6660 | rev_kl: 2.0529 | stu_lens: 96.2500 | mixed_lens: 69.9375 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.366 | step time: 0.366 train | epoch 9 | inner iter: 1/ 4 | 
ppo epoch: 3/ 4 | global iter: 158/ 160| tot_loss: 4.9263 | rl_loss: 1.7438 | pt_loss: 3.1825 | pg_loss: 0.5495 | reg_loss: 1.1943 | reward: -1.8271 | rev_kl: 2.2048 | stu_lens: 87.6250 | mixed_lens: 62.1250 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.365 | step time: 0.365 train | epoch 9 | inner iter: 2/ 4 | ppo epoch: 3/ 4 | global iter: 159/ 160| tot_loss: 6.7466 | rl_loss: 3.4755 | pt_loss: 3.2712 | pg_loss: 1.6377 | reg_loss: 1.8378 | reward: -0.9443 | rev_kl: 2.1583 | stu_lens: 80.3125 | mixed_lens: 66.0000 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.368 | step time: 0.368 eval | rougeL: 24.353 | exact_match: 3.100 | rev_kl: 1.964 | lens: 74.957 | pt_loss: 3.017 | lm_loss: 3.468 | kd_loss: 2.567 train | epoch 9 | inner iter: 3/ 4 | ppo epoch: 3/ 4 | global iter: 160/ 160| tot_loss: 5.7856 | rl_loss: 2.4925 | pt_loss: 3.2931 | pg_loss: 0.9332 | reg_loss: 1.5593 | reward: -1.4229 | rev_kl: 2.4387 | stu_lens: 62.3125 | mixed_lens: 56.3125 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.367 | step time: 0.367

wutaiqiang commented 11 months ago

The max rougeL is 24.353, and when I evaluate the model on dolly I get 22.40, which is far lower than 24.6.

wutaiqiang commented 11 months ago

@t1101675 could you please help me figure out what I am doing wrong? Thanks.

the cmd is:

#! /bin/bash

MASTER_ADDR=localhost
MASTER_PORT=${2-2012}
NNODES=1
NODE_RANK=0
GPUS_PER_NODE=${3-4}

DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE \
                  --nnodes $NNODES \
                  --node_rank $NODE_RANK \
                  --master_addr $MASTER_ADDR \
                  --master_port $MASTER_PORT"

# model
BASE_PATH=${1-"xxxx"}
CKPT_NAME="base-init"
CKPT="${BASE_PATH}/results/gpt2/train/minillm_init/gpt2-base"
TEACHER_CKPT_NAME="xlarge-sft"
TEACHER_CKPT="${BASE_PATH}/results/gpt2/train/sft/gpt2-xlarge/"

# data
PROMPT_DATA_DIR="${BASE_PATH}/processed_data/dolly/prompt/gpt2/"
LM_DATA_DIR="${BASE_PATH}/processed_data/openwebtext/gpt2/512/10M/"

# runtime
SAVE_PATH="${BASE_PATH}/results/gpt2/train/minillm/"

# hp
GRAD_ACC=1
BATCH_SIZE=4
CHUNK_SIZE=16

OPTS=""

# model
OPTS+=" --base-path ${BASE_PATH}"
OPTS+=" --model-path ${CKPT}"
OPTS+=" --teacher-model-path ${TEACHER_CKPT}"
OPTS+=" --ckpt-name ${CKPT_NAME}"
OPTS+=" --teacher-ckpt-name ${TEACHER_CKPT_NAME}"
OPTS+=" --n-gpu ${GPUS_PER_NODE}"
OPTS+=" --teacher-model-fp16"
OPTS+=" --gradient-checkpointing"

# data
OPTS+=" --prompt-data-dir ${PROMPT_DATA_DIR}"
OPTS+=" --lm-data-dir ${LM_DATA_DIR}"
OPTS+=" --dev-num 1000"
OPTS+=" --num-workers 0"

# hp
OPTS+=" --epochs 10"
OPTS+=" --total-iters 5000"
OPTS+=" --kd-ratio 0.5"
OPTS+=" --batch-size ${BATCH_SIZE}"
OPTS+=" --lr 5e-6"
OPTS+=" --lr-min 5e-6"
OPTS+=" --gradient-accumulation-steps ${GRAD_ACC}"
OPTS+=" --max-length 512"
OPTS+=" --max-prompt-length 256"
OPTS+=" --warmup-iters 100"

# runtime
OPTS+=" --save ${SAVE_PATH}"
OPTS+=" --seed 10"
OPTS+=" --seed-ppo 42"
OPTS+=" --seed-lm 7"
OPTS+=" --save-interval 100"
OPTS+=" --eval-interval 20"
OPTS+=" --log-interval 1"
OPTS+=" --mid-log-num 1"

# ppo
OPTS+=" --type minillm"
OPTS+=" --ppo-epochs 4"
OPTS+=" --num-rollouts 16"
OPTS+=" --chunk-size ${CHUNK_SIZE}"

# minillm
OPTS+=" --length-norm"
OPTS+=" --single-step-reg"
OPTS+=" --teacher-mixed-alpha 0.2"

# reward
OPTS+=" --reward-scaling 0.5"
OPTS+=" --cliprange-reward 100"

# gen
OPTS+=" --do-sample"
OPTS+=" --top-k 0"
OPTS+=" --top-p 1.0"
OPTS+=" --temperature 1.0"

# deepspeed
OPTS+=" --deepspeed"
OPTS+=" --deepspeed_config ${BASE_PATH}/configs/deepspeed/ds_config.json"

export NCCL_DEBUG=""
export WANDB_DISABLED=True
export TF_CPP_MIN_LOG_LEVEL=3
export PYTHONPATH=${BASE_PATH}
CMD="torchrun ${DISTRIBUTED_ARGS} ${BASE_PATH}/train_minillm.py ${OPTS} $@"

echo ${CMD}
echo "PYTHONPATH=${PYTHONPATH}"
mkdir -p ${SAVE_PATH}
${CMD}

wutaiqiang commented 11 months ago

I changed the batch size to 16, so that 4 GPUs * 16 = 64. With that setting we get 19.82 on dolly, which is worse.

t1101675 commented 11 months ago
  1. --epochs should be larger (> 300); the log will then show total global iters = 5000.
  2. --num-rollouts should be larger (num-rollouts * num-gpus should be 256).
  3. You can also increase --chunk-size for more efficient training.

We have updated our code to make these hyper-parameters easier to set. You can ignore 1 & 2 if you use the current scripts.
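
As a rough check of points 1 and 2, the iteration counts in the log above are consistent with each epoch running (num-rollouts / batch-size) inner iterations for each of the ppo-epochs passes. This is an inference from the log, not something confirmed against the training code, and the helper names below are made up for illustration:

```python
import math


def global_iters_per_epoch(num_rollouts, batch_size, ppo_epochs, grad_acc=1):
    """Assumed counting rule, inferred from the log: inner iters x ppo epochs."""
    inner_iters = num_rollouts // batch_size
    return inner_iters * ppo_epochs // grad_acc


def epochs_needed(total_iters, iters_per_epoch):
    """Smallest number of epochs that reaches total_iters."""
    return math.ceil(total_iters / iters_per_epoch)


# Settings from the posted script: --num-rollouts 16, --batch-size 4, --ppo-epochs 4
per_epoch = global_iters_per_epoch(num_rollouts=16, batch_size=4, ppo_epochs=4)
iters_with_10_epochs = per_epoch * 10              # matches "global iter: 160/ 160"
epochs_for_5000 = epochs_needed(5000, per_epoch)   # why --epochs must exceed 300

# Point 2: num-rollouts * num-gpus should be 256, here with 4 GPUs
rollouts_per_gpu = 256 // 4

print(per_epoch, iters_with_10_epochs, epochs_for_5000, rollouts_per_gpu)
```

Under this assumption the original settings give 16 global iters per epoch, so 10 epochs stop at 160 instead of 5000, and with 4 GPUs point 2 corresponds to --num-rollouts 64.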

wutaiqiang commented 11 months ago

Thanks, let me try the new hyper-parameters.

wutaiqiang commented 11 months ago

Thanks for your kind help. On dolly, mine ~= paper. On UnNI, selfInst, and S-NI, mine < paper. On Vicuna, mine > paper. I think the random seeds may be the cause.

Thanks again. BTW, pt_loss is computed as

(1 - self.args.kd_ratio) * lm_loss + self.args.kd_ratio * distil_loss

and the distil_loss is:

teacher_probs = F.softmax(teacher_logits, dim=-1, dtype=torch.float32)
inf_mask = torch.isinf(logits)
logprobs = F.log_softmax(logits, dim=-1, dtype=torch.float32)
prod_probs = torch.masked_fill(teacher_probs * logprobs, inf_mask, 0)
x = torch.sum(prod_probs, dim=-1).view(-1)
distil_loss = -torch.sum(x * loss_mask.view(-1), dim=0) / torch.sum(loss_mask.view(-1), dim=0)  # ! risk of division by zero

This distil_loss is a forward KL. Meanwhile, I noticed that you use the get_rev_kl function in sampler.py, which is a reverse KL. Could you please explain why there are two distillation losses?

thanks very much!
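
The division-by-zero risk flagged in the comment above would arise if loss_mask selected no tokens at all. A minimal plain-Python sketch of one possible guard follows; masked_mean is a hypothetical helper, not a function from the repository, and in PyTorch the analogous step would be clamping the denominator before dividing:

```python
def masked_mean(values, mask, eps=1e-8):
    """Average `values` where mask == 1; return 0.0 when the mask is empty."""
    total = sum(v * m for v, m in zip(values, mask))
    count = sum(mask)
    # Clamping the denominator avoids a ZeroDivisionError for an all-zero mask.
    return total / max(count, eps)


per_token_loss = [2.0, 4.0, 6.0]
mean_over_two = masked_mean(per_token_loss, [1, 1, 0])   # averages 2.0 and 4.0
mean_over_none = masked_mean(per_token_loss, [0, 0, 0])  # empty mask, no crash
print(mean_over_two, mean_over_none)
```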

t1101675 commented 11 months ago

The forward KL loss acts as a regularizer that prevents the model from collapsing to a single mode when trained with reverse KL. Its effect is controlled by args.kd_ratio, and it is optional: we find that the result does not change much across different values of args.kd_ratio (kd_ratio=0 means only lm_loss is used).
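
To make the distinction concrete, here is a small pure-Python sketch on a toy three-token distribution. The numbers are arbitrary and the function names are illustrative, not taken from the repository. It checks that a teacher-weighted cross-entropy of the kind posted above equals the forward KL plus the teacher entropy (a constant with respect to the student), and shows how kd_ratio mixes it with lm_loss:

```python
import math


def cross_entropy(p, q):
    """-sum_i p_i * log(q_i): same form as the posted distil_loss for one token."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


def kl(p, q):
    """KL(p || q) = sum_i p_i * log(p_i / q_i)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def pt_loss(lm_loss, distil_loss, kd_ratio):
    """Mixing rule quoted in the thread."""
    return (1 - kd_ratio) * lm_loss + kd_ratio * distil_loss


teacher = [0.5, 0.4, 0.1]    # toy teacher next-token distribution (made up)
student = [0.8, 0.15, 0.05]  # toy student next-token distribution (made up)

forward_kl = kl(teacher, student)  # expectation under the teacher
reverse_kl = kl(student, teacher)  # expectation under the student
teacher_entropy = cross_entropy(teacher, teacher)
ce = cross_entropy(teacher, student)

# Cross-entropy differs from forward KL only by the teacher entropy.
print(abs(ce - (forward_kl + teacher_entropy)) < 1e-9)
print(forward_kl, reverse_kl)
print(pt_loss(lm_loss=3.0, distil_loss=2.0, kd_ratio=0.0))
```

Because the teacher entropy does not depend on the student, minimizing this cross-entropy and minimizing the forward KL give the same gradients for the student.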

AInkCode commented 11 months ago

@wutaiqiang @t1101675 hi, bro! In which file can I set the parameters for the distillation, for example to distill gpt2-1.5B -> gpt2-120M? Thanks!

wutaiqiang commented 11 months ago

https://github.com/microsoft/LMOps/blob/38f67def245e61cdbcf06c61ad98c030c6f2f505/minillm/scripts/gpt2/minillm/train_base_xl.sh

@AInkCode