Closed by wutaiqiang 11 months ago
Then I get the log:
eval | rougeL: 21.200 | exact_match: 2.600 | rev_kl: 2.443 | lens: 58.775 | pt_loss: 3.014 | lm_loss: 3.430 | kd_loss: 2.598
train | epoch 0 | inner iter: 0/ 4 | ppo epoch: 0/ 4 | global iter: 1/ 160| tot_loss: 6.6411 | rl_loss: 3.4057 | pt_loss: 3.2354 | pg_loss: 1.6204 | reg_loss: 1.7853 | reward: -1.3770 | rev_kl: 1.8445 | stu_lens: 36.8125 | mixed_lens: 56.6875 | lr: 5.0000e-08 | scale: 2048.00 | time: 0.501 | step time: 0.501
train | epoch 0 | inner iter: 1/ 4 | ppo epoch: 0/ 4 | global iter: 2/ 160| tot_loss: 7.3838 | rl_loss: 4.4417 | pt_loss: 2.9421 | pg_loss: 1.1692 | reg_loss: 3.2725 | reward: -2.1406 | rev_kl: 2.9407 | stu_lens: 36.2500 | mixed_lens: 43.1250 | lr: 5.0000e-08 | scale: 2048.00 | time: 0.493 | step time: 0.493
train | epoch 0 | inner iter: 2/ 4 | ppo epoch: 0/ 4 | global iter: 3/ 160| tot_loss: 6.9746 | rl_loss: 3.9665 | pt_loss: 3.0081 | pg_loss: 0.8613 | reg_loss: 3.1053 | reward: -2.0117 | rev_kl: 2.8732 | stu_lens: 90.2500 | mixed_lens: 59.3125 | lr: 1.0000e-07 | scale: 2048.00 | time: 0.366 | step time: 0.366
train | epoch 0 | inner iter: 3/ 4 | ppo epoch: 0/ 4 | global iter: 4/ 160| tot_loss: 6.4336 | rl_loss: 3.7142 | pt_loss: 2.7194 | pg_loss: 0.9492 | reg_loss: 2.7650 | reward: -2.2148 | rev_kl: 2.9595 | stu_lens: 44.6875 | mixed_lens: 58.9375 | lr: 1.5000e-07 | scale: 2048.00 | time: 0.364 | step time: 0.364
train | epoch 0 | inner iter: 0/ 4 | ppo epoch: 1/ 4 | global iter: 5/ 160| tot_loss: 4.7745 | rl_loss: 1.7151 | pt_loss: 3.0595 | pg_loss: 0.6897 | reg_loss: 1.0254 | reward: -2.3086 | rev_kl: 2.4704 | stu_lens: 41.1875 | mixed_lens: 50.1875 | lr: 2.0000e-07 | scale: 2048.00 | time: 0.365 | step time: 0.365
train | epoch 0 | inner iter: 1/ 4 | ppo epoch: 1/ 4 | global iter: 6/ 160| tot_loss: 8.4113 | rl_loss: 5.2663 | pt_loss: 3.1450 | pg_loss: 2.2961 | reg_loss: 2.9702 | reward: -1.4766 | rev_kl: 2.3114 | stu_lens: 67.1250 | mixed_lens: 56.1250 | lr: 2.0000e-07 | scale: 2048.00 | time: 0.360 | step time: 0.360
train | epoch 0 | inner iter: 2/ 4 | ppo epoch: 1/ 4 | global iter: 7/ 160| tot_loss: 7.3991 | rl_loss: 4.3857 | pt_loss: 3.0135 | pg_loss: 1.0776 | reg_loss: 3.3081 | reward: -1.8721 | rev_kl: 2.6794 | stu_lens: 47.4375 | mixed_lens: 63.8750 | lr: 2.5000e-07 | scale: 2048.00 | time: 0.365 | step time: 0.365
train | epoch 0 | inner iter: 3/ 4 | ppo epoch: 1/ 4 | global iter: 8/ 160| tot_loss: 6.8751 | rl_loss: 3.5853 | pt_loss: 3.2898 | pg_loss: 0.7668 | reg_loss: 2.8186 | reward: -2.0840 | rev_kl: 3.1566 | stu_lens: 52.2500 | mixed_lens: 47.8750 | lr: 3.0000e-07 | scale: 2048.00 | time: 0.364 | step time: 0.364
train | epoch 0 | inner iter: 0/ 4 | ppo epoch: 2/ 4 | global iter: 9/ 160| tot_loss: 5.8591 | rl_loss: 2.6858 | pt_loss: 3.1733 | pg_loss: 0.8090 | reg_loss: 1.8767 | reward: -1.3281 | rev_kl: 2.4784 | stu_lens: 55.5625 | mixed_lens: 49.1250 | lr: 3.5000e-07 | scale: 2048.00 | time: 0.363 | step time: 0.363
train | epoch 0 | inner iter: 1/ 4 | ppo epoch: 2/ 4 | global iter: 10/ 160| tot_loss: 7.6280 | rl_loss: 4.5530 | pt_loss: 3.0750 | pg_loss: 1.2175 | reg_loss: 3.3355 | reward: -2.3711 | rev_kl: 2.9943 | stu_lens: 52.1875 | mixed_lens: 60.1250 | lr: 4.0000e-07 | scale: 2048.00 | time: 0.366 | step time: 0.366
train | epoch 0 | inner iter: 2/ 4 | ppo epoch: 2/ 4 | global iter: 11/ 160| tot_loss: 8.0311 | rl_loss: 4.8438 | pt_loss: 3.1873 | pg_loss: 1.5827 | reg_loss: 3.2611 | reward: -1.7314 | rev_kl: 2.2585 | stu_lens: 34.9375 | mixed_lens: 44.1875 | lr: 4.5000e-07 | scale: 2048.00 | time: 0.363 | step time: 0.363
train | epoch 0 | inner iter: 3/ 4 | ppo epoch: 2/ 4 | global iter: 12/ 160| tot_loss: 7.4314 | rl_loss: 4.3438 | pt_loss: 3.0876 | pg_loss: 1.1035 | reg_loss: 3.2402 | reward: -2.3125 | rev_kl: 2.8866 | stu_lens: 65.3125 | mixed_lens: 64.6250 | lr: 4.5000e-07 | scale: 2048.00 | time: 0.357 | step time: 0.357
train | epoch 0 | inner iter: 0/ 4 | ppo epoch: 3/ 4 | global iter: 13/ 160| tot_loss: 6.0105 | rl_loss: 2.9010 | pt_loss: 3.1095 | pg_loss: 0.7505 | reg_loss: 2.1505 | reward: -1.8770 | rev_kl: 2.5659 | stu_lens: 73.8750 | mixed_lens: 66.3750 | lr: 5.0000e-07 | scale: 2048.00 | time: 0.367 | step time: 0.367
train | epoch 0 | inner iter: 1/ 4 | ppo epoch: 3/ 4 | global iter: 14/ 160| tot_loss: 7.8888 | rl_loss: 4.7281 | pt_loss: 3.1607 | pg_loss: 1.4379 | reg_loss: 3.2901 | reward: -1.6855 | rev_kl: 2.4188 | stu_lens: 33.4375 | mixed_lens: 50.4375 | lr: 5.5000e-07 | scale: 2048.00 | time: 0.366 | step time: 0.366
train | epoch 0 | inner iter: 2/ 4 | ppo epoch: 3/ 4 | global iter: 15/ 160| tot_loss: 6.9703 | rl_loss: 3.8623 | pt_loss: 3.1080 | pg_loss: 1.1477 | reg_loss: 2.7146 | reward: -2.4434 | rev_kl: 2.8736 | stu_lens: 60.0625 | mixed_lens: 53.5000 | lr: 6.0000e-07 | scale: 2048.00 | time: 0.367 | step time: 0.367
train | epoch 0 | inner iter: 3/ 4 | ppo epoch: 3/ 4 | global iter: 16/ 160| tot_loss: 7.3526 | rl_loss: 4.0999 | pt_loss: 3.2527 | pg_loss: 1.0451 | reg_loss: 3.0547 | reward: -1.7373 | rev_kl: 2.7595 | stu_lens: 40.6250 | mixed_lens: 47.7500 | lr: 6.5000e-07 | scale: 2048.00 | time: 0.366 | step time: 0.366
train | epoch 1 | inner iter: 0/ 4 | ppo epoch: 0/ 4 | global iter: 17/ 160| tot_loss: 6.5685 | rl_loss: 3.4502 | pt_loss: 3.1183 | pg_loss: 0.7241 | reg_loss: 2.7262 | reward: -2.3555 | rev_kl: 2.2575 | stu_lens: 43.0000 | mixed_lens: 66.9375 | lr: 7.0000e-07 | scale: 2048.00 | time: 2.041 | step time: 2.041
train | epoch 1 | inner iter: 1/ 4 | ppo epoch: 0/ 4 | global iter: 18/ 160| tot_loss: 6.6946 | rl_loss: 3.5981 | pt_loss: 3.0966 | pg_loss: 1.5656 | reg_loss: 2.0325 | reward: -2.7012 | rev_kl: 2.4096 | stu_lens: 47.0000 | mixed_lens: 43.9375 | lr: 7.5000e-07 | scale: 2048.00 | time: 0.363 | step time: 0.363
train | epoch 1 | inner iter: 2/ 4 | ppo epoch: 0/ 4 | global iter: 19/ 160| tot_loss: 6.4938 | rl_loss: 3.5470 | pt_loss: 2.9468 | pg_loss: 1.2713 | reg_loss: 2.2757 | reward: -1.4473 | rev_kl: 3.6484 | stu_lens: 92.0625 | mixed_lens: 81.1250 | lr: 8.0000e-07 | scale: 2048.00 | time: 0.365 | step time: 0.365
eval | rougeL: 21.269 | exact_match: 3.100 | rev_kl: 2.411 | lens: 58.546 | pt_loss: 3.014 | lm_loss: 3.430 | kd_loss: 2.597
train | epoch 1 | inner iter: 3/ 4 | ppo epoch: 0/ 4 | global iter: 20/ 160| tot_loss: 5.8548 | rl_loss: 2.6765 | pt_loss: 3.1783 | pg_loss: 0.4331 | reg_loss: 2.2434 | reward: -1.2764 | rev_kl: 2.6105 | stu_lens: 74.5625 | mixed_lens: 65.2500 | lr: 8.5000e-07 | scale: 2048.00 | time: 0.367 | step time: 0.367
train | epoch 1 | inner iter: 0/ 4 | ppo epoch: 1/ 4 | global iter: 21/ 160| tot_loss: 6.9614 | rl_loss: 3.6870 | pt_loss: 3.2743 | pg_loss: 0.8843 | reg_loss: 2.8028 | reward: -2.1367 | rev_kl: 2.5854 | stu_lens: 53.1875 | mixed_lens: 60.6250 | lr: 9.0000e-07 | scale: 2048.00 | time: 0.364 | step time: 0.364
train | epoch 1 | inner iter: 1/ 4 | ppo epoch: 1/ 4 | global iter: 22/ 160| tot_loss: 6.1811 | rl_loss: 3.2499 | pt_loss: 2.9312 | pg_loss: 0.8077 | reg_loss: 2.4423 | reward: -0.9253 | rev_kl: 2.1572 | stu_lens: 66.3750 | mixed_lens: 81.0000 | lr: 9.5000e-07 | scale: 2048.00 | time: 0.366 | step time: 0.366
train | epoch 1 | inner iter: 2/ 4 | ppo epoch: 1/ 4 | global iter: 23/ 160| tot_loss: 5.1112 | rl_loss: 2.0210 | pt_loss: 3.0902 | pg_loss: 0.3897 | reg_loss: 1.6313 | reward: -1.9258 | rev_kl: 2.7548 | stu_lens: 78.8125 | mixed_lens: 62.5625 | lr: 1.0000e-06 | scale: 2048.00 | time: 0.365 | step time: 0.365
train | epoch 1 | inner iter: 3/ 4 | ppo epoch: 1/ 4 | global iter: 24/ 160| tot_loss: 6.7964 | rl_loss: 3.7504 | pt_loss: 3.0460 | pg_loss: 1.0821 | reg_loss: 2.6683 | reward: -2.7910 | rev_kl: 3.4285 | stu_lens: 58.2500 | mixed_lens: 53.0625 | lr: 1.0500e-06 | scale: 2048.00 | time: 0.364 | step time: 0.364
train | epoch 1 | inner iter: 0/ 4 | ppo epoch: 2/ 4 | global iter: 25/ 160| tot_loss: 6.5395 | rl_loss: 3.4161 | pt_loss: 3.1233 | pg_loss: 0.7052 | reg_loss: 2.7109 | reward: -1.4795 | rev_kl: 2.3824 | stu_lens: 52.5000 | mixed_lens: 82.5000 | lr: 1.1000e-06 | scale: 2048.00 | time: 0.364 | step time: 0.364
train | epoch 1 | inner iter: 1/ 4 | ppo epoch: 2/ 4 | global iter: 26/ 160| tot_loss: 6.3982 | rl_loss: 3.0625 | pt_loss: 3.3358 | pg_loss: 0.6713 | reg_loss: 2.3911 | reward: -1.3428 | rev_kl: 2.6665 | stu_lens: 94.7500 | mixed_lens: 75.9375 | lr: 1.1500e-06 | scale: 2048.00 | time: 0.364 | step time: 0.364
train | epoch 1 | inner iter: 2/ 4 | ppo epoch: 2/ 4 | global iter: 27/ 160| tot_loss: 5.5353 | rl_loss: 2.4059 | pt_loss: 3.1294 | pg_loss: 0.4697 | reg_loss: 1.9362 | reward: -2.4395 | rev_kl: 3.1394 | stu_lens: 69.0000 | mixed_lens: 57.8125 | lr: 1.2000e-06 | scale: 2048.00 | time: 0.371 | step time: 0.371
train | epoch 1 | inner iter: 3/ 4 | ppo epoch: 2/ 4 | global iter: 28/ 160| tot_loss: 8.0861 | rl_loss: 4.7915 | pt_loss: 3.2946 | pg_loss: 2.2176 | reg_loss: 2.5739 | reward: -2.5176 | rev_kl: 2.7376 | stu_lens: 40.3750 | mixed_lens: 41.0000 | lr: 1.2500e-06 | scale: 2048.00 | time: 0.365 | step time: 0.365
train | epoch 1 | inner iter: 0/ 4 | ppo epoch: 3/ 4 | global iter: 29/ 160| tot_loss: 6.5176 | rl_loss: 3.3472 | pt_loss: 3.1705 | pg_loss: 0.7683 | reg_loss: 2.5789 | reward: -1.5840 | rev_kl: 2.6401 | stu_lens: 75.7500 | mixed_lens: 59.9375 | lr: 1.3000e-06 | scale: 2048.00 | time: 0.366 | step time: 0.366
train | epoch 1 | inner iter: 1/ 4 | ppo epoch: 3/ 4 | global iter: 30/ 160| tot_loss: 6.6756 | rl_loss: 3.7310 | pt_loss: 2.9446 | pg_loss: 0.9884 | reg_loss: 2.7426 | reward: -2.5059 | rev_kl: 2.5335 | stu_lens: 58.5625 | mixed_lens: 63.5000 | lr: 1.3500e-06 | scale: 2048.00 | time: 0.367 | step time: 0.367
train | epoch 1 | inner iter: 2/ 4 | ppo epoch: 3/ 4 | global iter: 31/ 160| tot_loss: 6.0135 | rl_loss: 2.8948 | pt_loss: 3.1187 | pg_loss: 0.6172 | reg_loss: 2.2776 | reward: -1.7734 | rev_kl: 2.6676 | stu_lens: 63.7500 | mixed_lens: 67.0000 | lr: 1.4000e-06 | scale: 2048.00 | time: 0.364 | step time: 0.364
train | epoch 1 | inner iter: 3/ 4 | ppo epoch: 3/ 4 | global iter: 32/ 160| tot_loss: 5.7564 | rl_loss: 2.4033 | pt_loss: 3.3532 | pg_loss: 0.7497 | reg_loss: 1.6536 | reward: -1.9160 | rev_kl: 3.0847 | stu_lens: 58.5625 | mixed_lens: 66.8125 | lr: 1.4500e-06 | scale: 2048.00 | time: 0.367 | step time: 0.367
train | epoch 2 | inner iter: 0/ 4 | ppo epoch: 0/ 4 | global iter: 33/ 160| tot_loss: 7.4187 | rl_loss: 4.3375 | pt_loss: 3.0811 | pg_loss: 1.6900 | reg_loss: 2.6475 | reward: -1.6006 | rev_kl: 1.8722 | stu_lens: 51.9375 | mixed_lens: 56.6875 | lr: 1.5000e-06 | scale: 2048.00 | time: 0.596 | step time: 0.596
train | epoch 2 | inner iter: 1/ 4 | ppo epoch: 0/ 4 | global iter: 34/ 160| tot_loss: 6.8117 | rl_loss: 3.7359 | pt_loss: 3.0758 | pg_loss: 0.8222 | reg_loss: 2.9137 | reward: -1.2988 | rev_kl: 2.2187 | stu_lens: 68.2500 | mixed_lens: 71.1875 | lr: 1.5500e-06 | scale: 2048.00 | time: 0.365 | step time: 0.365
train | epoch 2 | inner iter: 2/ 4 | ppo epoch: 0/ 4 | global iter: 35/ 160| tot_loss: 6.7694 | rl_loss: 3.6155 | pt_loss: 3.1540 | pg_loss: 0.7880 | reg_loss: 2.8275 | reward: -1.1836 | rev_kl: 2.4549 | stu_lens: 62.8125 | mixed_lens: 78.6875 | lr: 1.6000e-06 | scale: 2048.00 | time: 0.369 | step time: 0.369
train | epoch 2 | inner iter: 3/ 4 | ppo epoch: 0/ 4 | global iter: 36/ 160| tot_loss: 7.3283 | rl_loss: 4.2418 | pt_loss: 3.0865 | pg_loss: 1.0094 | reg_loss: 3.2323 | reward: -1.7773 | rev_kl: 2.3933 | stu_lens: 98.4375 | mixed_lens: 64.7500 | lr: 1.6500e-06 | scale: 2048.00 | time: 0.369 | step time: 0.369
train | epoch 2 | inner iter: 0/ 4 | ppo epoch: 1/ 4 | global iter: 37/ 160| tot_loss: 7.0674 | rl_loss: 3.9013 | pt_loss: 3.1661 | pg_loss: 1.2715 | reg_loss: 2.6297 | reward: -1.2119 | rev_kl: 2.0438 | stu_lens: 77.8750 | mixed_lens: 48.2500 | lr: 1.7000e-06 | scale: 2048.00 | time: 0.367 | step time: 0.367
train | epoch 2 | inner iter: 1/ 4 | ppo epoch: 1/ 4 | global iter: 38/ 160| tot_loss: 8.3326 | rl_loss: 5.2152 | pt_loss: 3.1175 | pg_loss: 2.4236 | reg_loss: 2.7915 | reward: -1.5059 | rev_kl: 2.1339 | stu_lens: 50.8750 | mixed_lens: 53.8125 | lr: 1.7500e-06 | scale: 2048.00 | time: 0.363 | step time: 0.363
train | epoch 2 | inner iter: 2/ 4 | ppo epoch: 1/ 4 | global iter: 39/ 160| tot_loss: 6.6875 | rl_loss: 3.9191 | pt_loss: 2.7684 | pg_loss: 1.0256 | reg_loss: 2.8936 | reward: -1.3027 | rev_kl: 2.5136 | stu_lens: 101.9375 | mixed_lens: 81.1250 | lr: 1.8000e-06 | scale: 2048.00 | time: 0.364 | step time: 0.364
eval | rougeL: 21.167 | exact_match: 2.800 | rev_kl: 2.446 | lens: 65.074 | pt_loss: 3.013 | lm_loss: 3.432 | kd_loss: 2.593
train | epoch 2 | inner iter: 3/ 4 | ppo epoch: 1/ 4 | global iter: 40/ 160| tot_loss: 6.4726 | rl_loss: 3.4598 | pt_loss: 3.0127 | pg_loss: 0.4929 | reg_loss: 2.9669 | reward: -1.8398 | rev_kl: 2.2477 | stu_lens: 50.7500 | mixed_lens: 88.1250 | lr: 1.8500e-06 | scale: 2048.00 | time: 0.366 | step time: 0.366
train | epoch 2 | inner iter: 0/ 4 | ppo epoch: 2/ 4 | global iter: 41/ 160| tot_loss: 6.7486 | rl_loss: 3.9185 | pt_loss: 2.8302 | pg_loss: 1.1422 | reg_loss: 2.7763 | reward: -1.9473 | rev_kl: 2.3467 | stu_lens: 60.8750 | mixed_lens: 60.3750 | lr: 1.9000e-06 | scale: 2048.00 | time: 0.364 | step time: 0.364
train | epoch 2 | inner iter: 1/ 4 | ppo epoch: 2/ 4 | global iter: 42/ 160| tot_loss: 7.0553 | rl_loss: 4.0452 | pt_loss: 3.0101 | pg_loss: 0.9356 | reg_loss: 3.1096 | reward: -1.9658 | rev_kl: 2.9078 | stu_lens: 93.1875 | mixed_lens: 70.7500 | lr: 1.9500e-06 | scale: 2048.00 | time: 0.365 | step time: 0.365
train | epoch 2 | inner iter: 2/ 4 | ppo epoch: 2/ 4 | global iter: 43/ 160| tot_loss: 6.1622 | rl_loss: 3.1271 | pt_loss: 3.0351 | pg_loss: 0.5073 | reg_loss: 2.6197 | reward: -0.7446 | rev_kl: 1.7225 | stu_lens: 54.8750 | mixed_lens: 78.5625 | lr: 2.0000e-06 | scale: 2048.00 | time: 0.366 | step time: 0.366
train | epoch 2 | inner iter: 3/ 4 | ppo epoch: 2/ 4 | global iter: 44/ 160| tot_loss: 7.6874 | rl_loss: 4.6414 | pt_loss: 3.0460 | pg_loss: 1.5165 | reg_loss: 3.1249 | reward: -1.2021 | rev_kl: 1.9620 | stu_lens: 72.5000 | mixed_lens: 61.6250 | lr: 2.0500e-06 | scale: 2048.00 | time: 0.365 | step time: 0.365
train | epoch 2 | inner iter: 0/ 4 | ppo epoch: 3/ 4 | global iter: 45/ 160| tot_loss: 7.1404 | rl_loss: 3.9895 | pt_loss: 3.1509 | pg_loss: 1.1287 | reg_loss: 2.8608 | reward: -1.3838 | rev_kl: 2.0007 | stu_lens: 45.5625 | mixed_lens: 51.0625 | lr: 2.1000e-06 | scale: 2048.00 | time: 0.370 | step time: 0.370
train | epoch 2 | inner iter: 1/ 4 | ppo epoch: 3/ 4 | global iter: 46/ 160| tot_loss: 6.4918 | rl_loss: 3.2966 | pt_loss: 3.1952 | pg_loss: 0.4387 | reg_loss: 2.8580 | reward: -1.7373 | rev_kl: 2.2316 | stu_lens: 88.8750 | mixed_lens: 69.1875 | lr: 2.1500e-06 | scale: 2048.00 | time: 0.367 | step time: 0.367
train | epoch 2 | inner iter: 2/ 4 | ppo epoch: 3/ 4 | global iter: 47/ 160| tot_loss: 7.1947 | rl_loss: 4.3234 | pt_loss: 2.8713 | pg_loss: 1.4248 | reg_loss: 2.8986 | reward: -1.1377 | rev_kl: 2.2990 | stu_lens: 89.5625 | mixed_lens: 80.9375 | lr: 2.2000e-06 | scale: 2048.00 | time: 0.368 | step time: 0.368
train | epoch 2 | inner iter: 3/ 4 | ppo epoch: 3/ 4 | global iter: 48/ 160| tot_loss: 7.4945 | rl_loss: 4.0246 | pt_loss: 3.4700 | pg_loss: 1.1872 | reg_loss: 2.8374 | reward: -1.6016 | rev_kl: 2.4077 | stu_lens: 57.4375 | mixed_lens: 70.1250 | lr: 2.2500e-06 | scale: 2048.00 | time: 0.365 | step time: 0.365
train | epoch 3 | inner iter: 0/ 4 | ppo epoch: 0/ 4 | global iter: 49/ 160| tot_loss: 6.2650 | rl_loss: 3.1572 | pt_loss: 3.1078 | pg_loss: 0.5124 | reg_loss: 2.6448 | reward: -2.0586 | rev_kl: 2.3399 | stu_lens: 76.0625 | mixed_lens: 57.7500 | lr: 2.3000e-06 | scale: 2048.00 | time: 0.645 | step time: 0.645
train | epoch 3 | inner iter: 1/ 4 | ppo epoch: 0/ 4 | global iter: 50/ 160| tot_loss: 6.8729 | rl_loss: 3.9255 | pt_loss: 2.9474 | pg_loss: 1.4219 | reg_loss: 2.5036 | reward: -1.5967 | rev_kl: 3.5179 | stu_lens: 74.5000 | mixed_lens: 72.9375 | lr: 2.3500e-06 | scale: 2048.00 | time: 0.364 | step time: 0.364
train | epoch 3 | inner iter: 2/ 4 | ppo epoch: 0/ 4 | global iter: 51/ 160| tot_loss: 5.9044 | rl_loss: 2.8643 | pt_loss: 3.0401 | pg_loss: 0.5225 | reg_loss: 2.3418 | reward: -2.4766 | rev_kl: 2.5543 | stu_lens: 42.3750 | mixed_lens: 55.2500 | lr: 2.4000e-06 | scale: 2048.00 | time: 0.364 | step time: 0.364
train | epoch 3 | inner iter: 3/ 4 | ppo epoch: 0/ 4 | global iter: 52/ 160| tot_loss: 6.9690 | rl_loss: 3.9482 | pt_loss: 3.0209 | pg_loss: 1.5998 | reg_loss: 2.3483 | reward: -1.6885 | rev_kl: 3.1211 | stu_lens: 44.8750 | mixed_lens: 47.7500 | lr: 2.4500e-06 | scale: 2048.00 | time: 0.365 | step time: 0.365
train | epoch 3 | inner iter: 0/ 4 | ppo epoch: 1/ 4 | global iter: 53/ 160| tot_loss: 5.0386 | rl_loss: 2.0322 | pt_loss: 3.0064 | pg_loss: 0.7039 | reg_loss: 1.3283 | reward: -1.6348 | rev_kl: 2.8166 | stu_lens: 58.8750 | mixed_lens: 62.1875 | lr: 2.5000e-06 | scale: 2048.00 | time: 0.364 | step time: 0.364
train | epoch 3 | inner iter: 1/ 4 | ppo epoch: 1/ 4 | global iter: 54/ 160| tot_loss: 5.6647 | rl_loss: 2.4420 | pt_loss: 3.2227 | pg_loss: 0.6343 | reg_loss: 1.8076 | reward: -2.1270 | rev_kl: 3.2066 | stu_lens: 62.1250 | mixed_lens: 47.9375 | lr: 2.5500e-06 | scale: 2048.00 | time: 0.367 | step time: 0.367
train | epoch 3 | inner iter: 2/ 4 | ppo epoch: 1/ 4 | global iter: 55/ 160| tot_loss: 6.9695 | rl_loss: 3.9436 | pt_loss: 3.0259 | pg_loss: 1.2622 | reg_loss: 2.6814 | reward: -2.0723 | rev_kl: 2.5787 | stu_lens: 50.4375 | mixed_lens: 51.5625 | lr: 2.6000e-06 | scale: 2048.00 | time: 0.363 | step time: 0.363
train | epoch 3 | inner iter: 3/ 4 | ppo epoch: 1/ 4 | global iter: 56/ 160| tot_loss: 6.6155 | rl_loss: 3.4524 | pt_loss: 3.1631 | pg_loss: 0.6906 | reg_loss: 2.7618 | reward: -1.9854 | rev_kl: 2.9313 | stu_lens: 66.3750 | mixed_lens: 72.0000 | lr: 2.6500e-06 | scale: 2048.00 | time: 0.365 | step time: 0.365
train | epoch 3 | inner iter: 0/ 4 | ppo epoch: 2/ 4 | global iter: 57/ 160| tot_loss: 5.0205 | rl_loss: 1.8792 | pt_loss: 3.1413 | pg_loss: 0.3246 | reg_loss: 1.5546 | reward: -2.3691 | rev_kl: 2.1796 | stu_lens: 61.6875 | mixed_lens: 40.7500 | lr: 2.7000e-06 | scale: 2048.00 | time: 0.367 | step time: 0.367
train | epoch 3 | inner iter: 1/ 4 | ppo epoch: 2/ 4 | global iter: 58/ 160| tot_loss: 7.2452 | rl_loss: 4.2972 | pt_loss: 2.9479 | pg_loss: 1.8306 | reg_loss: 2.4666 | reward: -1.8516 | rev_kl: 3.0155 | stu_lens: 73.3125 | mixed_lens: 65.4375 | lr: 2.7500e-06 | scale: 2048.00 | time: 0.365 | step time: 0.365
train | epoch 3 | inner iter: 2/ 4 | ppo epoch: 2/ 4 | global iter: 59/ 160| tot_loss: 5.7360 | rl_loss: 2.8458 | pt_loss: 2.8902 | pg_loss: 0.6121 | reg_loss: 2.2337 | reward: -1.3867 | rev_kl: 2.4098 | stu_lens: 68.0000 | mixed_lens: 56.3750 | lr: 2.8000e-06 | scale: 2048.00 | time: 0.368 | step time: 0.368
eval | rougeL: 21.786 | exact_match: 2.800 | rev_kl: 2.304 | lens: 65.498 | pt_loss: 3.012 | lm_loss: 3.439 | kd_loss: 2.584
train | epoch 3 | inner iter: 3/ 4 | ppo epoch: 2/ 4 | global iter: 60/ 160| tot_loss: 7.2182 | rl_loss: 3.8136 | pt_loss: 3.4046 | pg_loss: 0.9123 | reg_loss: 2.9013 | reward: -2.2109 | rev_kl: 3.9282 | stu_lens: 34.8125 | mixed_lens: 71.1250 | lr: 2.8500e-06 | scale: 2048.00 | time: 0.371 | step time: 0.371
train | epoch 3 | inner iter: 0/ 4 | ppo epoch: 3/ 4 | global iter: 61/ 160| tot_loss: 6.2466 | rl_loss: 3.0557 | pt_loss: 3.1909 | pg_loss: 1.1088 | reg_loss: 1.9469 | reward: -2.2109 | rev_kl: 3.3022 | stu_lens: 48.6875 | mixed_lens: 60.5625 | lr: 2.9000e-06 | scale: 2048.00 | time: 0.364 | step time: 0.364
train | epoch 3 | inner iter: 1/ 4 | ppo epoch: 3/ 4 | global iter: 62/ 160| tot_loss: 6.3855 | rl_loss: 3.3054 | pt_loss: 3.0801 | pg_loss: 0.8967 | reg_loss: 2.4086 | reward: -1.7705 | rev_kl: 2.4576 | stu_lens: 55.5000 | mixed_lens: 60.6875 | lr: 2.9500e-06 | scale: 2048.00 | time: 0.366 | step time: 0.366
train | epoch 3 | inner iter: 2/ 4 | ppo epoch: 3/ 4 | global iter: 63/ 160| tot_loss: 6.6113 | rl_loss: 3.3963 | pt_loss: 3.2150 | pg_loss: 0.6002 | reg_loss: 2.7961 | reward: -1.7734 | rev_kl: 2.8392 | stu_lens: 60.0000 | mixed_lens: 55.5625 | lr: 3.0000e-06 | scale: 2048.00 | time: 0.367 | step time: 0.367
train | epoch 3 | inner iter: 3/ 4 | ppo epoch: 3/ 4 | global iter: 64/ 160| tot_loss: 6.0692 | rl_loss: 2.8235 | pt_loss: 3.2457 | pg_loss: 0.9977 | reg_loss: 1.8258 | reward: -2.0645 | rev_kl: 2.9342 | stu_lens: 73.6250 | mixed_lens: 56.8750 | lr: 3.0500e-06 | scale: 2048.00 | time: 0.367 | step time: 0.367
train | epoch 4 | inner iter: 0/ 4 | ppo epoch: 0/ 4 | global iter: 65/ 160| tot_loss: 5.5424 | rl_loss: 2.2079 | pt_loss: 3.3345 | pg_loss: 1.2006 | reg_loss: 1.0073 | reward: -1.7334 | rev_kl: 2.3148 | stu_lens: 81.7500 | mixed_lens: 51.9375 | lr: 3.1000e-06 | scale: 2048.00 | time: 0.411 | step time: 0.411
train | epoch 4 | inner iter: 1/ 4 | ppo epoch: 0/ 4 | global iter: 66/ 160| tot_loss: 6.4153 | rl_loss: 3.2441 | pt_loss: 3.1712 | pg_loss: 0.6562 | reg_loss: 2.5879 | reward: -1.0107 | rev_kl: 2.7347 | stu_lens: 69.0000 | mixed_lens: 75.8750 | lr: 3.1500e-06 | scale: 2048.00 | time: 0.364 | step time: 0.364
train | epoch 4 | inner iter: 2/ 4 | ppo epoch: 0/ 4 | global iter: 67/ 160| tot_loss: 6.8531 | rl_loss: 3.6034 | pt_loss: 3.2498 | pg_loss: 0.7934 | reg_loss: 2.8099 | reward: -2.5996 | rev_kl: 2.8171 | stu_lens: 90.5625 | mixed_lens: 93.6250 | lr: 3.2000e-06 | scale: 2048.00 | time: 0.364 | step time: 0.364
train | epoch 4 | inner iter: 3/ 4 | ppo epoch: 0/ 4 | global iter: 68/ 160| tot_loss: 6.7168 | rl_loss: 3.3578 | pt_loss: 3.3590 | pg_loss: 0.6544 | reg_loss: 2.7034 | reward: -2.1895 | rev_kl: 2.2260 | stu_lens: 61.1250 | mixed_lens: 73.5625 | lr: 3.2500e-06 | scale: 2048.00 | time: 0.368 | step time: 0.368
train | epoch 4 | inner iter: 0/ 4 | ppo epoch: 1/ 4 | global iter: 69/ 160| tot_loss: 6.2529 | rl_loss: 3.2390 | pt_loss: 3.0139 | pg_loss: 0.4003 | reg_loss: 2.8388 | reward: -1.2725 | rev_kl: 2.6633 | stu_lens: 78.4375 | mixed_lens: 86.6875 | lr: 3.3000e-06 | scale: 2048.00 | time: 0.366 | step time: 0.366
train | epoch 4 | inner iter: 1/ 4 | ppo epoch: 1/ 4 | global iter: 70/ 160| tot_loss: 6.9206 | rl_loss: 3.6456 | pt_loss: 3.2750 | pg_loss: 0.8808 | reg_loss: 2.7648 | reward: -2.6641 | rev_kl: 2.9559 | stu_lens: 78.8125 | mixed_lens: 87.7500 | lr: 3.3500e-06 | scale: 2048.00 | time: 0.365 | step time: 0.365
train | epoch 4 | inner iter: 2/ 4 | ppo epoch: 1/ 4 | global iter: 71/ 160| tot_loss: 6.2344 | rl_loss: 3.5199 | pt_loss: 2.7145 | pg_loss: 1.8389 | reg_loss: 1.6810 | reward: -1.7363 | rev_kl: 1.9849 | stu_lens: 59.8750 | mixed_lens: 52.6250 | lr: 3.3500e-06 | scale: 1024.00 | time: 0.358 | step time: 0.358
train | epoch 4 | inner iter: 3/ 4 | ppo epoch: 1/ 4 | global iter: 72/ 160| tot_loss: 5.2262 | rl_loss: 2.2246 | pt_loss: 3.0016 | pg_loss: 0.7458 | reg_loss: 1.4788 | reward: -1.8594 | rev_kl: 2.4886 | stu_lens: 85.3125 | mixed_lens: 67.9375 | lr: 3.4000e-06 | scale: 1024.00 | time: 0.364 | step time: 0.364
train | epoch 4 | inner iter: 0/ 4 | ppo epoch: 2/ 4 | global iter: 73/ 160| tot_loss: 6.0056 | rl_loss: 2.9609 | pt_loss: 3.0447 | pg_loss: 0.6945 | reg_loss: 2.2664 | reward: -1.7812 | rev_kl: 2.6694 | stu_lens: 57.3125 | mixed_lens: 61.8125 | lr: 3.4500e-06 | scale: 1024.00 | time: 0.364 | step time: 0.364
train | epoch 4 | inner iter: 1/ 4 | ppo epoch: 2/ 4 | global iter: 74/ 160| tot_loss: 6.3009 | rl_loss: 3.4539 | pt_loss: 2.8470 | pg_loss: 0.6392 | reg_loss: 2.8147 | reward: -0.7598 | rev_kl: 2.4806 | stu_lens: 133.7500 | mixed_lens: 128.8750 | lr: 3.5000e-06 | scale: 1024.00 | time: 0.364 | step time: 0.364
train | epoch 4 | inner iter: 2/ 4 | ppo epoch: 2/ 4 | global iter: 75/ 160| tot_loss: 5.6579 | rl_loss: 2.7076 | pt_loss: 2.9502 | pg_loss: 0.5680 | reg_loss: 2.1396 | reward: -1.4854 | rev_kl: 2.4697 | stu_lens: 56.9375 | mixed_lens: 62.0000 | lr: 3.5500e-06 | scale: 1024.00 | time: 0.368 | step time: 0.368
train | epoch 4 | inner iter: 3/ 4 | ppo epoch: 2/ 4 | global iter: 76/ 160| tot_loss: 7.6500 | rl_loss: 4.4674 | pt_loss: 3.1826 | pg_loss: 2.0777 | reg_loss: 2.3897 | reward: -3.5078 | rev_kl: 2.4729 | stu_lens: 54.4375 | mixed_lens: 42.3125 | lr: 3.6000e-06 | scale: 1024.00 | time: 0.368 | step time: 0.368
train | epoch 4 | inner iter: 0/ 4 | ppo epoch: 3/ 4 | global iter: 77/ 160| tot_loss: 5.7953 | rl_loss: 2.7314 | pt_loss: 3.0639 | pg_loss: 0.3760 | reg_loss: 2.3554 | reward: -2.2715 | rev_kl: 2.8485 | stu_lens: 52.9375 | mixed_lens: 60.3125 | lr: 3.6500e-06 | scale: 1024.00 | time: 0.366 | step time: 0.366
train | epoch 4 | inner iter: 1/ 4 | ppo epoch: 3/ 4 | global iter: 78/ 160| tot_loss: 5.6848 | rl_loss: 2.4691 | pt_loss: 3.2157 | pg_loss: 0.3626 | reg_loss: 2.1065 | reward: -0.7622 | rev_kl: 2.3950 | stu_lens: 96.0625 | mixed_lens: 104.3125 | lr: 3.7000e-06 | scale: 1024.00 | time: 0.364 | step time: 0.364
train | epoch 4 | inner iter: 2/ 4 | ppo epoch: 3/ 4 | global iter: 79/ 160| tot_loss: 7.3166 | rl_loss: 4.4231 | pt_loss: 2.8935 | pg_loss: 1.5328 | reg_loss: 2.8902 | reward: -2.8418 | rev_kl: 2.4538 | stu_lens: 58.6875 | mixed_lens: 46.1875 | lr: 3.7500e-06 | scale: 1024.00 | time: 0.366 | step time: 0.366
eval | rougeL: 23.098 | exact_match: 3.300 | rev_kl: 2.128 | lens: 74.373 | pt_loss: 3.014 | lm_loss: 3.450 | kd_loss: 2.578
train | epoch 4 | inner iter: 3/ 4 | ppo epoch: 3/ 4 | global iter: 80/ 160| tot_loss: 6.5506 | rl_loss: 3.5699 | pt_loss: 2.9807 | pg_loss: 1.1578 | reg_loss: 2.4121 | reward: -1.6592 | rev_kl: 2.3953 | stu_lens: 94.7500 | mixed_lens: 84.1875 | lr: 3.8000e-06 | scale: 1024.00 | time: 0.368 | step time: 0.368
train | epoch 5 | inner iter: 0/ 4 | ppo epoch: 0/ 4 | global iter: 81/ 160| tot_loss: 7.0416 | rl_loss: 3.6395 | pt_loss: 3.4021 | pg_loss: 0.8753 | reg_loss: 2.7642 | reward: -1.7236 | rev_kl: 2.4504 | stu_lens: 32.3125 | mixed_lens: 62.1875 | lr: 3.8500e-06 | scale: 1024.00 | time: 0.365 | step time: 0.365
train | epoch 5 | inner iter: 1/ 4 | ppo epoch: 0/ 4 | global iter: 82/ 160| tot_loss: 5.4813 | rl_loss: 2.2122 | pt_loss: 3.2690 | pg_loss: 0.8378 | reg_loss: 1.3745 | reward: -1.8965 | rev_kl: 2.2782 | stu_lens: 61.3125 | mixed_lens: 55.1875 | lr: 3.9000e-06 | scale: 1024.00 | time: 0.364 | step time: 0.364
train | epoch 5 | inner iter: 2/ 4 | ppo epoch: 0/ 4 | global iter: 83/ 160| tot_loss: 6.3257 | rl_loss: 3.0270 | pt_loss: 3.2987 | pg_loss: 0.4682 | reg_loss: 2.5588 | reward: -0.9775 | rev_kl: 2.3965 | stu_lens: 108.2500 | mixed_lens: 90.5000 | lr: 3.9500e-06 | scale: 1024.00 | time: 0.367 | step time: 0.367
train | epoch 5 | inner iter: 3/ 4 | ppo epoch: 0/ 4 | global iter: 84/ 160| tot_loss: 5.7535 | rl_loss: 3.1597 | pt_loss: 2.5938 | pg_loss: 0.7851 | reg_loss: 2.3746 | reward: -2.6172 | rev_kl: 1.7460 | stu_lens: 70.6250 | mixed_lens: 49.1875 | lr: 4.0000e-06 | scale: 1024.00 | time: 0.365 | step time: 0.365
train | epoch 5 | inner iter: 0/ 4 | ppo epoch: 1/ 4 | global iter: 85/ 160| tot_loss: 4.5958 | rl_loss: 1.4772 | pt_loss: 3.1186 | pg_loss: 0.6324 | reg_loss: 0.8448 | reward: -2.4883 | rev_kl: 1.7702 | stu_lens: 32.5000 | mixed_lens: 46.4375 | lr: 4.0500e-06 | scale: 1024.00 | time: 0.367 | step time: 0.367
train | epoch 5 | inner iter: 1/ 4 | ppo epoch: 1/ 4 | global iter: 86/ 160| tot_loss: 6.3875 | rl_loss: 3.2992 | pt_loss: 3.0883 | pg_loss: 0.6216 | reg_loss: 2.6776 | reward: -1.4160 | rev_kl: 2.6033 | stu_lens: 98.3125 | mixed_lens: 82.4375 | lr: 4.1000e-06 | scale: 1024.00 | time: 0.366 | step time: 0.366
train | epoch 5 | inner iter: 2/ 4 | ppo epoch: 1/ 4 | global iter: 87/ 160| tot_loss: 6.8292 | rl_loss: 3.8961 | pt_loss: 2.9330 | pg_loss: 1.0610 | reg_loss: 2.8351 | reward: -0.9526 | rev_kl: 2.0824 | stu_lens: 78.8750 | mixed_lens: 55.8750 | lr: 4.1500e-06 | scale: 1024.00 | time: 0.366 | step time: 0.366
train | epoch 5 | inner iter: 3/ 4 | ppo epoch: 1/ 4 | global iter: 88/ 160| tot_loss: 6.0350 | rl_loss: 2.7158 | pt_loss: 3.3192 | pg_loss: 0.8140 | reg_loss: 1.9018 | reward: -2.3574 | rev_kl: 2.4151 | stu_lens: 62.8125 | mixed_lens: 72.3125 | lr: 4.2000e-06 | scale: 1024.00 | time: 0.367 | step time: 0.367
train | epoch 5 | inner iter: 0/ 4 | ppo epoch: 2/ 4 | global iter: 89/ 160| tot_loss: 5.8060 | rl_loss: 2.8232 | pt_loss: 2.9828 | pg_loss: 0.4847 | reg_loss: 2.3384 | reward: -1.4824 | rev_kl: 2.2524 | stu_lens: 80.6875 | mixed_lens: 63.3125 | lr: 4.2500e-06 | scale: 1024.00 | time: 0.367 | step time: 0.367
train | epoch 5 | inner iter: 1/ 4 | ppo epoch: 2/ 4 | global iter: 90/ 160| tot_loss: 6.7087 | rl_loss: 3.6483 | pt_loss: 3.0604 | pg_loss: 0.9358 | reg_loss: 2.7125 | reward: -0.9570 | rev_kl: 2.4727 | stu_lens: 67.1875 | mixed_lens: 70.8750 | lr: 4.3000e-06 | scale: 1024.00 | time: 0.366 | step time: 0.366
train | epoch 5 | inner iter: 2/ 4 | ppo epoch: 2/ 4 | global iter: 91/ 160| tot_loss: 6.1429 | rl_loss: 2.8992 | pt_loss: 3.2437 | pg_loss: 0.6564 | reg_loss: 2.2427 | reward: -1.2529 | rev_kl: 2.2809 | stu_lens: 71.3125 | mixed_lens: 84.2500 | lr: 4.3500e-06 | scale: 1024.00 | time: 0.364 | step time: 0.364
train | epoch 5 | inner iter: 3/ 4 | ppo epoch: 2/ 4 | global iter: 92/ 160| tot_loss: 6.6836 | rl_loss: 3.0124 | pt_loss: 3.6713 | pg_loss: 0.8500 | reg_loss: 2.1624 | reward: -3.5215 | rev_kl: 1.8651 | stu_lens: 53.3125 | mixed_lens: 38.6250 | lr: 4.4000e-06 | scale: 1024.00 | time: 0.365 | step time: 0.365
train | epoch 5 | inner iter: 0/ 4 | ppo epoch: 3/ 4 | global iter: 93/ 160| tot_loss: 6.1867 | rl_loss: 3.1526 | pt_loss: 3.0341 | pg_loss: 0.8937 | reg_loss: 2.2589 | reward: -3.5273 | rev_kl: 1.8690 | stu_lens: 50.6875 | mixed_lens: 49.7500 | lr: 4.4500e-06 | scale: 1024.00 | time: 0.365 | step time: 0.365
train | epoch 5 | inner iter: 1/ 4 | ppo epoch: 3/ 4 | global iter: 94/ 160| tot_loss: 5.3309 | rl_loss: 1.8881 | pt_loss: 3.4429 | pg_loss: 0.0555 | reg_loss: 1.8325 | reward: -1.5225 | rev_kl: 2.3576 | stu_lens: 83.3750 | mixed_lens: 60.8750 | lr: 4.5000e-06 | scale: 1024.00 | time: 0.365 | step time: 0.365
train | epoch 5 | inner iter: 2/ 4 | ppo epoch: 3/ 4 | global iter: 95/ 160| tot_loss: 6.3594 | rl_loss: 3.3826 | pt_loss: 2.9768 | pg_loss: 0.6333 | reg_loss: 2.7492 | reward: -0.9434 | rev_kl: 2.6928 | stu_lens: 93.5000 | mixed_lens: 97.8125 | lr: 4.5500e-06 | scale: 1024.00 | time: 0.366 | step time: 0.366
train | epoch 5 | inner iter: 3/ 4 | ppo epoch: 3/ 4 | global iter: 96/ 160| tot_loss: 6.4160 | rl_loss: 3.3136 | pt_loss: 3.1024 | pg_loss: 1.8859 | reg_loss: 1.4277 | reward: -1.2217 | rev_kl: 1.9516 | stu_lens: 44.9375 | mixed_lens: 48.6250 | lr: 4.6000e-06 | scale: 1024.00 | time: 0.366 | step time: 0.366
train | epoch 6 | inner iter: 0/ 4 | ppo epoch: 0/ 4 | global iter: 97/ 160| tot_loss: 7.1303 | rl_loss: 3.8633 | pt_loss: 3.2670 | pg_loss: 1.4134 | reg_loss: 2.4500 | reward: -0.8579 | rev_kl: 2.2739 | stu_lens: 78.7500 | mixed_lens: 57.1250 | lr: 4.6500e-06 | scale: 1024.00 | time: 5.013 | step time: 5.013
train | epoch 6 | inner iter: 1/ 4 | ppo epoch: 0/ 4 | global iter: 98/ 160| tot_loss: 6.3917 | rl_loss: 3.3670 | pt_loss: 3.0246 | pg_loss: 0.9592 | reg_loss: 2.4078 | reward: -1.0117 | rev_kl: 1.8512 | stu_lens: 64.0625 | mixed_lens: 67.3750 | lr: 4.7000e-06 | scale: 1024.00 | time: 0.365 | step time: 0.365
train | epoch 6 | inner iter: 2/ 4 | ppo epoch: 0/ 4 | global iter: 99/ 160| tot_loss: 6.9055 | rl_loss: 3.3964 | pt_loss: 3.5091 | pg_loss: 0.9960 | reg_loss: 2.4004 | reward: -2.0723 | rev_kl: 3.4908 | stu_lens: 45.9375 | mixed_lens: 56.3750 | lr: 4.7500e-06 | scale: 1024.00 | time: 0.365 | step time: 0.365
eval | rougeL: 24.365 | exact_match: 3.500 | rev_kl: 2.109 | lens: 72.274 | pt_loss: 3.020 | lm_loss: 3.466 | kd_loss: 2.574
train | epoch 6 | inner iter: 3/ 4 | ppo epoch: 0/ 4 | global iter: 100/ 160| tot_loss: 7.2655 | rl_loss: 4.2092 | pt_loss: 3.0564 | pg_loss: 1.8737 | reg_loss: 2.3355 | reward: -1.5566 | rev_kl: 2.4011 | stu_lens: 99.9375 | mixed_lens: 55.2500 | lr: 4.8000e-06 | scale: 1024.00 | time: 0.367 | step time: 0.367
train | epoch 6 | inner iter: 0/ 4 | ppo epoch: 1/ 4 | global iter: 101/ 160| tot_loss: 6.7582 | rl_loss: 3.7091 | pt_loss: 3.0492 | pg_loss: 1.0853 | reg_loss: 2.6238 | reward: -1.2754 | rev_kl: 2.2437 | stu_lens: 92.8750 | mixed_lens: 72.5625 | lr: 4.8500e-06 | scale: 1024.00 | time: 0.662 | step time: 0.662
train | epoch 6 | inner iter: 1/ 4 | ppo epoch: 1/ 4 | global iter: 102/ 160| tot_loss: 5.7520 | rl_loss: 2.4874 | pt_loss: 3.2646 | pg_loss: 0.8479 | reg_loss: 1.6394 | reward: -1.2607 | rev_kl: 3.1277 | stu_lens: 70.1875 | mixed_lens: 73.5000 | lr: 4.9000e-06 | scale: 1024.00 | time: 0.369 | step time: 0.369
train | epoch 6 | inner iter: 2/ 4 | ppo epoch: 1/ 4 | global iter: 103/ 160| tot_loss: 7.1018 | rl_loss: 4.1950 | pt_loss: 2.9069 | pg_loss: 1.8200 | reg_loss: 2.3749 | reward: -1.8018 | rev_kl: 2.7699 | stu_lens: 59.0000 | mixed_lens: 36.1250 | lr: 4.9500e-06 | scale: 1024.00 | time: 3.445 | step time: 3.445
train | epoch 6 | inner iter: 3/ 4 | ppo epoch: 1/ 4 | global iter: 104/ 160| tot_loss: 6.6216 | rl_loss: 3.6208 | pt_loss: 3.0008 | pg_loss: 1.2250 | reg_loss: 2.3958 | reward: -1.1611 | rev_kl: 1.8757 | stu_lens: 66.6250 | mixed_lens: 53.9375 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.374 | step time: 0.374
train | epoch 6 | inner iter: 0/ 4 | ppo epoch: 2/ 4 | global iter: 105/ 160| tot_loss: 6.6670 | rl_loss: 3.4995 | pt_loss: 3.1676 | pg_loss: 1.0114 | reg_loss: 2.4881 | reward: -1.5938 | rev_kl: 2.7500 | stu_lens: 79.0000 | mixed_lens: 80.5625 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.368 | step time: 0.368
train | epoch 6 | inner iter: 1/ 4 | ppo epoch: 2/ 4 | global iter: 106/ 160| tot_loss: 6.2633 | rl_loss: 3.5184 | pt_loss: 2.7449 | pg_loss: 1.1150 | reg_loss: 2.4034 | reward: -1.8125 | rev_kl: 2.3870 | stu_lens: 55.0625 | mixed_lens: 50.9375 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.368 | step time: 0.368
train | epoch 6 | inner iter: 2/ 4 | ppo epoch: 2/ 4 | global iter: 107/ 160| tot_loss: 7.7823 | rl_loss: 4.6325 | pt_loss: 3.1498 | pg_loss: 2.2491 | reg_loss: 2.3835 | reward: -1.1035 | rev_kl: 1.7522 | stu_lens: 55.8125 | 
mixed_lens: 36.7500 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.365 | step time: 0.365 train | epoch 6 | inner iter: 3/ 4 | ppo epoch: 2/ 4 | global iter: 108/ 160| tot_loss: 5.4438 | rl_loss: 2.3230 | pt_loss: 3.1208 | pg_loss: 0.7814 | reg_loss: 1.5415 | reward: -0.9888 | rev_kl: 3.1278 | stu_lens: 98.8125 | mixed_lens: 67.8750 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.369 | step time: 0.369 train | epoch 6 | inner iter: 0/ 4 | ppo epoch: 3/ 4 | global iter: 109/ 160| tot_loss: 7.3565 | rl_loss: 4.1684 | pt_loss: 3.1881 | pg_loss: 1.5070 | reg_loss: 2.6614 | reward: -1.2510 | rev_kl: 3.0104 | stu_lens: 47.4375 | mixed_lens: 62.5000 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.364 | step time: 0.364 train | epoch 6 | inner iter: 1/ 4 | ppo epoch: 3/ 4 | global iter: 110/ 160| tot_loss: 6.9757 | rl_loss: 4.0425 | pt_loss: 2.9332 | pg_loss: 1.5346 | reg_loss: 2.5078 | reward: -1.9541 | rev_kl: 2.2952 | stu_lens: 88.3750 | mixed_lens: 50.9375 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.367 | step time: 0.367 train | epoch 6 | inner iter: 2/ 4 | ppo epoch: 3/ 4 | global iter: 111/ 160| tot_loss: 5.9752 | rl_loss: 2.8322 | pt_loss: 3.1430 | pg_loss: 1.1665 | reg_loss: 1.6657 | reward: -1.1338 | rev_kl: 2.7708 | stu_lens: 84.1875 | mixed_lens: 65.2500 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.365 | step time: 0.365 train | epoch 6 | inner iter: 3/ 4 | ppo epoch: 3/ 4 | global iter: 112/ 160| tot_loss: 6.2955 | rl_loss: 3.1527 | pt_loss: 3.1428 | pg_loss: 0.8947 | reg_loss: 2.2580 | reward: -1.1602 | rev_kl: 1.9407 | stu_lens: 68.6875 | mixed_lens: 57.4375 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.654 | step time: 0.654 train | epoch 7 | inner iter: 0/ 4 | ppo epoch: 0/ 4 | global iter: 113/ 160| tot_loss: 6.4900 | rl_loss: 3.4650 | pt_loss: 3.0250 | pg_loss: 0.7889 | reg_loss: 2.6761 | reward: -1.4492 | rev_kl: 2.3875 | stu_lens: 54.6250 | mixed_lens: 76.1250 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.438 | step time: 0.438 train | epoch 7 | inner iter: 1/ 4 | ppo 
epoch: 0/ 4 | global iter: 114/ 160| tot_loss: 5.8606 | rl_loss: 2.7508 | pt_loss: 3.1097 | pg_loss: 0.4064 | reg_loss: 2.3445 | reward: -0.9619 | rev_kl: 2.1390 | stu_lens: 95.6250 | mixed_lens: 109.1250 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.365 | step time: 0.365 train | epoch 7 | inner iter: 2/ 4 | ppo epoch: 0/ 4 | global iter: 115/ 160| tot_loss: 7.3387 | rl_loss: 4.1205 | pt_loss: 3.2182 | pg_loss: 1.3736 | reg_loss: 2.7469 | reward: -1.4561 | rev_kl: 2.0730 | stu_lens: 69.4375 | mixed_lens: 62.6875 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.367 | step time: 0.367 train | epoch 7 | inner iter: 3/ 4 | ppo epoch: 0/ 4 | global iter: 116/ 160| tot_loss: 6.7367 | rl_loss: 3.5925 | pt_loss: 3.1442 | pg_loss: 1.0952 | reg_loss: 2.4973 | reward: -0.4773 | rev_kl: 2.5111 | stu_lens: 62.1875 | mixed_lens: 64.3125 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.367 | step time: 0.367 train | epoch 7 | inner iter: 0/ 4 | ppo epoch: 1/ 4 | global iter: 117/ 160| tot_loss: 5.7252 | rl_loss: 2.5792 | pt_loss: 3.1460 | pg_loss: 0.3592 | reg_loss: 2.2200 | reward: -2.0254 | rev_kl: 2.4545 | stu_lens: 85.4375 | mixed_lens: 58.5000 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.368 | step time: 0.368 train | epoch 7 | inner iter: 1/ 4 | ppo epoch: 1/ 4 | global iter: 118/ 160| tot_loss: 5.7815 | rl_loss: 2.7799 | pt_loss: 3.0017 | pg_loss: 0.7179 | reg_loss: 2.0620 | reward: -0.6519 | rev_kl: 1.8954 | stu_lens: 51.0625 | mixed_lens: 103.4375 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.365 | step time: 0.365 train | epoch 7 | inner iter: 2/ 4 | ppo epoch: 1/ 4 | global iter: 119/ 160| tot_loss: 6.6016 | rl_loss: 3.6464 | pt_loss: 2.9552 | pg_loss: 0.9027 | reg_loss: 2.7437 | reward: -0.4995 | rev_kl: 2.6690 | stu_lens: 84.5000 | mixed_lens: 80.6875 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.364 | step time: 0.364 eval | rougeL: 23.271 | exact_match: 3.100 | rev_kl: 1.938 | lens: 72.509 | pt_loss: 3.022 | lm_loss: 3.471 | kd_loss: 2.572 train | epoch 7 | inner iter: 3/ 4 | ppo 
epoch: 1/ 4 | global iter: 120/ 160| tot_loss: 7.6094 | rl_loss: 4.4114 | pt_loss: 3.1981 | pg_loss: 1.3944 | reg_loss: 3.0169 | reward: -1.1670 | rev_kl: 2.0917 | stu_lens: 60.8750 | mixed_lens: 69.6250 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.364 | step time: 0.364 train | epoch 7 | inner iter: 0/ 4 | ppo epoch: 2/ 4 | global iter: 121/ 160| tot_loss: 6.7497 | rl_loss: 3.5124 | pt_loss: 3.2373 | pg_loss: 0.8017 | reg_loss: 2.7107 | reward: -1.1982 | rev_kl: 2.9580 | stu_lens: 87.1250 | mixed_lens: 86.0625 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.364 | step time: 0.364 train | epoch 7 | inner iter: 1/ 4 | ppo epoch: 2/ 4 | global iter: 122/ 160| tot_loss: 7.0310 | rl_loss: 3.8811 | pt_loss: 3.1499 | pg_loss: 1.7440 | reg_loss: 2.1371 | reward: -0.7183 | rev_kl: 1.8985 | stu_lens: 51.6875 | mixed_lens: 64.1875 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.364 | step time: 0.364 train | epoch 7 | inner iter: 2/ 4 | ppo epoch: 2/ 4 | global iter: 123/ 160| tot_loss: 6.0254 | rl_loss: 2.6467 | pt_loss: 3.3787 | pg_loss: 0.6445 | reg_loss: 2.0022 | reward: -1.2080 | rev_kl: 2.2109 | stu_lens: 58.8750 | mixed_lens: 77.5625 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.363 | step time: 0.363 train | epoch 7 | inner iter: 3/ 4 | ppo epoch: 2/ 4 | global iter: 124/ 160| tot_loss: 6.5754 | rl_loss: 3.3731 | pt_loss: 3.2023 | pg_loss: 0.5857 | reg_loss: 2.7874 | reward: -1.2207 | rev_kl: 2.0432 | stu_lens: 84.1875 | mixed_lens: 84.4375 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.364 | step time: 0.364 train | epoch 7 | inner iter: 0/ 4 | ppo epoch: 3/ 4 | global iter: 125/ 160| tot_loss: 6.8055 | rl_loss: 3.5715 | pt_loss: 3.2340 | pg_loss: 1.0180 | reg_loss: 2.5534 | reward: -1.1279 | rev_kl: 1.9385 | stu_lens: 69.0000 | mixed_lens: 85.5000 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.365 | step time: 0.365 train | epoch 7 | inner iter: 1/ 4 | ppo epoch: 3/ 4 | global iter: 126/ 160| tot_loss: 5.9534 | rl_loss: 2.5188 | pt_loss: 3.4346 | pg_loss: 0.2802 | reg_loss: 2.2386 
| reward: -1.1680 | rev_kl: 2.4618 | stu_lens: 107.8750 | mixed_lens: 100.6875 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.364 | step time: 0.364 train | epoch 7 | inner iter: 2/ 4 | ppo epoch: 3/ 4 | global iter: 127/ 160| tot_loss: 7.2144 | rl_loss: 3.9797 | pt_loss: 3.2347 | pg_loss: 1.2139 | reg_loss: 2.7658 | reward: -0.5410 | rev_kl: 2.7889 | stu_lens: 54.1875 | mixed_lens: 78.4375 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.365 | step time: 0.365 train | epoch 7 | inner iter: 3/ 4 | ppo epoch: 3/ 4 | global iter: 128/ 160| tot_loss: 6.2720 | rl_loss: 3.2531 | pt_loss: 3.0189 | pg_loss: 0.9293 | reg_loss: 2.3237 | reward: -1.5078 | rev_kl: 1.9215 | stu_lens: 50.8125 | mixed_lens: 47.6250 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.364 | step time: 0.364 train | epoch 8 | inner iter: 0/ 4 | ppo epoch: 0/ 4 | global iter: 129/ 160| tot_loss: 6.8848 | rl_loss: 3.7039 | pt_loss: 3.1809 | pg_loss: 1.1745 | reg_loss: 2.5294 | reward: -1.6123 | rev_kl: 1.9468 | stu_lens: 60.6250 | mixed_lens: 52.2500 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.469 | step time: 0.469 train | epoch 8 | inner iter: 1/ 4 | ppo epoch: 0/ 4 | global iter: 130/ 160| tot_loss: 6.3558 | rl_loss: 3.3180 | pt_loss: 3.0378 | pg_loss: 0.6808 | reg_loss: 2.6372 | reward: -1.6797 | rev_kl: 3.0009 | stu_lens: 87.9375 | mixed_lens: 93.1250 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.366 | step time: 0.366 train | epoch 8 | inner iter: 2/ 4 | ppo epoch: 0/ 4 | global iter: 131/ 160| tot_loss: 7.8841 | rl_loss: 5.1608 | pt_loss: 2.7233 | pg_loss: 2.1873 | reg_loss: 2.9735 | reward: -1.1123 | rev_kl: 1.7497 | stu_lens: 92.0625 | mixed_lens: 62.5000 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.367 | step time: 0.367 train | epoch 8 | inner iter: 3/ 4 | ppo epoch: 0/ 4 | global iter: 132/ 160| tot_loss: 5.6821 | rl_loss: 2.5853 | pt_loss: 3.0968 | pg_loss: 0.9442 | reg_loss: 1.6410 | reward: -1.1426 | rev_kl: 2.6260 | stu_lens: 68.4375 | mixed_lens: 59.0000 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.365 
| step time: 0.365 train | epoch 8 | inner iter: 0/ 4 | ppo epoch: 1/ 4 | global iter: 133/ 160| tot_loss: 6.3715 | rl_loss: 3.2097 | pt_loss: 3.1617 | pg_loss: 0.7776 | reg_loss: 2.4321 | reward: -1.5186 | rev_kl: 1.6848 | stu_lens: 92.0625 | mixed_lens: 72.5000 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.366 | step time: 0.366 train | epoch 8 | inner iter: 1/ 4 | ppo epoch: 1/ 4 | global iter: 134/ 160| tot_loss: 6.7943 | rl_loss: 3.4563 | pt_loss: 3.3380 | pg_loss: 1.2914 | reg_loss: 2.1649 | reward: -1.4033 | rev_kl: 2.7789 | stu_lens: 61.7500 | mixed_lens: 59.4375 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.365 | step time: 0.365 train | epoch 8 | inner iter: 2/ 4 | ppo epoch: 1/ 4 | global iter: 135/ 160| tot_loss: 7.4445 | rl_loss: 4.1550 | pt_loss: 3.2895 | pg_loss: 1.2189 | reg_loss: 2.9361 | reward: -1.2891 | rev_kl: 2.4576 | stu_lens: 79.0625 | mixed_lens: 69.6875 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.365 | step time: 0.365 train | epoch 8 | inner iter: 3/ 4 | ppo epoch: 1/ 4 | global iter: 136/ 160| tot_loss: 7.1016 | rl_loss: 3.9315 | pt_loss: 3.1701 | pg_loss: 1.3390 | reg_loss: 2.5925 | reward: -1.3369 | rev_kl: 2.4021 | stu_lens: 76.1875 | mixed_lens: 65.2500 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.367 | step time: 0.367 train | epoch 8 | inner iter: 0/ 4 | ppo epoch: 2/ 4 | global iter: 137/ 160| tot_loss: 8.2712 | rl_loss: 4.9352 | pt_loss: 3.3359 | pg_loss: 1.9644 | reg_loss: 2.9708 | reward: -1.5801 | rev_kl: 2.3652 | stu_lens: 58.3125 | mixed_lens: 75.1875 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.366 | step time: 0.366 train | epoch 8 | inner iter: 1/ 4 | ppo epoch: 2/ 4 | global iter: 138/ 160| tot_loss: 5.7539 | rl_loss: 2.8656 | pt_loss: 2.8883 | pg_loss: 0.7364 | reg_loss: 2.1291 | reward: -0.5889 | rev_kl: 2.5494 | stu_lens: 90.8750 | mixed_lens: 67.1250 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.364 | step time: 0.364 train | epoch 8 | inner iter: 2/ 4 | ppo epoch: 2/ 4 | global iter: 139/ 160| tot_loss: 6.5031 | rl_loss: 
3.8434 | pt_loss: 2.6597 | pg_loss: 1.2734 | reg_loss: 2.5700 | reward: -1.6953 | rev_kl: 1.6881 | stu_lens: 99.3125 | mixed_lens: 56.5625 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.364 | step time: 0.364 eval | rougeL: 24.310 | exact_match: 3.100 | rev_kl: 1.879 | lens: 70.543 | pt_loss: 3.021 | lm_loss: 3.473 | kd_loss: 2.569 train | epoch 8 | inner iter: 3/ 4 | ppo epoch: 2/ 4 | global iter: 140/ 160| tot_loss: 6.6860 | rl_loss: 3.4814 | pt_loss: 3.2046 | pg_loss: 0.8524 | reg_loss: 2.6290 | reward: -1.6836 | rev_kl: 2.7207 | stu_lens: 60.5625 | mixed_lens: 68.0000 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.363 | step time: 0.363 train | epoch 8 | inner iter: 0/ 4 | ppo epoch: 3/ 4 | global iter: 141/ 160| tot_loss: 6.2668 | rl_loss: 3.1277 | pt_loss: 3.1390 | pg_loss: 0.6834 | reg_loss: 2.4444 | reward: -1.5801 | rev_kl: 1.7998 | stu_lens: 124.5625 | mixed_lens: 60.3750 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.365 | step time: 0.365 train | epoch 8 | inner iter: 1/ 4 | ppo epoch: 3/ 4 | global iter: 142/ 160| tot_loss: 6.6961 | rl_loss: 3.5944 | pt_loss: 3.1017 | pg_loss: 1.6158 | reg_loss: 1.9787 | reward: -1.8818 | rev_kl: 2.7080 | stu_lens: 40.5625 | mixed_lens: 45.5000 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.366 | step time: 0.366 train | epoch 8 | inner iter: 2/ 4 | ppo epoch: 3/ 4 | global iter: 143/ 160| tot_loss: 6.8511 | rl_loss: 3.7522 | pt_loss: 3.0990 | pg_loss: 1.1136 | reg_loss: 2.6385 | reward: -0.7070 | rev_kl: 2.4555 | stu_lens: 66.6250 | mixed_lens: 76.8125 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.365 | step time: 0.365 train | epoch 8 | inner iter: 3/ 4 | ppo epoch: 3/ 4 | global iter: 144/ 160| tot_loss: 6.9484 | rl_loss: 3.9231 | pt_loss: 3.0253 | pg_loss: 1.3164 | reg_loss: 2.6066 | reward: -1.3789 | rev_kl: 2.3601 | stu_lens: 77.3125 | mixed_lens: 84.1875 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.365 | step time: 0.365 train | epoch 9 | inner iter: 0/ 4 | ppo epoch: 0/ 4 | global iter: 145/ 160| tot_loss: 5.6491 | rl_loss: 
2.6947 | pt_loss: 2.9544 | pg_loss: 0.7071 | reg_loss: 1.9876 | reward: -1.4395 | rev_kl: 2.0307 | stu_lens: 56.5000 | mixed_lens: 48.5625 | lr: 5.0000e-06 | scale: 1024.00 | time: 3.214 | step time: 3.214 train | epoch 9 | inner iter: 1/ 4 | ppo epoch: 0/ 4 | global iter: 146/ 160| tot_loss: 7.5079 | rl_loss: 4.1240 | pt_loss: 3.3838 | pg_loss: 1.5348 | reg_loss: 2.5892 | reward: -1.1895 | rev_kl: 2.0244 | stu_lens: 124.4375 | mixed_lens: 77.6875 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.365 | step time: 0.365 train | epoch 9 | inner iter: 2/ 4 | ppo epoch: 0/ 4 | global iter: 147/ 160| tot_loss: 4.8806 | rl_loss: 1.7228 | pt_loss: 3.1578 | pg_loss: 0.6120 | reg_loss: 1.1108 | reward: -0.9238 | rev_kl: 1.7947 | stu_lens: 77.2500 | mixed_lens: 82.1875 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.365 | step time: 0.365 train | epoch 9 | inner iter: 3/ 4 | ppo epoch: 0/ 4 | global iter: 148/ 160| tot_loss: 8.5711 | rl_loss: 5.4525 | pt_loss: 3.1186 | pg_loss: 3.0341 | reg_loss: 2.4184 | reward: -2.3066 | rev_kl: 3.0048 | stu_lens: 68.3125 | mixed_lens: 45.9375 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.365 | step time: 0.365 train | epoch 9 | inner iter: 0/ 4 | ppo epoch: 1/ 4 | global iter: 149/ 160| tot_loss: 5.6774 | rl_loss: 2.5344 | pt_loss: 3.1429 | pg_loss: 0.6879 | reg_loss: 1.8465 | reward: -1.2559 | rev_kl: 1.8338 | stu_lens: 78.3125 | mixed_lens: 72.5000 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.365 | step time: 0.365 train | epoch 9 | inner iter: 1/ 4 | ppo epoch: 1/ 4 | global iter: 150/ 160| tot_loss: 4.7362 | rl_loss: 1.9195 | pt_loss: 2.8167 | pg_loss: 0.7387 | reg_loss: 1.1808 | reward: -1.2725 | rev_kl: 2.1225 | stu_lens: 124.2500 | mixed_lens: 71.3125 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.365 | step time: 0.365 train | epoch 9 | inner iter: 2/ 4 | ppo epoch: 1/ 4 | global iter: 151/ 160| tot_loss: 7.6812 | rl_loss: 4.5739 | pt_loss: 3.1073 | pg_loss: 2.2049 | reg_loss: 2.3690 | reward: -2.2363 | rev_kl: 2.2422 | stu_lens: 54.5000 | 
mixed_lens: 33.8750 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.364 | step time: 0.364 train | epoch 9 | inner iter: 3/ 4 | ppo epoch: 1/ 4 | global iter: 152/ 160| tot_loss: 6.8422 | rl_loss: 3.4832 | pt_loss: 3.3590 | pg_loss: 1.2115 | reg_loss: 2.2717 | reward: -1.0957 | rev_kl: 2.6562 | stu_lens: 69.4375 | mixed_lens: 76.6875 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.365 | step time: 0.365 train | epoch 9 | inner iter: 0/ 4 | ppo epoch: 2/ 4 | global iter: 153/ 160| tot_loss: 4.6541 | rl_loss: 1.4562 | pt_loss: 3.1979 | pg_loss: 0.4877 | reg_loss: 0.9685 | reward: -1.6797 | rev_kl: 3.0440 | stu_lens: 47.3750 | mixed_lens: 62.5625 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.368 | step time: 0.368 train | epoch 9 | inner iter: 1/ 4 | ppo epoch: 2/ 4 | global iter: 154/ 160| tot_loss: 6.3939 | rl_loss: 3.1886 | pt_loss: 3.2053 | pg_loss: 1.2633 | reg_loss: 1.9253 | reward: -1.2373 | rev_kl: 1.8693 | stu_lens: 96.5625 | mixed_lens: 56.4375 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.365 | step time: 0.365 train | epoch 9 | inner iter: 2/ 4 | ppo epoch: 2/ 4 | global iter: 155/ 160| tot_loss: 6.1353 | rl_loss: 3.2053 | pt_loss: 2.9300 | pg_loss: 1.1344 | reg_loss: 2.0709 | reward: -1.0049 | rev_kl: 1.9559 | stu_lens: 111.3125 | mixed_lens: 89.5625 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.365 | step time: 0.365 train | epoch 9 | inner iter: 3/ 4 | ppo epoch: 2/ 4 | global iter: 156/ 160| tot_loss: 7.0041 | rl_loss: 3.8199 | pt_loss: 3.1842 | pg_loss: 1.5461 | reg_loss: 2.2738 | reward: -1.9375 | rev_kl: 1.9854 | stu_lens: 71.2500 | mixed_lens: 45.8125 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.365 | step time: 0.365 train | epoch 9 | inner iter: 0/ 4 | ppo epoch: 3/ 4 | global iter: 157/ 160| tot_loss: 7.0166 | rl_loss: 4.0733 | pt_loss: 2.9433 | pg_loss: 1.5655 | reg_loss: 2.5078 | reward: -1.6660 | rev_kl: 2.0529 | stu_lens: 96.2500 | mixed_lens: 69.9375 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.366 | step time: 0.366 train | epoch 9 | inner iter: 1/ 4 | 
ppo epoch: 3/ 4 | global iter: 158/ 160| tot_loss: 4.9263 | rl_loss: 1.7438 | pt_loss: 3.1825 | pg_loss: 0.5495 | reg_loss: 1.1943 | reward: -1.8271 | rev_kl: 2.2048 | stu_lens: 87.6250 | mixed_lens: 62.1250 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.365 | step time: 0.365 train | epoch 9 | inner iter: 2/ 4 | ppo epoch: 3/ 4 | global iter: 159/ 160| tot_loss: 6.7466 | rl_loss: 3.4755 | pt_loss: 3.2712 | pg_loss: 1.6377 | reg_loss: 1.8378 | reward: -0.9443 | rev_kl: 2.1583 | stu_lens: 80.3125 | mixed_lens: 66.0000 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.368 | step time: 0.368 eval | rougeL: 24.353 | exact_match: 3.100 | rev_kl: 1.964 | lens: 74.957 | pt_loss: 3.017 | lm_loss: 3.468 | kd_loss: 2.567 train | epoch 9 | inner iter: 3/ 4 | ppo epoch: 3/ 4 | global iter: 160/ 160| tot_loss: 5.7856 | rl_loss: 2.4925 | pt_loss: 3.2931 | pg_loss: 0.9332 | reg_loss: 1.5593 | reward: -1.4229 | rev_kl: 2.4387 | stu_lens: 62.3125 | mixed_lens: 56.3125 | lr: 5.0000e-06 | scale: 1024.00 | time: 0.367 | step time: 0.367
The max rougeL is 24.353, and when I evaluate the model on dolly, I get 22.40, which is far less than 24.6.
@t1101675 could you please help me figure out where I went wrong? Thanks.
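For reading a long log like the one above, a small helper can pull out the rougeL value of every eval record. This is only a sketch: the regular expression assumes the `eval | rougeL: ...` layout seen in this log, the name `eval_rouge_scores` is made up for illustration, and the sample string is a shortened copy of the eval records, not the full output.

```python
import re

# Matches the start of an eval record as it appears in this log.
EVAL_ROUGE = re.compile(r"eval \| rougeL: (\d+\.\d+)")

def eval_rouge_scores(log_text):
    """Return the rougeL value of every eval record, in log order."""
    return [float(value) for value in EVAL_ROUGE.findall(log_text)]

# Shortened excerpts of the eval records from the log above.
sample = (
    "step time: 0.365 eval | rougeL: 24.365 | exact_match: 3.500 | rev_kl: 2.109 "
    "step time: 0.364 eval | rougeL: 23.271 | exact_match: 3.100 | rev_kl: 1.938 "
    "step time: 0.364 eval | rougeL: 24.310 | exact_match: 3.100 | rev_kl: 1.879 "
    "step time: 0.368 eval | rougeL: 24.353 | exact_match: 3.100 | rev_kl: 1.964 "
)

scores = eval_rouge_scores(sample)
print("eval rougeL values:", scores)
print("best eval rougeL:", max(scores))
```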
the cmd is:
MASTER_ADDR=localhost
MASTER_PORT=${2-2012}
NNODES=1
NODE_RANK=0
GPUS_PER_NODE=${3-4}

DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE \
                  --nnodes $NNODES \
                  --node_rank $NODE_RANK \
                  --master_addr $MASTER_ADDR \
                  --master_port $MASTER_PORT"

BASE_PATH=${1-"xxxx"}
CKPT_NAME="base-init"
CKPT="${BASE_PATH}/results/gpt2/train/minillm_init/gpt2-base"
TEACHER_CKPT_NAME="xlarge-sft"
TEACHER_CKPT="${BASE_PATH}/results/gpt2/train/sft/gpt2-xlarge/"

PROMPT_DATA_DIR="${BASE_PATH}/processed_data/dolly/prompt/gpt2/"
LM_DATA_DIR="${BASE_PATH}/processed_data/openwebtext/gpt2/512/10M/"

SAVE_PATH="${BASE_PATH}/results/gpt2/train/minillm/"

GRAD_ACC=1
BATCH_SIZE=4
CHUNK_SIZE=16

OPTS=""

OPTS+=" --base-path ${BASE_PATH}"
OPTS+=" --model-path ${CKPT}"
OPTS+=" --teacher-model-path ${TEACHER_CKPT}"
OPTS+=" --ckpt-name ${CKPT_NAME}"
OPTS+=" --teacher-ckpt-name ${TEACHER_CKPT_NAME}"
OPTS+=" --n-gpu ${GPUS_PER_NODE}"
OPTS+=" --teacher-model-fp16"

OPTS+=" --prompt-data-dir ${PROMPT_DATA_DIR}"
OPTS+=" --lm-data-dir ${LM_DATA_DIR}"
OPTS+=" --dev-num 1000"
OPTS+=" --num-workers 0"

OPTS+=" --epochs 10"
OPTS+=" --total-iters 5000"
OPTS+=" --kd-ratio 0.5"
OPTS+=" --batch-size ${BATCH_SIZE}"
OPTS+=" --lr 5e-6"
OPTS+=" --lr-min 5e-6"
OPTS+=" --gradient-accumulation-steps ${GRAD_ACC}"
OPTS+=" --max-length 512"
OPTS+=" --max-prompt-length 256"
OPTS+=" --warmup-iters 100"

OPTS+=" --save ${SAVE_PATH}"
OPTS+=" --seed 10"
OPTS+=" --seed-ppo 42"
OPTS+=" --seed-lm 7"
OPTS+=" --save-interval 100"
OPTS+=" --eval-interval 20"
OPTS+=" --log-interval 1"
OPTS+=" --mid-log-num 1"

OPTS+=" --type minillm"
OPTS+=" --ppo-epochs 4"
OPTS+=" --num-rollouts 16"
OPTS+=" --chunk-size ${CHUNK_SIZE}"

OPTS+=" --length-norm"
OPTS+=" --single-step-reg"
OPTS+=" --teacher-mixed-alpha 0.2"

OPTS+=" --reward-scaling 0.5"
OPTS+=" --cliprange-reward 100"

OPTS+=" --do-sample"
OPTS+=" --top-k 0"
OPTS+=" --top-p 1.0"
OPTS+=" --temperature 1.0"

OPTS+=" --deepspeed"
OPTS+=" --deepspeed_config ${BASE_PATH}/configs/deepspeed/ds_config.json"

export NCCL_DEBUG=""
export WANDB_DISABLED=True
export TF_CPP_MIN_LOG_LEVEL=3
export PYTHONPATH=${BASE_PATH}
CMD="torchrun ${DISTRIBUTED_ARGS} ${BASE_PATH}/train_minillm.py ${OPTS} $@"

echo ${CMD}
echo "PYTHONPATH=${PYTHONPATH}"
mkdir -p ${SAVE_PATH}
${CMD}
After changing the batch size to 16 (so that 4 GPUs * 16 = 64), I get 19.82 on dolly, which is a worse result.
1. --epochs should be larger (> 300), so that the log shows total global iters = 5000.
2. --num-rollouts should be larger (num-rollouts * num-gpus should be 256).
3. --chunk-size can be changed for more efficient training.
We have updated our code to make these hyperparameters easier to set. You can ignore 1 & 2 if you use the current scripts.
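The "> 300" figure can be reproduced from the counters in the log above. The sketch below assumes, based only on the logged `inner iter: x/ 4`, `ppo epoch: x/ 4` and `global iter: x/ 160` fields, that each epoch runs ppo-epochs * (num-rollouts / batch-size) global iterations; the function names are hypothetical and this is not taken from the training code.

```python
import math

def global_iters(epochs, ppo_epochs, num_rollouts, batch_size):
    """Global iterations for a run, under the assumption stated above."""
    inner_iters = num_rollouts // batch_size  # 16 // 4 = 4, as in "inner iter: x/ 4"
    return epochs * ppo_epochs * inner_iters

def epochs_needed(total_iters, ppo_epochs, num_rollouts, batch_size):
    """Smallest number of epochs that reaches total_iters global iterations."""
    per_epoch = ppo_epochs * (num_rollouts // batch_size)
    return math.ceil(total_iters / per_epoch)

# Settings from the command in this thread: --epochs 10, --ppo-epochs 4,
# --num-rollouts 16, --batch-size 4, --total-iters 5000.
print(global_iters(10, 4, 16, 4))
print(epochs_needed(5000, 4, 16, 4))
```

With the posted settings the first call matches the 160 global iterations in the log, far short of the 5000 requested by --total-iters.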
Thanks. Let me try the new hyperparameters.
Thanks for your kind help. On dolly, my results roughly match the paper. On UnNI, selfInst and S-NI, mine are lower than the paper. On Vicuna, mine are higher than the paper. I think the seeds may cause the difference.
Thanks again. BTW, pt_loss is computed as
(1 - self.args.kd_ratio) * lm_loss + self.args.kd_ratio * distil_loss
and the distil_loss is:
teacher_probs = F.softmax(teacher_logits, dim=-1, dtype=torch.float32)
inf_mask = torch.isinf(logits)
logprobs = F.log_softmax(logits, dim=-1, dtype=torch.float32)
prod_probs = torch.masked_fill(teacher_probs * logprobs, inf_mask, 0)
x = torch.sum(prod_probs, dim=-1).view(-1)
distil_loss = -torch.sum(x * loss_mask.view(-1), dim=0) / torch.sum(loss_mask.view(-1), dim=0)  #! risk of division by zero
This distil_loss is forward KL. Meanwhile, I noticed that you use the get_rev_kl function in sampler.py, which is reverse KL. Could you please explain why there are two distillation losses?
Thanks very much!
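To make the two quantities in this question concrete: the distil_loss snippet computes the teacher-student cross-entropy, which equals the forward KL plus the teacher's entropy. That entropy does not depend on the student, so both give the same gradients. The reverse KL swaps the roles of the two distributions and is generally a different number. A minimal pure-Python sketch on a toy three-token distribution (the values and function names are illustrative, not from the repository):

```python
import math

def cross_entropy(p, q):
    """-sum_i p_i * log(q_i): the form used by the distil_loss snippet."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    """-sum_i p_i * log(p_i)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def forward_kl(p, q):
    """KL(p || q), with p as the teacher and q as the student."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def reverse_kl(p, q):
    """KL(q || p): the expectation is taken under the student q."""
    return forward_kl(q, p)

teacher = [0.7, 0.2, 0.1]  # toy teacher distribution
student = [0.4, 0.4, 0.2]  # toy student distribution

print("cross-entropy      :", cross_entropy(teacher, student))
print("forward KL + H(p)  :", forward_kl(teacher, student) + entropy(teacher))
print("forward KL         :", forward_kl(teacher, student))
print("reverse KL         :", reverse_kl(teacher, student))
```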
The forward KL loss works as a regularization to prevent the model from collapsing to a single mode when using reverse KL. The effect of this loss is controlled by args.kd_ratio and is optional because we find that the result does not change much with different args.kd_ratio (kd_ratio=0 means only using lm_loss).
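The weighting described here can be checked against the eval records in the log above. The sketch assumes that the logged kd_loss is the distil_loss term (this is not confirmed in the thread) and uses --kd-ratio 0.5 from the posted command; `mix_pt_loss` is a hypothetical helper name.

```python
def mix_pt_loss(lm_loss, distil_loss, kd_ratio):
    """(1 - kd_ratio) * lm_loss + kd_ratio * distil_loss, as quoted in the thread."""
    return (1.0 - kd_ratio) * lm_loss + kd_ratio * distil_loss

# (pt_loss, lm_loss, kd_loss) copied from eval records in the log above.
eval_rows = [
    (3.020, 3.466, 2.574),
    (3.022, 3.471, 2.572),
    (3.021, 3.473, 2.569),
    (3.017, 3.468, 2.567),
]

for pt, lm, kd in eval_rows:
    recomputed = mix_pt_loss(lm, kd, kd_ratio=0.5)
    print(f"logged pt_loss={pt:.3f}  recomputed={recomputed:.4f}")

# With kd_ratio=0 the distillation term drops out and only lm_loss remains.
print(mix_pt_loss(3.466, 2.574, kd_ratio=0.0))
```

The recomputed values agree with the logged pt_loss up to the three-decimal rounding of the log.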
@wutaiqiang @t1101675 Hi! In which file can I set the parameters of the distillation model, for example to distill gpt2-1.5B -> gpt2-120M? Thanks!
I am trying to distill gpt2-1.5B -> gpt2-120M. As I use 4 A100s, I changed GPUS_PER_NODE to ${3-4}.
Batch size remains the same