Closed YifanHao closed 2 months ago
Thanks for your question! In our pilot studies, we found that removing $w_t$ from the single-step part makes training converge slightly faster. Sorry for the confusion. We will clarify this in our code and in the paper in the next version.
Thanks for your reply!
Hello, I was trying to have a better understanding of your method, but I got a little confused about the single-step part in the loss.
In your implementation, I saw you compute `ppo_loss = pg_loss + single_step_reg_loss`. It seems that `single_step_reg_loss` corresponds to the $L_{\text{single}}$ part of the formula, and `pg_loss` is the $L^{\text{norm}}_{\text{Long}}$ part. The `single_step_reg_loss` is then computed by `_reg_loss`, which looks like an unweighted average of step-wise reverse KLD over the sampled sequence, with no reward involved in the computation. However, according to the formula above, the single-step part should be a weighted average of the step-wise reverse KLD, with the reward at each step as the weight.

So why is the reward weight missing in the implementation? Am I missing something, and could you help me clarify it? Thanks for your help!
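For anyone else mapping the code to the formula, here is a minimal sketch of the two variants discussed in this thread, assuming standard PyTorch; the names `step_wise_rkld`, `single_step_loss`, `policy_logits`, `ref_logits`, `rewards`, and `mask` are illustrative and are not identifiers from the repository.

```python
import torch
import torch.nn.functional as F


def step_wise_rkld(policy_logits: torch.Tensor, ref_logits: torch.Tensor) -> torch.Tensor:
    """Per-position reverse KL D_KL(pi || pi_ref) over the vocabulary.

    Both inputs have shape (batch, seq_len, vocab); the result is (batch, seq_len).
    """
    log_p = F.log_softmax(policy_logits, dim=-1)  # current policy log-probs
    log_q = F.log_softmax(ref_logits, dim=-1)     # reference policy log-probs
    return (log_p.exp() * (log_p - log_q)).sum(dim=-1)


def single_step_loss(policy_logits, ref_logits, rewards=None, mask=None):
    """Single-step regularizer as an average of step-wise reverse KLD.

    If `rewards` (shape (batch, seq_len)) is given, each step's KLD is
    weighted by its reward, i.e. the w_t-weighted form written in the paper;
    with rewards=None it reduces to the plain, unweighted average.
    """
    kld = step_wise_rkld(policy_logits, ref_logits)  # (batch, seq_len)
    if rewards is not None:
        kld = rewards * kld                          # reward-weighted variant
    if mask is not None:                             # ignore padding positions
        return (kld * mask).sum() / mask.sum().clamp_min(1)
    return kld.mean()
```

With `rewards=None` this reduces to the plain average of step-wise reverse KLD, matching what the question describes in the released code; passing the per-step rewards gives the $w_t$-weighted form from the paper, which the authors say they dropped because the unweighted version converged slightly faster in their pilot studies.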