Closed YifanHao closed 2 months ago
Thanks for your question! In our pilot studies, we found that removing $w_t$ from the single-step part makes training converge slightly faster. Sorry for the confusion. We will clarify this in our code and in the paper in the next version.
Thanks for your reply!
Hello, I was trying to have a better understanding of your method, but I got a little confused about the single-step part in the loss.
In your implementation, I saw you compute `ppo_loss = pg_loss + single_step_reg_loss`. It seems that `single_step_reg_loss` corresponds to the $L_{\text{single}}$ part of the formula, and `pg_loss` is the $L^{\text{norm}}_{\text{Long}}$ part. The `single_step_reg_loss` is then computed by `_reg_loss`, which looks like an unweighted average of step-wise reverse KLD over the sampled sequence, with no reward involved in the computation. However, according to the formula above, the single-step part should be a weighted average of the step-wise reverse KLD, with the reward at each step as the weight.

So why is the reward weight missing in the implementation? Am I missing something, and could you help me clarify it? Thanks for your help!
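For anyone else mapping the code to the formula, here is a minimal sketch of the two variants discussed in this thread, assuming standard PyTorch; the names `step_wise_rkld`, `single_step_loss`, `policy_logits`, `ref_logits`, `rewards`, and `mask` are illustrative and are not identifiers from the repository.

```python
import torch
import torch.nn.functional as F


def step_wise_rkld(policy_logits: torch.Tensor, ref_logits: torch.Tensor) -> torch.Tensor:
    """Per-position reverse KL D_KL(pi || pi_ref) over the vocabulary.

    Both inputs have shape (batch, seq_len, vocab); the result is (batch, seq_len).
    """
    log_p = F.log_softmax(policy_logits, dim=-1)  # current policy log-probs
    log_q = F.log_softmax(ref_logits, dim=-1)     # reference policy log-probs
    return (log_p.exp() * (log_p - log_q)).sum(dim=-1)


def single_step_loss(policy_logits, ref_logits, rewards=None, mask=None):
    """Single-step regularizer as an average of step-wise reverse KLD.

    If `rewards` (shape (batch, seq_len)) is given, each step's KLD is
    weighted by its reward, i.e. the w_t-weighted form written in the paper;
    with rewards=None it reduces to the plain, unweighted average.
    """
    kld = step_wise_rkld(policy_logits, ref_logits)  # (batch, seq_len)
    if rewards is not None:
        kld = rewards * kld                          # reward-weighted variant
    if mask is not None:                             # ignore padding positions
        return (kld * mask).sum() / mask.sum().clamp_min(1)
    return kld.mean()
```

With `rewards=None` this reduces to the plain average of step-wise reverse KLD, matching what the question describes in the released code; passing the per-step rewards gives the $w_t$-weighted form from the paper, which the authors say they dropped because the unweighted version converged slightly faster in their pilot studies.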