Open dwyzzy opened 1 year ago
You are right! In addition to the KL reward, the advantages and returns are also recomputed. In other frameworks, such as ColossalChat (pictured below), they are all part of "exp_data".
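To make the "exp_data" idea concrete, here is a minimal sketch of computing advantages and returns once, when the experience is collected, and storing them with the rest of the experience. It is not ColossalChat's code or this repository's code: the function name, the dict keys, and the use of GAE with these gamma/lam values are assumptions chosen for illustration.

```python
def gae_advantages_and_returns(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation over one episode (hypothetical helper)."""
    advantages = []
    last_gae = 0.0
    for t in reversed(range(len(rewards))):
        # The value after the final step is taken to be 0 (episode ends there).
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        last_gae = delta + gamma * lam * last_gae
        advantages.append(last_gae)
    advantages.reverse()
    # Returns are the regression targets for the critic.
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


# Illustrative per-token rewards and critic values for one episode.
rewards = [0.0, 0.0, 1.0]
values = [0.0, 0.0, 0.0]

# Computed once at collection time and stored alongside the experience,
# so later training steps read them instead of recomputing them.
advantages, returns = gae_advantages_and_returns(rewards, values, gamma=1.0, lam=1.0)
exp_data = {"rewards": rewards, "values": values,
            "advantages": advantages, "returns": returns}
```

Since the rewards and values in the buffer do not change between training steps on the same experience, the stored advantages and returns are the same numbers a recomputation would produce.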
Hi, thank you for pointing this out. Yes, they are recomputed. For now we are training on-policy, so this is fine. We will try to fix this later.
Thank you for the great work! The KL rewards seem to be computed on every call to train_rlhf(). [code]
Both log_probs and ref_log_probs are from the buffer, which means old_rewards is always the same for the same episode? Did I make a mistake?
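The point in the question can be checked with a small sketch. It assumes a common form of the KL-shaped reward, a per-token penalty of -kl_coef * (log_prob - ref_log_prob) with the reward-model score added on the last token; the function name, the kl_coef value, and the buffer layout are hypothetical and are not taken from this repository.

```python
def kl_shaped_rewards(log_probs, ref_log_probs, reward_score, kl_coef=0.1):
    """Per-token KL penalty, plus the scalar reward-model score on the last token."""
    rewards = [-kl_coef * (lp - ref_lp)
               for lp, ref_lp in zip(log_probs, ref_log_probs)]
    rewards[-1] += reward_score
    return rewards


# Hypothetical buffered experience for one episode.
buffer = {
    "log_probs": [-1.2, -0.7, -2.0],
    "ref_log_probs": [-1.0, -0.9, -1.5],
    "reward_score": 0.8,
}

# Two "training calls" that both read the same buffered values.
first = kl_shaped_rewards(**buffer)
second = kl_shaped_rewards(**buffer)
assert first == second  # same inputs from the buffer, same rewards
```

Under these assumptions the recomputation is redundant rather than wrong: every input comes from the buffer, so each call yields identical rewards for a given episode.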