[chatllama] Puzzled about the update of the critic model

While looking at how the value loss is computed in trainer.py (lines 1012-1017), I found this:

value_loss_clipped = old_values + (values - old_values).clamp(-critic_eps_clip, critic_eps_clip)
value_loss1 = (value_loss_clipped - rewards) ** 2
value_loss2 = (values - rewards) ** 2
value_loss = torch.max(value_loss1, value_loss2).mean()

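I think values and rewards are both equal to old_values, because the same model is used to compute all of these scores. If that is the case, both squared terms are zero, so the value loss is zero and the critic never gets updated. I would be very grateful if someone could clear up my confusion.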
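To make the question concrete, here is a minimal sketch of the same computation on plain Python floats instead of tensors. It is not the actual trainer code: the function name `clipped_value_loss` and the sample numbers are made up for illustration. It shows that the loss is exactly zero when values, old_values and rewards coincide, and becomes positive as soon as values moves away from them.

```python
def clipped_value_loss(values, old_values, rewards, critic_eps_clip):
    """Elementwise version of the four lines quoted above, on lists of floats."""
    losses = []
    for v, v_old, r in zip(values, old_values, rewards):
        # Limit the change in the value estimate to [-eps, eps] around the old value.
        delta = max(-critic_eps_clip, min(critic_eps_clip, v - v_old))
        v_clipped = v_old + delta
        loss1 = (v_clipped - r) ** 2  # squared error of the clipped estimate
        loss2 = (v - r) ** 2          # squared error of the unclipped estimate
        losses.append(max(loss1, loss2))
    return sum(losses) / len(losses)


# Hypothetical numbers: the case described in the question, where all three inputs are identical.
same = [0.5, -1.0, 2.0]
zero_loss = clipped_value_loss(same, same, same, critic_eps_clip=0.2)
print(zero_loss)  # 0.0

# Hypothetical numbers: values has drifted from old_values in the first element.
moved = [0.9, -1.0, 2.0]
nonzero_loss = clipped_value_loss(moved, same, same, critic_eps_clip=0.2)
print(nonzero_loss > 0)  # True
```

So the question comes down to whether, in the real training loop, values is ever computed with different parameters or targets than old_values and rewards.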
