nebuly-ai / optimate

A collection of libraries to optimise AI model performances
https://www.nebuly.com/
Apache License 2.0

[chatllama] Puzzled about the update of the critic model #338

Open zhuweipg99 opened 1 year ago

zhuweipg99 commented 1 year ago

When I looked into how the value loss is computed in trainer.py, lines 1012-1017:

value_loss_clipped = old_values + (values - old_values).clamp(-critic_eps_clip, critic_eps_clip)
value_loss1 = (value_loss_clipped - rewards) ** 2
value_loss2 = (values - rewards) ** 2
value_loss = torch.max(value_loss1, value_loss2).mean()

I think values and rewards are both equal to old_values, because the same model is used to compute the scores. I would be very grateful if you could clear up my confusion.
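
To make the question concrete, here is a minimal sketch of the same arithmetic in plain Python (no torch), so the two cases can be compared by hand. The function name `clipped_value_loss` and all the numbers are made up for illustration and are not taken from trainer.py. It only shows what the formula does when `values` equals `old_values` and when it does not.

```python
def clipped_value_loss(values, old_values, rewards, eps_clip):
    """Mean of max((clipped - r)^2, (v - r)^2), mirroring the quoted snippet."""
    losses = []
    for v, v_old, r in zip(values, old_values, rewards):
        # Same as (v - v_old).clamp(-eps_clip, eps_clip) for a single element.
        delta = max(-eps_clip, min(eps_clip, v - v_old))
        v_clipped = v_old + delta
        loss1 = (v_clipped - r) ** 2
        loss2 = (v - r) ** 2
        losses.append(max(loss1, loss2))
    return sum(losses) / len(losses)


# Hypothetical numbers, chosen only so the results are easy to check.
old_values = [0.5, -0.25, 1.0]
rewards = [1.0, 0.0, 0.5]

# Case 1: values are identical to old_values, so the clamp changes nothing
# and the loss is the plain mean squared error against rewards.
same = clipped_value_loss(old_values, old_values, rewards, eps_clip=0.25)
plain_mse = sum((v - r) ** 2 for v, r in zip(old_values, rewards)) / len(rewards)

# Case 2: values have moved further than eps_clip from old_values,
# so the clipped term dominates the max for the elements that moved.
moved = clipped_value_loss([1.0, -0.25, 0.5], old_values, rewards, eps_clip=0.25)

print(same, plain_mse, moved)
```

If values really are equal to old_values, the clipping is a no-op and the loss reduces to the unclipped squared error, which is zero only when rewards also equal values. Whether that actually holds in trainer.py depends on how old_values and rewards are produced there, which this sketch does not answer.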