hholst80 closed this issue 8 years ago
By `(v - R) ** 2 / 2`
I mean just dividing the squared errors by 2, which I think is common, though I'm not sure whether the authors also did so.
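For concreteness, a tiny sketch (not the repo's actual code) of that halved squared error; the `/ 2` is conventional because it cancels the factor of 2 from differentiating the square, leaving a clean gradient of `v - R`:

```python
def value_loss_term(v, R):
    # Halved squared error between the value estimate v and the return R.
    # d/dv [(v - R)**2 / 2] = v - R, so the 1/2 just cleans up the gradient.
    return (v - R) ** 2 / 2
```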
By `v_loss_coef`
I mean a scaling factor for tuning the relative learning rate of v. One of the authors told me they multiplied the gradients of v by 0.5.
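If the authors indeed multiplied the gradients of v by 0.5, that's equivalent (by linearity of differentiation) to multiplying the v loss term itself by 0.5. A minimal illustration with made-up numbers:

```python
v, R = 1.5, 1.0  # made-up numbers for illustration

grad = v - R                      # gradient of (v - R)**2 / 2 w.r.t. v
grad_halved = 0.5 * grad          # "multiply the gradients of v by 0.5"
scaled_loss_grad = 0.5 * (v - R)  # gradient of the 0.5-scaled loss term

assert grad_halved == scaled_loss_grad  # the two views agree
```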
It seems a bit non-standard to scale both the individual terms and the final sum like this. I'm trying it without the 0.5 scaling of the sum, just dividing the terms as you do.
Of course, for a particular instance there could be a reason to balance the two loss functions differently, so the constants are good to have. I was just curious about their default values. Thanks for the clarification!
The loss function `v_loss` is accumulated like `v_loss += (v - R) ** 2 / 2`, but then it is scaled with `v_loss *= self.v_loss_coef`, where `v_loss_coef` is 0.5 by default. Is there a reason why we're scaling it twice, termwise and also in the final sum?
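Putting the two scalings together, here's a minimal sketch (hypothetical names; `values` and `returns` stand in for the rollout's value estimates and returns) of what the accumulated-then-scaled loss works out to:

```python
V_LOSS_COEF = 0.5  # the default mentioned above

def total_value_loss(values, returns):
    # Termwise scaling: each squared error is divided by 2 as it accumulates.
    v_loss = sum((v - R) ** 2 / 2 for v, R in zip(values, returns))
    # Final scaling: the accumulated sum is multiplied by the coefficient.
    return v_loss * V_LOSS_COEF
```

So the net effect of the two scalings is that each step contributes `0.25 * (v - R)**2`, i.e. an effective coefficient of 0.25 on the plain squared error.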