From #4, I can understand why scale is used in the HVP, since the real loss function is scaled. But I don't understand why scale is used again at the end, on lines 507-510.
In the loop, we scale the Hessian down by scale, so the recursion converges to the inverse of the scaled-down Hessian applied to the vector. That is the inverse Hessian-vector product scaled up by scale. The final division corrects for this scaling.
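
To make the scaling argument concrete, here is a minimal NumPy sketch. It is not the repository's code: the function name `lissa_inverse_hvp`, the small dense matrix `H`, and the omission of damping and mini-batching are all assumptions made for illustration. With the update `x <- v + x - (H x) / scale`, the fixed point satisfies `(H / scale) x = v`, so `x = scale * H^{-1} v`, and dividing by `scale` recovers `H^{-1} v`.

```python
import numpy as np

def lissa_inverse_hvp(H, v, scale=10.0, num_steps=500):
    """Hypothetical helper: approximate H^{-1} v with a scaled recursion.

    Assumes H is symmetric positive definite with eigenvalues below
    2 * scale, so the recursion converges. No damping is used here.
    """
    estimate = v.copy()
    for _ in range(num_steps):
        # The Hessian-vector product is scaled DOWN by `scale` ...
        estimate = v + estimate - (H @ estimate) / scale
    # ... so the fixed point is scale * H^{-1} v. Keep it for comparison.
    unscaled = estimate
    # The final division undoes the scaling and yields H^{-1} v.
    return estimate / scale, unscaled

# Toy positive definite "Hessian" and vector (illustrative values only).
H = np.array([[2.0, 0.5], [0.5, 1.0]])
v = np.array([1.0, -1.0])
scale = 10.0

approx, unscaled = lissa_inverse_hvp(H, v, scale=scale)
exact = np.linalg.solve(H, v)

print(np.allclose(approx, exact))            # result after the final division
print(np.allclose(unscaled, scale * exact))  # result before the final division
```

Without the last division, the estimate would be too large by a factor of scale, which is what the second comparison checks.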