Closed: MezereonXP closed this issue 1 year ago.
I have understood the logic! I will close this issue and present the details of that part below.
Notation:

- Synthetic data: $s_1, s_2, \dots, s_n$
- Real data: $(x, y)$
- Initial parameters: $w_0$
Use gradient descent after each synthetic sample, where $L_t$ denotes the loss on $s_t$ evaluated at $w_{t-1}$:

$$w_t = w_{t-1} - \eta_{t-1} \nabla_{w_{t-1}} L_t$$
As for the real data $(x, y)$, the corresponding loss at $w_n$ is:

$$L = L(f(x; w_n), y)$$

We can use this loss $L$ to update the synthetic data.
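To make the setup concrete, here is a minimal PyTorch sketch of the forward unroll under some simplifying assumptions: a toy linear model with a squared loss, a fixed learning rate, and made-up tensors standing in for the synthetic and real data. It only illustrates the equations above and is not the repo's actual `Trainer.forward` code.

```python
import torch

torch.manual_seed(0)
d = 5
syn = [torch.randn(d, requires_grad=True) for _ in range(3)]   # s_1, ..., s_n (learnable)
syn_y = [torch.randn(1) for _ in range(3)]                     # fixed synthetic targets
x, y = torch.randn(d), torch.randn(1)                          # one real sample (x, y)
lr = 0.1                                                       # eta, kept constant here

params = [torch.zeros(d, requires_grad=True)]                  # w_0
for s, t in zip(syn, syn_y):
    loss_t = ((params[-1] @ s - t) ** 2).sum()                 # L_t, evaluated at w_{t-1}
    gw, = torch.autograd.grad(loss_t, params[-1], create_graph=True)
    params.append(params[-1] - lr * gw)                        # w_t = w_{t-1} - eta * gw

L = ((params[-1] @ x - y) ** 2).sum()                          # real-data loss at w_n
```

The `create_graph=True` flag keeps each update differentiable, so the final loss $L$ is still a function of every synthetic sample.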
As for $s_n$, we need its gradient to update it:

$$\frac{\partial L}{\partial s_n} = \frac{\partial L}{\partial w_n} \cdot \frac{\partial w_n}{\partial s_n}$$
Since

$$w_n = w_{n-1} - \eta_{n-1} \nabla_{w_{n-1}} L_n$$

and $L_n$ (the loss on $s_n$ at $w_{n-1}$) is the only term that depends on $s_n$, we have:

$$\frac{\partial w_n}{\partial s_n} = \frac{\partial}{\partial s_n} \left( -\eta_{n-1} \nabla_{w_{n-1}} L_n \right)$$
The `hvp_grad` is the gradient $\frac{\partial L}{\partial s_n}$ (when we only consider the synthetic data, without learnable labels and learning rates).
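Continuing the toy sketch above, $\frac{\partial L}{\partial s_n}$ can be obtained in two equivalent ways: by letting autograd differentiate through the whole unrolled graph, or by doing the last step by hand as a Hessian-vector product, which is essentially the role of `hvp_grad` described here. The names are hypothetical and this is not the repo's exact code.

```python
# 1) Let autograd apply the full chain rule through the unrolled graph.
direct = torch.autograd.grad(L, syn[-1], retain_graph=True)[0]

# 2) Do the last step by hand: dL/ds_n = -eta * d(gw_n . dgw)/ds_n,
#    a Hessian-vector product of L_n with the vector dgw = dL/dw_n.
dgw, = torch.autograd.grad(L, params[-1], retain_graph=True)       # dL/dw_n
L_n = ((params[-2] @ syn[-1] - syn_y[-1]) ** 2).sum()              # L_n at w_{n-1}
gw_n, = torch.autograd.grad(L_n, params[-2], create_graph=True)    # grad of L_n w.r.t. w_{n-1}
hvp, = torch.autograd.grad(gw_n @ dgw, syn[-1], retain_graph=True) # d(gw_n . dgw)/ds_n
manual = -lr * hvp

print(torch.allclose(direct, manual))                              # expected: True
```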
Hello, I have cloned this repo and tried to understand the code. However, I have found some weird things in the `Trainer` class of `train_distilled_image.py`. That `Trainer` class has a method named `backward`.
The `params` and `gws` come from the forward function, but they have different lengths! In the `backward` function, the call `zip(steps, params, gws)` returns a shorter sequence: it ignores the final element of `params` (see the toy example below).

Question-1: Is that a mistake? Will that final element of `params` affect the training?
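For reference, the truncation itself is just standard Python `zip` behaviour: `zip` stops at the shortest input, so the extra final element of `params` is silently dropped. A toy illustration with made-up values:

```python
steps  = ["step0", "step1"]          # placeholder values, not the repo's objects
params = ["w0", "w1", "w2"]          # one more entry than gws
gws    = ["gw0", "gw1"]
print(list(zip(steps, params, gws)))
# [('step0', 'w0', 'gw0'), ('step1', 'w1', 'gw1')]  -> 'w2' never appears
```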
In the `backward` function, in the first iteration, the first `w` is `params[-2]`, and `hvp_grad` contains the gradient of `gw` with respect to `params[-2]`. However, the first `dgw` is the gradient of the loss with respect to `params[-1]`. I cannot fully understand the meaning of `hvp_grad` (Newton's method?).

Question-2: The logic of `hvp_grad` is hard to understand. Could you please explain the details of those gradients?
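To illustrate what those per-step gradients could look like, here is a self-contained toy sketch of a reverse pass over the unrolled updates, with a `dgw`-like and an `hvp_grad`-like quantity computed at each step. The model, names, and shapes are hypothetical; this reconstructs the general idea, not the repo's actual `backward` implementation.

```python
import torch

torch.manual_seed(0)
d, lr = 5, 0.1
syn = [torch.randn(d, requires_grad=True) for _ in range(3)]   # hypothetical s_1..s_n
syn_y = [torch.randn(1) for _ in range(3)]                     # fixed synthetic targets
x, y = torch.randn(d), torch.randn(1)                          # one real sample

# Forward unroll: params gets n + 1 entries, gws gets n.
params, gws = [torch.zeros(d, requires_grad=True)], []
for s, t in zip(syn, syn_y):
    inner_loss = ((params[-1] @ s - t) ** 2).sum()
    gw, = torch.autograd.grad(inner_loss, params[-1], create_graph=True)
    gws.append(gw)
    params.append(params[-1] - lr * gw)

# Real-data loss and the first dgw = dL/dw_n.
L = ((params[-1] @ x - y) ** 2).sum()
dgw, = torch.autograd.grad(L, params[-1], retain_graph=True)

syn_grads = []
# Walk the steps in reverse.  params[-1] has no gw of its own, so each step is
# paired only with the weights it started from (params[0..n-1]).
for s, w, gw in reversed(list(zip(syn, params, gws))):
    # Hessian-vector products d(gw . dgw)/d(s, w): the hvp_grad-style quantities.
    hvp_s, hvp_w = torch.autograd.grad(gw @ dgw, (s, w), retain_graph=True)
    syn_grads.append(-lr * hvp_s)      # dL/ds for this step's synthetic sample
    dgw = dgw - lr * hvp_w             # propagate dL/dw one step further back
syn_grads.reverse()

# Sanity check against plain autograd through the whole unrolled graph.
for s, g in zip(syn, syn_grads):
    print(torch.allclose(torch.autograd.grad(L, s, retain_graph=True)[0], g))
```

In this sketch the final weights `params[-1]` enter only through the initial `dgw` and are never paired with a step, which matches the `zip` truncation asked about in Question-1.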