Hi, I found this code confusing: why is the output `gw`? Is the synthetic data updated in `accumulate_grad()`? And why is there an `optimizer.step()` in the function `train()`?
Consider the computation graph for each GD step; `gw` (the gradient with respect to the model weights) is the output of that step. I don't see the synthetic data being updated in `accumulate_grad()`, which simply gathers the gradients (of the synthetic data) from all GD steps. `optimizer.step()` updates the synthetic data according to the distillation loss, so it is naturally called in `train()`, which trains the synthetic data.
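To make the flow concrete, here is a minimal PyTorch-style sketch of that pattern, not the repo's actual code: the names (`syn_x`, `gd_step`, `train`) and shapes are hypothetical, and a plain linear model stands in for the real network.

```python
import torch
import torch.nn.functional as F

# The synthetic data is the leaf tensor being trained (hypothetical setup).
syn_x = torch.randn(10, 1, 28, 28, requires_grad=True)
syn_y = torch.randint(0, 10, (10,))
optimizer = torch.optim.SGD([syn_x], lr=0.1)  # optimizes the data, not the model

def gd_step(w, lr=0.01):
    """One inner GD step on the synthetic data; its output is gw, the
    gradient w.r.t. the model weights, kept differentiable so the
    distillation loss can later backprop into syn_x."""
    logits = syn_x.flatten(1) @ w
    loss = F.cross_entropy(logits, syn_y)
    (gw,) = torch.autograd.grad(loss, w, create_graph=True)
    return w - lr * gw  # functional update keeps the graph through gw

def train(real_x, real_y, steps=3):
    w = torch.randn(784, 10, requires_grad=True)
    for _ in range(steps):  # unrolled inner GD steps
        w = gd_step(w)
    # Distillation loss: how well the GD-trained weights fit real data.
    dist_loss = F.cross_entropy(real_x.flatten(1) @ w, real_y)
    optimizer.zero_grad()
    dist_loss.backward()  # gathers gradients of syn_x from all GD steps
    optimizer.step()      # the synthetic data is updated here, in train()

# One distillation update on a batch of real data:
train(torch.randn(64, 1, 28, 28), torch.randint(0, 10, (64,)))
```

The `create_graph=True` is what makes `gw` part of the computation graph, so the distillation loss can differentiate through every inner step; `backward()` only accumulates gradients into `syn_x`, and the actual update happens in `optimizer.step()` inside `train()`.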
Got it! Thanks for your reply!!
Closed: Ke-Hao closed this issue 1 year ago.