ligeng0197 closed this issue 2 years ago.
Thank you for reading our paper!
Indeed, that sentence is wrong... We should have updated our paper. Contrary to that sentence, the first-order gradient is non-zero because the loss function is directly modulated by $\phi$. Therefore, we can apply a first-order method to approximate the meta-gradient.
However, our implementation uses the full second-order method. This is done automatically in TensorFlow, whereas in PyTorch you would need a library (e.g., higher) or have to call torch.autograd.grad manually. I guess you are familiar with PyTorch, and maybe that is why you asked the question.
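For reference, here is a minimal toy sketch of that PyTorch route using torch.autograd.grad with create_graph=True. It is not our code: the losses, shapes, and learning rate are made up purely to illustrate how the meta-gradient w.r.t. $\phi$ flows through the inner update.

```python
import torch

# Toy sketch (not the paper's model): phi is the meta-parameter of a noise
# generator that modulates the features, theta is the inner-loop parameter.
theta = torch.randn(5, requires_grad=True)
phi = torch.randn(5, requires_grad=True)
x_tr, x_te = torch.randn(8, 5), torch.randn(8, 5)

def inner_loss(theta, phi, x):
    # phi directly modulates the features used by the inner (meta-train) loss
    return ((x * torch.sigmoid(phi)) @ theta).pow(2).mean()

def outer_loss(theta, x):
    # in this toy example the meta-test loss depends on phi only through theta
    return (x @ theta).pow(2).mean()

# Inner step: create_graph=True keeps the graph, so theta_prime remains a
# differentiable function of phi (this is the bookkeeping `higher` automates).
g_theta = torch.autograd.grad(inner_loss(theta, phi, x_tr), theta, create_graph=True)[0]
theta_prime = theta - 0.1 * g_theta

# Outer step: differentiating the meta-test loss through theta_prime w.r.t. phi
# yields the second-order term; without create_graph=True above, phi would
# receive no gradient from this loss.
meta_grad_phi = torch.autograd.grad(outer_loss(theta_prime, x_te), phi)[0]
```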
Thank you! Hae Beom Lee
FYI, we have tried the first-order approximation, but its performance is much lower than our current implementation.
Thanks for your prompt response! (That was quite fast) ^_^ The version of the paper I read made me think that the noise generator relies only on the loss (or log-probability) of D^te for its update. In that case, I cannot figure out how to calculate the second-order derivative. Does it also rely on the inner optimization (i.e., maintaining a computation graph, as in PyTorch)?
Yes, I mainly use PyTorch and am not really familiar with the higher-order derivative mechanism of TensorFlow. >_<
P.S. Where can I get the latest version of your paper? The version downloaded from OpenReview still seems to contain the sentence I quoted.
BTW, I am not sure the implementation details of the \mu function are listed in the paper version I read (this may be due to my carelessness). Would you mind giving me some pointers about it?
Yes, the second-order derivative depends on the computational graph of the inner optimization. In TensorFlow, the computational graph is created automatically, so the second-order derivative can refer to it.
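Our released code is different, but as a purely illustrative TF2 sketch (toy names and losses, mirroring the PyTorch example above), the same mechanism corresponds to nested GradientTapes, where the outer tape records the inner gradient computation:

```python
import tensorflow as tf

# Toy sketch (not the repository code): same setup as the PyTorch example.
phi = tf.Variable(tf.random.normal([5]))
theta = tf.Variable(tf.random.normal([5]))
x_tr, x_te = tf.random.normal([8, 5]), tf.random.normal([8, 5])

with tf.GradientTape() as outer_tape:
    with tf.GradientTape() as inner_tape:
        # inner (meta-train) loss, modulated by phi
        inner_loss = tf.reduce_mean(
            tf.square(tf.linalg.matvec(x_tr * tf.sigmoid(phi), theta)))
    # the outer tape records this gradient computation, which is what makes
    # the second-order derivative available
    g_theta = inner_tape.gradient(inner_loss, theta)
    theta_prime = theta - 0.1 * g_theta
    # outer (meta-test) loss, evaluated with the adapted parameters
    outer_loss = tf.reduce_mean(tf.square(tf.linalg.matvec(x_te, theta_prime)))

meta_grad_phi = outer_tape.gradient(outer_loss, phi)
```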
Sorry, we haven't updated the paper yet... I will update it ASAP.
For the implementation details of \mu, I recommend going through the actual implementation, which should be quite straightforward to read. Please see layers.py, lines 20-34.
Thank you!
Hi, Lee! Glad to read your inspiring paper. I have a small problem understanding this sentence in your paper: "Lastly, to compute the gradient of Eq. 4 w.r.t. φ, we must compute second-order derivative, otherwise the gradient w.r.t. φ will always be zero." (Section 3.1, the paragraph below Equation (4)). May I ask why the gradient would be zero? I think Eq. 4 describes a function of φ, and the algorithm in the appendix does not seem to use a second-order derivative to update φ. Thanks in advance!