haebeom-lee / metadrop

Tensorflow implementation of "Meta Dropout: Learning to Perturb Latent Features for Generalization" (ICLR 2020)

Question about paper. #6

Closed ligeng0197 closed 2 years ago

ligeng0197 commented 2 years ago

Hi, Lee! I was glad to read your inspiring paper. I had a little trouble understanding this sentence in Section 3.1, in the paragraph below equation (4): "Lastly, to compute the gradient of Eq. 4 w.r.t. φ, we must compute second-order derivative, otherwise the gradient w.r.t. φ will always be zero." May I ask why the gradient would be zero? I think Eq. (4) describes a function of φ, and the algorithm in the appendix does not seem to use second-order derivatives to update φ. Thanks in advance!

haebeom-lee commented 2 years ago

Thank you for reading our paper!

Indeed, that sentence is wrong... we should have updated our paper. Contrary to what it says, the first-order gradient is non-zero because the loss function is directly modulated by $\phi$. Therefore, we can apply a first-order method to approximate the meta-gradient.

However, our implementation uses the full second-order method. This is handled automatically in TensorFlow, whereas in PyTorch you would need a library (e.g. higher) or manual calls to torch.autograd. I guess you are mainly familiar with PyTorch, and maybe that's why you asked the question.
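
In case it helps, here is a minimal sketch of the manual torch.autograd route. This is not our actual code: the names (phi, theta) and the toy quadratic losses are made up purely for illustration. The key point is create_graph=True on the inner gradient, which keeps the inner update differentiable so that the outer gradient w.r.t. phi picks up the second-order term.

```python
import torch

# Illustrative (hypothetical) parameters and data -- not the repo's model.
phi = torch.randn(5, requires_grad=True)    # meta-parameter (e.g. the noise generator)
theta = torch.randn(5, requires_grad=True)  # task parameter adapted in the inner loop
x_support, x_query = torch.randn(5), torch.randn(5)
inner_lr = 0.1

# Inner step on the support loss. create_graph=True keeps this step
# differentiable, so the meta-gradient can flow through it.
support_loss = ((theta + phi) * x_support).pow(2).sum()
grad_theta = torch.autograd.grad(support_loss, theta, create_graph=True)[0]
theta_adapted = theta - inner_lr * grad_theta

# Query loss with the adapted parameters; its gradient w.r.t. phi contains
# both the direct path (phi appears in the loss) and the second-order path
# (phi influenced the inner update).
query_loss = ((theta_adapted + phi) * x_query).pow(2).sum()
meta_grad_phi = torch.autograd.grad(query_loss, phi)[0]

# First-order approximation: use grad_theta.detach() in the inner update
# instead, which drops the second-order path but keeps the direct one.
```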

Thank you! Hae Beom Lee

haebeom-lee commented 2 years ago

FYI, we have tried the first-order approximation, but the performance is much lower than with our current implementation.

ligeng0197 commented 2 years ago

Thanks for your prompt response! (That's quite fast) ^_^ The version of the paper I read led me to think the noise generator is updated only from the loss (or log p) of D^te. In that case, I cannot figure out how to calculate the second-order derivative. Does it also rely on the inner optimization (i.e., maintaining a computation graph, as in PyTorch)?

Yes, I mainly use PyTorch and am not really familiar with TensorFlow's higher-order derivative mechanism. >_<

P.S. Where can I get the latest version of your paper? The version downloaded from OpenReview still seems to contain the sentence I quoted.

ligeng0197 commented 2 years ago

BTW, I am not sure the implementation details of the \mu function are given in the paper version I read (which may be my own carelessness). Would you mind giving me some pointers about it?

haebeom-lee commented 2 years ago
  1. Yes, the second-order derivative depends on the computational graph of the inner optimization. In TensorFlow, that graph is built automatically, so the second-order derivative can flow through it (see the sketch after this list).

  2. Sorry, we haven't updated the paper yet... I will update it ASAP.

  3. For the implementation details of \mu, I recommend going through the actual implementation, which should be quite straightforward to read. Please see layers.py, lines 20-34.
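
Regarding point 1, here is a minimal graph-mode sketch of what "built automatically" means. The names, shapes, and toy losses are hypothetical and not taken from layers.py or any other file in this repo; it only shows that the inner SGD step is just more ops in the same graph, so tf.gradients of the query loss w.r.t. phi differentiates through it, which is where the second-order term comes from.

```python
import numpy as np
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()

# Illustrative (hypothetical) variables and toy losses -- not the repo's code.
phi = tf.get_variable('phi', shape=[5])      # meta-parameter
theta = tf.get_variable('theta', shape=[5])  # task parameter
x_support = tf.placeholder(tf.float32, [5])
x_query = tf.placeholder(tf.float32, [5])
inner_lr = 0.1

# Inner SGD step: just more ops added to the same graph.
support_loss = tf.reduce_sum(tf.square((theta + phi) * x_support))
grad_theta = tf.gradients(support_loss, theta)[0]
theta_adapted = theta - inner_lr * grad_theta

# The outer gradient w.r.t. phi traverses the inner step automatically,
# giving the full second-order meta-gradient.
query_loss = tf.reduce_sum(tf.square((theta_adapted + phi) * x_query))
meta_grad_phi = tf.gradients(query_loss, phi)[0]

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    g = sess.run(meta_grad_phi,
                 {x_support: np.random.randn(5).astype(np.float32),
                  x_query: np.random.randn(5).astype(np.float32)})
    print(g)
```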

Thank you!