Closed: liangbright closed this issue 3 years ago.
Thanks for the interest. Adam uses an element-wise preconditioner and thus is easier to implement as a universal optimizer. But PSGD cares about the parameter spaces in order to learn the preconditioner on a Lie group (see the paper https://openreview.net/forum?id=Bye5SiAqKX). This makes it more difficult to wrap as a universal optimizer, since the optimizer has no idea how the model is defined.
If one only uses standard models like CNNs and RNNs, one can rearrange the gradients of each kernel and its bias into a larger matrix and then use PSGD. This seems doable ... Otherwise, one may need to do quite a bit of low-level coding work; a practical example is https://github.com/lixilinx/psgd_tf/blob/master/neural_machine_translation_with_attention.py
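A minimal sketch of that rearrangement for a single dense layer, with hypothetical variable names (not code from the PSGD repositories): the kernel gradient and bias gradient are concatenated into one augmented matrix matching the affine transform [W | b].

```python
import torch

# Hypothetical sketch: merge a dense layer's kernel gradient and bias gradient
# into one augmented matrix [dW | db], the matrix form a matrix-shaped
# preconditioner such as PSGD's could act on for the affine transform [W | b].
W = torch.randn(20, 10, requires_grad=True)   # kernel
b = torch.randn(20, requires_grad=True)       # bias
x = torch.randn(10)
loss = (W @ x + b).square().sum()
loss.backward()

# gradient of the affine transform matrix [W | b], shape (20, 11)
grad_affine = torch.cat([W.grad, b.grad.unsqueeze(1)], dim=1)
```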
It should be possible, but with a lot of extra work. A couple of related examples:
The KFAC optimizer uses forward/backward hooks to capture activations and examines the module type to determine the layer type, so that preconditioning can be layer-specific (you have to follow the logic from the top-level main.py); see the hook sketch after this list.
Another example of an optimizer doing multiple stages of work per iteration is lbfgs.py.
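Here is a small sketch of the hook idea in PyTorch (not the actual KFAC code): forward hooks capture layer inputs, backward hooks capture output gradients, and the module type decides which layers get layer-specific treatment.

```python
import torch
import torch.nn as nn

# Sketch only: capture per-layer inputs and output gradients via hooks,
# branching on the module type so preconditioning could be layer-specific.
captured_inputs, captured_grads = {}, {}

def register_hooks(model):
    for module in model.modules():
        if isinstance(module, (nn.Linear, nn.Conv2d)):  # layer-type check
            module.register_forward_hook(
                lambda m, inp, out: captured_inputs.__setitem__(m, inp[0].detach()))
            module.register_full_backward_hook(
                lambda m, gin, gout: captured_grads.__setitem__(m, gout[0].detach()))

model = nn.Sequential(nn.Linear(10, 20), nn.ReLU(), nn.Linear(20, 2))
register_hooks(model)
loss = model(torch.randn(4, 10)).sum()
loss.backward()  # hooks fire here; captured_* now holds per-layer statistics
```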
Hi Yaroslavvb, thanks for your pointers, very relevant. Yes, this is workable, but it seems like quite a lot of work and also brings some limitations. The KFAC in TensorFlow also targets CNNs.
I use GRU/LSTM a lot. My simple solution is just to put together the gradients of a kernel and its bias as returned by the CUDA implementations, for example treating [gradient_of_recurrent_kernel | gradient_of_recurrent_bias] as the gradient of the affine transform matrix [recurrent_kernel | recurrent_bias]. Ad hoc, but simple, and it works for me. Packaging PSGD as a general-purpose optimizer library seems very challenging.
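A sketch of that ad hoc trick, assuming PyTorch's LSTM parameter layout (weight_hh_l0 and bias_hh_l0 as the recurrent kernel and bias); the concatenated gradient is what would be handed to PSGD as a single matrix.

```python
import torch
import torch.nn as nn

# Sketch under the assumption of PyTorch's recurrent parameter naming:
# treat [recurrent_kernel | recurrent_bias] as one affine transform matrix
# and build its gradient by concatenation.
rnn = nn.LSTM(input_size=8, hidden_size=16)
out, _ = rnn(torch.randn(5, 3, 8))   # (seq_len, batch, input_size)
out.sum().backward()

grad_recurrent = torch.cat(
    [rnn.weight_hh_l0.grad, rnn.bias_hh_l0.grad.unsqueeze(1)], dim=1
)  # gradient of [recurrent_kernel | recurrent_bias], shape (4*16, 16 + 1)
```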
Hi liangbright, just to inform you that I am currently trying to come up with a few general-purpose preconditioners. Their usage should be as simple as Adam's.
Great work.
Could you write it as an optimizer, like Adam, so that it can be used as a replacement for the optimizers in PyTorch?