Closed: liangbright closed this issue 3 years ago.
Thanks for the interest. Adam uses an element-wise preconditioner and thus is easier to implement as a universal optimizer. But PSGD cares about the parameter spaces in order to learn the preconditioner on a Lie group (see the paper https://openreview.net/forum?id=Bye5SiAqKX). This makes it more difficult to wrap as a universal optimizer, since the optimizer has no idea how the model is defined.
If one only uses standard models like CNNs and RNNs, one can rearrange the gradients of each kernel and its bias into a larger matrix and then use PSGD. This seems doable ... Otherwise, one may need to do quite a bit of low-level coding work; a practical example is https://github.com/lixilinx/psgd_tf/blob/master/neural_machine_translation_with_attention.py
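A minimal sketch of that rearrangement for a single dense layer, with hypothetical variable names (not code from the PSGD repositories): the kernel gradient and bias gradient are concatenated into one augmented matrix matching the affine transform [W | b].

```python
import torch

# Hypothetical sketch: merge a dense layer's kernel gradient and bias gradient
# into one augmented matrix [dW | db], the matrix form a matrix-shaped
# preconditioner such as PSGD's could act on for the affine transform [W | b].
W = torch.randn(20, 10, requires_grad=True)   # kernel
b = torch.randn(20, requires_grad=True)       # bias
x = torch.randn(10)
loss = (W @ x + b).square().sum()
loss.backward()

# gradient of the affine transform matrix [W | b], shape (20, 11)
grad_affine = torch.cat([W.grad, b.grad.unsqueeze(1)], dim=1)
```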
It should be possible, but with a lot of extra work. A couple of related examples:
The KFAC optimizer uses forward/backward hooks to capture activations and examines the module type to determine the layer type, so that preconditioning can be layer-specific (you have to follow the logic from the top-level main.py); see the hook sketch after this list.
Another example of an optimizer doing multiple stages of work per iteration is lbfgs.py.
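Here is a small sketch of the hook idea in PyTorch (not the actual KFAC code): forward hooks capture layer inputs, backward hooks capture output gradients, and the module type decides which layers get layer-specific treatment.

```python
import torch
import torch.nn as nn

# Sketch only: capture per-layer inputs and output gradients via hooks,
# branching on the module type so preconditioning could be layer-specific.
captured_inputs, captured_grads = {}, {}

def register_hooks(model):
    for module in model.modules():
        if isinstance(module, (nn.Linear, nn.Conv2d)):  # layer-type check
            module.register_forward_hook(
                lambda m, inp, out: captured_inputs.__setitem__(m, inp[0].detach()))
            module.register_full_backward_hook(
                lambda m, gin, gout: captured_grads.__setitem__(m, gout[0].detach()))

model = nn.Sequential(nn.Linear(10, 20), nn.ReLU(), nn.Linear(20, 2))
register_hooks(model)
loss = model(torch.randn(4, 10)).sum()
loss.backward()  # hooks fire here; captured_* now holds per-layer statistics
```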
Hi Yaroslavvb, thanks for your pointers, very relevant. Yes, this is workable, but it seems like quite a lot of work and also brings some limitations. The KFAC in TensorFlow also targets CNNs.
I use GRU/LSTM a lot. My simple solution is just to put together the gradients of a kernel and its bias as returned by the CUDA implementations, for example treating [gradient_of_recurrent_kernel | gradient_of_recurrent_bias] as the gradient of the affine transform matrix [recurrent_kernel | recurrent_bias]. Ad hoc, but simple, and it works for me. Packaging PSGD as a general-purpose optimizer library seems very challenging.
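A sketch of that ad hoc trick, assuming PyTorch's LSTM parameter layout (weight_hh_l0 and bias_hh_l0 as the recurrent kernel and bias); the concatenated gradient is what would be handed to PSGD as a single matrix.

```python
import torch
import torch.nn as nn

# Sketch under the assumption of PyTorch's recurrent parameter naming:
# treat [recurrent_kernel | recurrent_bias] as one affine transform matrix
# and build its gradient by concatenation.
rnn = nn.LSTM(input_size=8, hidden_size=16)
out, _ = rnn(torch.randn(5, 3, 8))   # (seq_len, batch, input_size)
out.sum().backward()

grad_recurrent = torch.cat(
    [rnn.weight_hh_l0.grad, rnn.bias_hh_l0.grad.unsqueeze(1)], dim=1
)  # gradient of [recurrent_kernel | recurrent_bias], shape (4*16, 16 + 1)
```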
Hi liangbright, just to inform you that I am currently trying to come up with a few general-purpose preconditioners. Their usage should be as simple as Adam's.
Great work.
Could you write it as an optimizer, like Adam, so that it can be used as a replacement for the optimizers in PyTorch?