prolearner / hypertorch

MIT License

Optimizers of inner problem and fixed point maps in approximate implicit hypergradient computations do not necessarily match #8

Closed dionman closed 2 years ago

dionman commented 2 years ago

In some of the examples (e.g. in iMAML) I notice that the inner optimization uses a different fixed point map than the one passed to the approximate implicit hypergradient function call (e.g. both use GradientDescent steps, but with different learning rates). Is it required that the same fixed point map be used in both places, or is there no such necessity? I see in the paper that the approximate implicit hypergradient does not use information from the trajectory to the approximate solution of the inner problem (it depends only on the approximate solution itself); however, I'm confused about the semantics of fp_map in the implementation of the implicit hypergradient functions.

prolearner commented 2 years ago

Updated on 22/03/22 since the previous answer was wrong.

In the case where the fixed point map of the bilevel problem is the gradient descent map, any value of the learning rate except 0 works. This is because the fixed point itself does not depend on the specific learning rate: any point where the inner gradient vanishes is a fixed point of the map for every nonzero step size. For this reason you may use different fp_map functions in the inner solver and when computing the hypergradient.
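A minimal NumPy sketch of this point (illustrative only, not the library's own API; the quadratic objective and step sizes are made-up assumptions): the gradient descent map T(w) = w - lr * grad f(w) reaches the same fixed point for every nonzero learning rate that converges.

```python
import numpy as np

# Hypothetical inner objective: f(w) = 0.5 * ||w - b||^2, whose minimizer
# (and hence the fixed point of the gradient descent map) is w = b,
# regardless of the step size used in the map.
b = np.array([1.0, -2.0, 3.0])

def gd_map(w, lr):
    # One step of the gradient descent fixed point map; grad f(w) = w - b.
    return w - lr * (w - b)

# Iterate the map with three different learning rates.
fixed_points = []
for lr in (0.05, 0.5, 1.5):
    w = np.zeros(3)
    for _ in range(500):
        w = gd_map(w, lr)
    fixed_points.append(w)
# All three runs land on the same fixed point w = b.
```

The learning rate only affects how fast the iteration converges, not where it converges to, which is why the fp_map used for the hypergradient need not match the inner solver's.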

dionman commented 2 years ago

How does this generalize to inner fixed point maps corresponding to gradient descent with momentum? Is it correct to combine an Adam optimizer for computing the approximate fixed point with a gradient descent optimizer for the linear system?

prolearner commented 2 years ago

You can use completely different optimisers for the inner problem and the linear system, as long as each solves the right problem. In the case of gradient descent with momentum, you can use different learning rates and momentum parameters for the inner problem and the linear system, just as with plain gradient descent. You can try and see what works best for your application.
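A PyTorch sketch of exactly this split (not hypertorch's API; the toy bilevel problem, names, and step counts are all illustrative assumptions): Adam approximates the inner fixed point, then plain gradient descent solves the implicit-differentiation linear system via Hessian-vector products.

```python
import torch

# Hypothetical toy bilevel problem with a known answer:
#   inner:  w*(lam) = argmin_w 0.5 * ||w - lam||^2
#   outer:  L(lam)  = 0.5 * ||w*(lam) - target||^2
# Analytically w*(lam) = lam, so the exact hypergradient is lam - target.
target = torch.tensor([1.0, 2.0])
lam = torch.tensor([0.5, -1.0], requires_grad=True)

def inner_loss(w, lam):
    return 0.5 * ((w - lam) ** 2).sum()

def outer_loss(w):
    return 0.5 * ((w - target) ** 2).sum()

# 1) Approximate the inner fixed point with Adam.
w = torch.zeros(2, requires_grad=True)
opt = torch.optim.Adam([w], lr=0.1)
for _ in range(2000):
    opt.zero_grad()
    inner_loss(w, lam).backward()
    opt.step()

# 2) Solve (d2f/dw2) v = dL/dw with plain gradient descent, using
#    Hessian-vector products so the Hessian is never formed explicitly.
w_star = w.detach().requires_grad_(True)
g_outer = torch.autograd.grad(outer_loss(w_star), w_star)[0]

def hvp(v):
    g = torch.autograd.grad(inner_loss(w_star, lam), w_star,
                            create_graph=True)[0]
    return torch.autograd.grad(g, w_star, grad_outputs=v)[0]

v = torch.zeros_like(w_star)
for _ in range(100):
    v = v - 0.5 * (hvp(v) - g_outer)  # GD step on 0.5 v'Hv - g'v

# 3) Implicit hypergradient: -(d2f / (dlam dw)) v, again via autograd.
g_inner = torch.autograd.grad(inner_loss(w_star, lam), w_star,
                              create_graph=True)[0]
hypergrad = -torch.autograd.grad(g_inner, lam, grad_outputs=v)[0]
```

The two solvers never see each other's hyperparameters: Adam only needs to return an accurate enough approximation of w*, and the gradient descent iteration on the linear system only needs that approximation, so their learning rates (and momentum settings) are free to differ.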