zanklanecek opened this issue 3 years ago
We have developed a way to tackle hyperparameter tuning, which is discussed in the last paragraph of Sec. 3 of our paper (https://arxiv.org/pdf/1906.02506.pdf). The key trick is to first tune OGN (no variational part) in the same way you would tune Adam, and then move to the variational version (VOGN) afterwards. This often works. Tuning the prior precision, the tempering parameter, and the damping factor will then help you get better uncertainty than deterministic methods such as Adam or OGN.
Setting the tempering parameter tau by starting with very small values and then slowly moving it up to 1 is also very important for finding good hyperparameter settings. Hope this helps.
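As a concrete version of that warm-up, here is a minimal sketch. The argument name `kl_weighting` for the tempering parameter comes from torchsso/optim/vi.py (it is also confirmed later in this thread); the staged loop, the schedule values, and the helpers `model`, `train_loader`, and `train_for_a_few_epochs` are assumptions for illustration, since `kl_weighting` is fixed when the optimizer is constructed.

```python
import torchsso

# Hypothetical warm-up for the tempering parameter tau: start very small,
# then slowly move up to 1, as suggested above. The schedule values are
# placeholders, not tuned settings.
tau_schedule = [0.01, 0.1, 0.5, 1.0]

for tau in tau_schedule:
    # Re-creating the optimizer per stage is an assumption here, since
    # kl_weighting is set at construction time in torchsso.
    optimizer = torchsso.optim.VOGN(model,
                                    dataset_size=len(train_loader.dataset),
                                    kl_weighting=tau)
    # User-defined training routine for this stage of the schedule.
    train_for_a_few_epochs(model, optimizer, train_loader)
```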
Is there a PyTorch implementation of OGN? I haven't been able to find one in the pytorch-sso library. Appendix C in the paper explains that OGN is a deterministic version of VOGN, where one just replaces u_t with w_t. Is there an easy way to do this with the VIOptimizer class?
It should be somewhere in the code. Let me ask @kazukiosawa, but if you find it in the meantime, please let us know. Thanks!
@zanklanecek Thanks for your question! You can train a model with OGN by using the SecondOrderOptimizer class, which is the parent class of VIOptimizer. See here for the correspondence of the arguments between these two classes.
Thanks @emtiyaz and @kazukiosawa for the fast response 🥇. So to sum up:

If you want OGN as the optimizer:

```python
curv_shapes = {"Conv2d": "Kron", "Linear": "Diag"}
curv_kwargs = {"damping": 1e-3, "ema_decay": 0.999}
optimizer = torchsso.optim.SecondOrderOptimizer(model, "Cov", curv_shapes, curv_kwargs)
```

If you want VOGN as the optimizer:

```python
optimizer = torchsso.optim.VOGN(model, dataset_size=len(train_loader.dataset))
```
No dataset_size is needed for OGN?
Actually, for OGN:

```python
curv_shapes = {"Conv2d": "Diag", "Linear": "Diag"}
```

(see the definition of the VOGN class).
right. OGN doesn't need dataset_size
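Putting the corrected pieces together, here is a minimal sketch of an OGN training step. The closure pattern that returns `(loss, output)` mirrors the usage in the library's classification example, but the exact `step` signature is an assumption; `model`, `data`, and `target` are assumed to be defined.

```python
import torch.nn.functional as F
import torchsso

# OGN setup with the corrected curvature shapes from this thread.
curv_shapes = {"Conv2d": "Diag", "Linear": "Diag"}
curv_kwargs = {"damping": 1e-3, "ema_decay": 0.999}
optimizer = torchsso.optim.SecondOrderOptimizer(model, "Cov", curv_shapes, curv_kwargs)

# One training step: the closure computes the loss and gradients,
# and the optimizer applies the (deterministic) OGN update.
def closure():
    optimizer.zero_grad()
    output = model(data)
    loss = F.cross_entropy(output, target)
    loss.backward()
    return loss, output

loss, output = optimizer.step(closure=closure)
```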
@kazukiosawa @emtiyaz just to make clear what the corresponding parameter names are in the code: is the prior precision `init_precision`? Thanks for your help!
@zanklanecek `init_precision` is for the posterior of the weights. You can set the prior variance (1/precision) with `prior_variance` (please check https://github.com/cybertronai/pytorch-sso/blob/master/torchsso/optim/vi.py#L46-L47). `kl_weighting` is the tempering parameter. I hope this helps!
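For reference, a sketch of a VOGN constructor call with those hyperparameters spelled out. The argument names `init_precision`, `prior_variance`, and `kl_weighting` come from the linked vi.py; whether VOGN accepts all of them directly, and the values used, are assumptions for illustration.

```python
import torchsso

# VOGN with the hyperparameters named above; the values are placeholders,
# not tuned settings.
optimizer = torchsso.optim.VOGN(model,
                                dataset_size=len(train_loader.dataset),
                                init_precision=1e-3,  # initial posterior precision
                                prior_variance=1.0,   # = 1 / prior precision (delta)
                                kl_weighting=1.0)     # tempering parameter (tau)
```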
@kazukiosawa I made a typo there. What I want to know is: what is the prior precision (delta) in your article? So tuning the prior precision (delta) is the same as tuning 1/(prior variance)?
I'm sorry, I haven't dug deep into the theory and equations; I just want to try VOGN on my classification task and see where it gets me.
Thanks!
@zanklanecek No problem! Yes, tuning the prior precision (delta) is the same as tuning 1/(prior_variance)!
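In code, that conversion is a one-liner; the delta value below is purely illustrative, not a recommended setting.

```python
# If you have a value for the paper's prior precision (delta), convert it
# before passing it to VOGN, since torchsso takes the variance:
delta = 100.0                 # example prior precision, purely illustrative
prior_variance = 1.0 / delta  # 0.01
```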
Thanks for now, I will try it out and see where I can get. Adam and OGN are both easy to use for training. Hopefully I can achieve similar results with VOGN, which would be wonderful for my research.
I am trying out VOGN with ResNet-18 for binary classification. With the Adam optimizer I am able to find a set of hyperparameters that yields excellent training and test results (AUC > 0.98). When swapping Adam for VOGN, training fails completely (AUC ~ 0.5, training loss not changing). After trying numerous hyperparameter ranges I am still having trouble training the model. Should I keep trying to train the model with VOGN, or is it normal behavior that VOGN doesn't always work?
Thanks!