Closed aosokin closed 3 years ago
Hi Anton, thanks for sharing this (and the related bug yesterday). We are not actively working on this codebase anymore, but would be happy to merge a PR.
Hi @alexpolozov, I also faced the same problem described in this issue. I just opened a pull request with the solution described by @aosokin. I tried it and my results improved; before that, when training from a checkpoint, the results got worse due to the lr reset.
Hi @Muradean, it seems that your code in the PR has a typo. It should use optimizer.param_groups, not optimizer.param_group.
Correct code:
lr_scheduler.param_groups = optimizer.param_groups
My bad. I was working on a different repo where I applied that change and it improved my results, and when preparing this pull request I just copied and pasted.
Typo fixed.
Thank you @Muradean and all! Just merged, closing this issue.
Hi, I think I've found a tricky bug.
Line https://github.com/microsoft/rat-sql/blob/f2e00333d425b3bb3b625a89f77f88d015553a6f/ratsql/commands/train.py#L139 breaks the connection between optimizer and lr_scheduler when actually loading from a checkpoint file (it does not happen when starting fresh training, because no checkpoint exists yet). It ends up calling load_state_dict from torch.optim.Optimizer, which in this line https://github.com/pytorch/pytorch/blob/ee77ccbb6da4e2efd83673e798acf7081bc03564/torch/optim/optimizer.py#L155-L157 creates a new reference to the param groups. Same in the current pytorch: https://github.com/pytorch/pytorch/blob/ec6de6a697668e594a3f1d49e9a87a7c94b6164b/torch/optim/optimizer.py#L185-L187

This can be fixed by adding

lr_scheduler.param_groups = optimizer.param_groups

after calling saver.restore, which is not pretty at all. Maybe there is a better fix?

Best, Anton
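The issue above can be reproduced outside this codebase. The sketch below uses a hypothetical minimal SchedulerStub standing in for the repo's custom scheduler (which stores a direct reference to the optimizer's param groups); it shows that torch.optim.Optimizer.load_state_dict rebuilds the param group dicts, so any externally held reference goes stale, and that re-linking after the restore repairs it:

```python
import torch

# Tiny model and optimizer to produce a checkpoint.
model = torch.nn.Linear(2, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)


class SchedulerStub:
    """Hypothetical stand-in for a scheduler that holds its own
    reference to the optimizer's param groups (as the custom
    scheduler in this repo does)."""

    def __init__(self, optimizer):
        self.param_groups = optimizer.param_groups  # shared reference

    def set_lr(self, lr):
        for group in self.param_groups:
            group["lr"] = lr


scheduler = SchedulerStub(optimizer)

# Simulate restoring from a checkpoint: load_state_dict builds
# fresh param group dicts, silently breaking the shared reference.
checkpoint = optimizer.state_dict()
optimizer.load_state_dict(checkpoint)

scheduler.set_lr(0.01)
# The optimizer no longer sees the scheduler's update.
print(optimizer.param_groups[0]["lr"])  # -> 0.1, not 0.01

# The proposed fix: re-link the references after restoring.
scheduler.param_groups = optimizer.param_groups
scheduler.set_lr(0.01)
print(optimizer.param_groups[0]["lr"])  # -> 0.01
```

Without the re-link, the scheduler keeps updating the stale dicts, so the optimizer silently trains at the initial (reset) learning rate, matching the degraded results reported above.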