Closed aosokin closed 3 years ago
Hi Anton, thanks for sharing this (and the related bug yesterday). We are not actively working on this codebase anymore, but would be happy to merge a PR.
Hi @alexpolozov, I also faced the same problem described in this issue. I just opened a pull request with the solution described by @aosokin. I tried it and my results improved; before that, when training from a checkpoint, the results got worse due to the lr reset.
Hi @Muradean, it seems that your code in the PR has a typo. It should use optimizer.param_groups, not optimizer.param_group.
Correct code:
lr_scheduler.param_groups = optimizer.param_groups
My bad. I was working on a different repo where I applied that change and it improved my results, and when preparing this pull request I just copied and pasted.
Typo fixed.
Thank you @Muradean and all! Just merged, closing this issue.
Hi, I think I've found a tricky bug.
Line https://github.com/microsoft/rat-sql/blob/f2e00333d425b3bb3b625a89f77f88d015553a6f/ratsql/commands/train.py#L139 breaks the connection between optimizer and lr_scheduler when actually loading from a checkpoint file (it does not happen when starting fresh training, because no checkpoint exists yet). It ends up calling load_state_dict from torch.optim.Optimizer, which in this line https://github.com/pytorch/pytorch/blob/ee77ccbb6da4e2efd83673e798acf7081bc03564/torch/optim/optimizer.py#L155-L157 creates a new reference to the param groups. Same in the current pytorch: https://github.com/pytorch/pytorch/blob/ec6de6a697668e594a3f1d49e9a87a7c94b6164b/torch/optim/optimizer.py#L185-L187

This can be fixed by adding

lr_scheduler.param_groups = optimizer.param_groups

after calling saver.restore, which is not pretty at all. Maybe there is a better fix?

Best, Anton
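The issue above can be reproduced outside this codebase. The sketch below uses a hypothetical minimal SchedulerStub standing in for the repo's custom scheduler (which stores a direct reference to the optimizer's param groups); it shows that torch.optim.Optimizer.load_state_dict rebuilds the param group dicts, so any externally held reference goes stale, and that re-linking after the restore repairs it:

```python
import torch

# Tiny model and optimizer to produce a checkpoint.
model = torch.nn.Linear(2, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)


class SchedulerStub:
    """Hypothetical stand-in for a scheduler that holds its own
    reference to the optimizer's param groups (as the custom
    scheduler in this repo does)."""

    def __init__(self, optimizer):
        self.param_groups = optimizer.param_groups  # shared reference

    def set_lr(self, lr):
        for group in self.param_groups:
            group["lr"] = lr


scheduler = SchedulerStub(optimizer)

# Simulate restoring from a checkpoint: load_state_dict builds
# fresh param group dicts, silently breaking the shared reference.
checkpoint = optimizer.state_dict()
optimizer.load_state_dict(checkpoint)

scheduler.set_lr(0.01)
# The optimizer no longer sees the scheduler's update.
print(optimizer.param_groups[0]["lr"])  # -> 0.1, not 0.01

# The proposed fix: re-link the references after restoring.
scheduler.param_groups = optimizer.param_groups
scheduler.set_lr(0.01)
print(optimizer.param_groups[0]["lr"])  # -> 0.01
```

Without the re-link, the scheduler keeps updating the stale dicts, so the optimizer silently trains at the initial (reset) learning rate, matching the degraded results reported above.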