Senwang98 opened 3 years ago
I found that this bug happens when you resume your model: the more times you resume, the smaller the learning rate becomes.
The problem can be solved by replacing get_lr() with get_last_lr().
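As a minimal sketch (plain PyTorch with a MultiStepLR schedule, not this repo's exact code), the difference matters when the schedule is replayed on resume: get_last_lr() reports the learning rate actually stored in the optimizer, while get_lr() can report an extra *gamma in recent PyTorch versions.

```python
# Minimal sketch, assuming a plain MultiStepLR schedule (not the repo's code).
import torch

model = torch.nn.Linear(4, 4)                        # dummy model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[200, 400, 600, 800], gamma=0.5)

# Replay 415 epochs, e.g. to simulate resuming at epoch 415.
for _ in range(415):
    optimizer.step()                                 # optimizer first (PyTorch >= 1.1)
    scheduler.step()

print(scheduler.get_last_lr())                       # [2.5e-05] = 1e-4 * 0.5 ** 2
# get_lr() warns when called outside step() and, at milestone epochs, can
# report an extra *gamma, so log with get_last_lr() instead.
```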
Big bug!!
If you resume your model, you will find that the learning rate gets multiplied by gamma again. This bug is terrible!
Also,
Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule.
This bug should be solved!!!
See `def load(self, load_dir, epoch=1)` in utility.py.
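For reference, here is the calling order the warning asks for, as a tiny self-contained loop (an assumed sketch with a dummy model, not the code in utility.py):

```python
import torch

model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[200, 400], gamma=0.5)
x, y = torch.randn(8, 4), torch.randn(8, 1)

for epoch in range(3):                               # toy loop
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()                                 # step the optimizer first ...
    scheduler.step()                                 # ... then the scheduler, once per epoch
```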
Hello, I ran into the same bug. We only need to replace get_lr() with get_last_lr(), right?
@Doreenqiuyue Yes. To resume the training, you also need to change --decay. For example, if the current epoch is 500 and you want to resume, you need to change --decay to '600-800'. The reason for removing '200-400' is that 500 > 400 > 200. In this repo, the owner did not write the optimizer/scheduler in the PyTorch 1.1 style, so the decay milestones would otherwise be applied to the learning rate multiple times.
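To illustrate why stale milestones re-decay the restored learning rate, here is a sketch of the resume flow as I understand it (assumed, not the repo's exact code): the checkpointed optimizer already holds the decayed lr, and rebuilding the scheduler with the old milestones applies gamma on top of it again.

```python
import torch

model = torch.nn.Linear(4, 4)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
optimizer.param_groups[0]['lr'] = 2.5e-5     # lr restored from a checkpoint at epoch 500
                                             # (1e-4 already halved at epochs 200 and 400)

# Resuming with the unchanged milestones replays the 200/400 decays once more:
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[200, 400, 600, 800], gamma=0.5)
for _ in range(500):
    optimizer.step()
    scheduler.step()
print(scheduler.get_last_lr())               # [6.25e-06] instead of the expected 2.5e-05

# With --decay changed to '600-800', only the future milestones remain and the
# restored lr stays at 2.5e-05 until epoch 600.
```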
Thank you for your reply. The training process is unstable. Can we solve it by adjusting the learning rate?
@Doreenqiuyue Sorry, my training is stable, but when I train RCAN my final result is not good. So you are training RCAN? Which validation set do you use for training?
I did not train RCAN.
Hi, @Doreenqiuyue I think the training instability can be addressed with the methods below (an example command is sketched after the list):
- If you finetune, use a smaller --lr.
- If you train from scratch, a larger --skip_threshold can be useful.
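For concreteness, a hypothetical finetune/resume command (flag names are taken from this thread; the values are placeholders only, not a verified recipe):

```bash
CUDA_VISIBLE_DEVICES=0 python main.py --model RCAN --scale 2 \
    --load RCAN --resume -1 --lr 2.5e-5 --decay '600-800'
```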
Thank you for the suggestions. I will try them.
@1187697147 Where is the --decay option? In option.py I can only find parser.add_argument('--lr_decay', type=int, default=300, help='learning rate decay per N epochs'). When resuming I changed get_lr() to get_last_lr(), but the learning rate still keeps getting multiplied by 0.5. How exactly do I apply the --decay change you described above?
@songyonger The --decay option does exist in option.py. Specifically: 1. Replace get_lr() with get_last_lr(). 2. If you train continuously without interruption, the bug does not appear. If you use resume, say training has reached epoch 415, then --decay should be changed from '200-400-600-800' to '600-800'. That is, every time you resume you need to adjust --decay according to the current epoch to avoid decaying the learning rate again.
OK, thanks!
@songyonger Hi, have you successfully trained the EDSR model? When I test the benchmark datasets (e.g. Set5) with the author's pretrained model, I always get an error. Could you tell me your environment setup and how you set the options in option.py? My PyTorch version is 1.2.0 with CUDA 10.0. I hope to hear back from you.
Hi, when I try to reproduce RCAN, I use
CUDA_VISIBLE_DEVICES=1 nohup python main.py --model RCAN --save RCAN --scale 2 --save_results --n_resgroups 10 --n_resblocks 20 --patch_size 96 --chop --test_every 500 --batch_size 32 --resume -1 --load RCAN
to resume my model, but the displayed learning rate seems to be wrong (it should be 5e-5, but 1.25e-5 is displayed); my --decay is '200-400-600-800', which I think is correct. The bug can also be seen around the epoch-200 milestone: epoch 201 shows 2.5e-5 and epoch 202 shows 5e-5. Has anyone else met this bug?