sanghyun-son / EDSR-PyTorch

PyTorch version of the paper 'Enhanced Deep Residual Networks for Single Image Super-Resolution' (CVPRW 2017)
MIT License
2.42k stars 668 forks

Terrible resume bug!! #296

Open Senwang98 opened 3 years ago

Senwang98 commented 3 years ago

Hi, when I try to reproduce RCAN, I resume my model with `CUDA_VISIBLE_DEVICES=1 nohup python main.py --model RCAN --save RCAN --scale 2 --save_results --n_resgroups 10 --n_resblocks 20 --patch_size 96 --chop --test_every 500 --batch_size 32 --resume -1 --load RCAN`, but the displayed learning rate seems to be wrong (it should be 5e-5, but it displays 1.25e-5). My `--decay '200-400-600-800'` looks correct. The bug can also be seen around the milestone at epoch 200: epoch 201 shows 2.5e-5, then epoch 202 shows 5e-5. Has anyone else met this bug?
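For reference, with a MultiStepLR-style schedule the expected learning rate at any epoch is the base lr times gamma raised to the number of milestones already passed. A minimal sketch of that arithmetic (the 1e-4 base lr and gamma 0.5 are assumptions matching RCAN's usual defaults; the helper name is mine):

```python
def expected_lr(base_lr, gamma, milestones, epoch):
    """Learning rate after `epoch` epochs of a MultiStepLR-style schedule."""
    decays = sum(1 for m in milestones if m <= epoch)  # milestones already passed
    return base_lr * gamma ** decays

# With --decay '200-400-600-800' and gamma 0.5:
milestones = [200, 400, 600, 800]
print(expected_lr(1e-4, 0.5, milestones, 300))  # one decay  -> 5e-05
print(expected_lr(1e-4, 0.5, milestones, 500))  # two decays -> 2.5e-05
```

Any displayed value below these is a sign that gamma has been applied more times than milestones have actually been crossed.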

Senwang98 commented 3 years ago

I found that this bug happens when you resume your model: the more times you resume, the smaller the learning rate becomes.

Senwang98 commented 3 years ago

(screenshot: Snipaste_2020-12-22_14-04-53) This problem can be solved by replacing `get_lr()` with `get_last_lr()`.

Senwang98 commented 3 years ago

Big bug!!

If you resume your model, you will find that the learning rate gets multiplied by gamma again on every resume. This bug is terrible!

Also, PyTorch warns: `Detected call of lr_scheduler.step() before optimizer.step(). In PyTorch 1.1.0 and later, you should call them in the opposite order: optimizer.step() before lr_scheduler.step(). Failure to do this will result in PyTorch skipping the first value of the learning rate schedule.` This should be fixed too; see `def load(self, load_dir, epoch=1)` in utility.py.
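The repeated decay can be reproduced without PyTorch. On resume, the scheduler is rebuilt on top of the saved optimizer lr (which is already decayed) and then stepped once per past epoch, so every past milestone applies gamma a second time. A pure-Python sketch of that failure mode (my own simplification of the resume loop, not the repo's literal code):

```python
GAMMA = 0.5
MILESTONES = [200, 400, 600, 800]

def buggy_resume_lr(saved_lr, epoch):
    """Sketch of the buggy resume: start from the saved (already decayed)
    lr and replay one scheduler step per past epoch, so each milestone
    that was already crossed multiplies by GAMMA again."""
    lr = saved_lr
    for e in range(1, epoch + 1):
        if e in MILESTONES:
            lr *= GAMMA
    return lr

lr = buggy_resume_lr(1e-4, 300)  # fresh run to epoch 300: 5e-05 (correct)
lr = buggy_resume_lr(lr, 300)    # first resume:  2.5e-05 (wrong)
lr = buggy_resume_lr(lr, 300)    # second resume: 1.25e-05, the value in the report
```

In current PyTorch the robust fix is to save and restore `optimizer.state_dict()` and `scheduler.state_dict()` instead of replaying `step()` calls.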

Doreenqiuyue commented 3 years ago

Hello, I met the same bug. Do we only need to replace `get_lr()` with `get_last_lr()`?

Senwang98 commented 3 years ago

@Doreenqiuyue Yes. In addition, to resume training you need to change `--decay`. For example, if the current epoch is 500 and you want to resume, you need to change `--decay` to '600-800'; the reason for removing '200-400' is that 500 > 400 > 200. In this repo, the owner did not write the optimizer in the PyTorch 1.1+ style, so the decay milestones affect the lr multiple times.
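The `--decay` adjustment described above can be automated: keep only the milestones that lie after the epoch you are resuming from. A small helper (hypothetical name, assuming the dash-separated string format this repo uses for `--decay`):

```python
def remaining_decay(decay, current_epoch):
    """Drop milestones already passed.

    e.g. '200-400-600-800' at epoch 500 -> '600-800'
    """
    kept = [m for m in decay.split('-') if int(m) > current_epoch]
    return '-'.join(kept)

print(remaining_decay('200-400-600-800', 500))  # -> 600-800
```

Passing the result as the new `--decay` avoids re-applying the milestones that the previous run already crossed.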

Doreenqiuyue commented 3 years ago


Thank you for your reply. The training process is unstable. Can we solve it by adjusting the learning rate?

Senwang98 commented 3 years ago

@Doreenqiuyue Sorry, my training is stable, but when I train RCAN my final result is not good. So, are you training RCAN? Which validation set do you use during training?

Doreenqiuyue commented 3 years ago


I did not train RCAN.

Senwang98 commented 3 years ago

Hi @Doreenqiuyue, I think unstable training can be solved with the methods below:

  1. If you fine-tune, please use a smaller --lr.
  2. If you train from scratch, a larger --skip_threshold can be useful.
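For context on point 2: as far as I can tell, the trainer in this repo skips the parameter update for a batch whose loss blows up relative to the previous error, so a larger `--skip_threshold` makes that guard more permissive. A sketch of the check (my own simplification; the actual logic lives in trainer.py):

```python
def should_skip_batch(loss, last_error, skip_threshold=1e8):
    """Skip the update when the batch loss exceeds skip_threshold times
    the previous error -- a simple guard against exploding gradients."""
    return loss > skip_threshold * last_error

print(should_skip_batch(5.0, 1.0, skip_threshold=2.0))  # True: 5.0 > 2.0 * 1.0
print(should_skip_batch(1.5, 1.0, skip_threshold=2.0))  # False: update proceeds
```

With the default (very large) threshold almost nothing is skipped; lowering it skips more aggressively, raising it lets noisier batches through.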
Doreenqiuyue commented 3 years ago


Thank you for the suggestions. I will try them.

songyonger commented 3 years ago

@1187697147 Where is the `--decay` argument? option.py only contains `parser.add_argument('--lr_decay', type=int, default=300, help='learning rate decay per N epochs')`. When resuming I changed get_lr() to get_last_lr(), but the learning rate still always gets multiplied by 0.5. How exactly do I apply the `--decay` change you described above?

Senwang98 commented 3 years ago

@songyonger option.py does have `--decay`; it cannot be missing. Concretely: 1. Change get_lr() to get_last_lr(). 2. If you train continuously without interruption, the bug does not appear. If you use resume, say training has reached epoch 415, then change `--decay` from '200-400-600-800' to '600-800'. That is, every time you resume you need to adjust `--decay` according to the current epoch to avoid repeated decay.

songyonger commented 3 years ago

OK, thank you!

Liiiiaictx commented 7 months ago

@songyonger Hello, have you successfully trained the EDSR model? Whenever I test the author's pretrained model on benchmark datasets (e.g. Set5) I get errors. Could you tell me what your environment setup is and how the parameters in option.py should be set? My PyTorch version is 1.2.0 with CUDA 10.0. I hope to hear back from you.