plemeri / InSPyReNet

Official PyTorch implementation of Revisiting Image Pyramid Structure for High Resolution Salient Object Detection (ACCV 2022)

Error when resuming training #17

Closed yaju1234 closed 1 year ago

yaju1234 commented 1 year ago

I trained the model on my own 62k human dataset (similar to DUTS) for 60 epochs. The result was not what I wanted, so I decided to resume training with more epochs, but I got the following error when resuming:

Traceback (most recent call last):
  File "run/Train.py", line 178, in <module>
    train(opt, args)
  File "run/Train.py", line 140, in train
    optimizer.step()
  File "/root/inspyrenet/venv/lib/python3.6/site-packages/torch/optim/lr_scheduler.py", line 65, in wrapper
    return wrapped(*args, **kwargs)
  File "/root/inspyrenet/venv/lib/python3.6/site-packages/torch/optim/optimizer.py", line 88, in wrapper
    return func(*args, **kwargs)
  File "/root/inspyrenet/venv/lib/python3.6/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/root/inspyrenet/venv/lib/python3.6/site-packages/torch/optim/adam.py", line 144, in step
    eps=group['eps'])
  File "/root/inspyrenet/venv/lib/python3.6/site-packages/torch/optim/functional.py", line 98, in adam
    param.addcdiv_(exp_avg, denom, value=-step_size)
RuntimeError: value cannot be converted to type float without overflow: (-9.99425e-08,-1.86689e-11)


plemeri commented 1 year ago

Hi, we implemented the resuming feature for unexpected shutdowns or other accidents, so the loaded state couples everything else, such as the learning rate decay, optimizer, and scheduler, to the 60-epoch setting.
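As an illustration of why this fails (a minimal sketch, not the repository's exact scheduler code, and poly_lr is a hypothetical name): a polynomial learning-rate decay parameterised for 60 total epochs breaks once training resumes past epoch 60, because the decay base goes negative and a negative float raised to a fractional power becomes a complex number in Python, which would match the complex-looking pair (-9.99425e-08, -1.86689e-11) in the RuntimeError above.

```python
# Minimal sketch under the assumption of a polynomial LR decay; poly_lr and its
# parameters are illustrative, not the repository's actual implementation.
def poly_lr(base_lr, epoch, total_epochs, power=0.9):
    # (1 - epoch / total_epochs) stays positive only while epoch <= total_epochs.
    return base_lr * (1 - epoch / total_epochs) ** power

print(poly_lr(1e-5, 59, 60))   # small positive LR near the end of the 60-epoch schedule
print(poly_lr(1e-5, 61, 60))   # complex value once the resumed run passes epoch 60
```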

I would recommend training from the beginning with the 200-epoch setting. Thanks.