Hello,

I am trying to re-train your model. I was able to pre-train the model on the SyntheticSR dataset for 500k iterations, and everything works fine. However, when I switch to fine-tuning the model on the BurstSR dataset, the training crashes.
Here is the error log:
.......
[train: 1, 1000 / 1000] FPS: 2.4 (10.7) , Loss/total: 0.03915 , Loss/rgb: 0.03915 , Loss/raw/rgb: 0.00391 , Stat/psnr: 46.35584
Training crashed at epoch 1
Traceback for the error!
Traceback (most recent call last):
File "/cluster/.../deep-rep/trainers/base_trainer.py", line 69, in train
self.train_epoch()
File "/cluster/.../deep-rep/trainers/simple_trainer.py", line 95, in train_epoch
self.cycle_dataset(loader)
File "/cluster/.../deep-rep/trainers/simple_trainer.py", line 66, in cycle_dataset
for i, data in enumerate(loader, 1):
File "/cluster/apps/nss/gcc-6.3.0/python_gpu/3.8.5/torch/utils/data/dataloader.py", line 435, in __next__
data = self._next_data()
File "/cluster/apps/nss/gcc-6.3.0/python_gpu/3.8.5/torch/utils/data/dataloader.py", line 1057, in _next_data
self._shutdown_workers()
File "/cluster/apps/nss/gcc-6.3.0/python_gpu/3.8.5/torch/utils/data/dataloader.py", line 1177, in _shutdown_workers
w.join(timeout=_utils.MP_STATUS_CHECK_INTERVAL)
File "/cluster/apps/nss/gcc-6.3.0/python/3.8.5/x86_64/lib64/python3.8/multiprocessing/process.py", line 149, in join
res = self._popen.wait(timeout)
File "/cluster/apps/nss/gcc-6.3.0/python/3.8.5/x86_64/lib64/python3.8/multiprocessing/popen_fork.py", line 44, in wait
if not wait([self.sentinel], timeout):
File "/cluster/apps/nss/gcc-6.3.0/python/3.8.5/x86_64/lib64/python3.8/multiprocessing/connection.py", line 931, in wait
ready = selector.select(timeout)
File "/cluster/apps/nss/gcc-6.3.0/python/3.8.5/x86_64/lib64/python3.8/selectors.py", line 415, in select
fd_event_list = self._selector.poll(timeout)
File "/cluster/apps/nss/gcc-6.3.0/python_gpu/3.8.5/torch/utils/data/_utils/signal_handling.py", line 66, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 83465) is killed by signal: Terminated.
Restarting training from last epoch ...
....
It seems to be related to the num_workers setting of the DataLoader. Your default setting is settings.num_workers = 8; since I run the code on a single GPU, I correspondingly reduced num_workers to 4. The error occurs whenever num_workers is larger than 0, but setting it to 0 makes training too slow. I am confused by this behavior, since everything works fine on the SyntheticSR data.
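For reference, below is a minimal, self-contained sketch of the kind of loader setup I mean. The DummyBurstDataset class is only a placeholder I wrote for illustration (it is not your actual BurstSR dataset class, and the tensor shapes are rough guesses); with such in-memory data, a DataLoader with num_workers > 0 iterates without problems on my machine, which is why I suspect something in the on-disk BurstSR loading path.

import torch
from torch.utils.data import DataLoader, Dataset


class DummyBurstDataset(Dataset):
    """Placeholder standing in for the real BurstSR dataset class."""

    def __init__(self, length=1000, burst_size=8, crop=64):
        self.length = length
        self.burst_size = burst_size
        self.crop = crop

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        # The real dataset reads RAW bursts from disk; here we just return
        # random tensors with roughly similar shapes.
        burst = torch.rand(self.burst_size, 4, self.crop, self.crop)
        gt = torch.rand(3, 4 * self.crop, 4 * self.crop)
        return burst, gt


if __name__ == '__main__':
    num_workers = 4  # reduced from the default settings.num_workers = 8
    loader = DataLoader(DummyBurstDataset(), batch_size=2, shuffle=True,
                        num_workers=num_workers)
    for burst, gt in loader:
        pass  # iterates fine with dummy data; the crash only appears on BurstSR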
Do you have any idea what might cause this problem? Thank you very much!
Best regards, Shijian