stevewongv / SPANet

Spatial Attentive Single-Image Deraining with a High Quality Real Rain Dataset (CVPR'19)

Error in beginning of the training #17

Closed erdemarpaci closed 3 years ago

erdemarpaci commented 4 years ago

When I try to start training, I encounter this error:

2020-08-19 10:18:42,483 - INFO - set log dir as ./logdir
2020-08-19 10:18:42,483 - INFO - set model dir as ./model
2020-08-19 10:18:55,600 - INFO - train_derain--l1_loss:0.01721 mask_loss:0.02403 ssim_loss:0.9961 all_loss:0.04513 lr:0.0005 step:4e+04
2020-08-19 10:18:57,353 - INFO - val_derain--l1_loss:0.01398 mask_loss:0.01736 ssim_loss:0.9957 all_loss:0.03568 lr:0.0005 step:4e+04
2020-08-19 10:18:58,377 - INFO - save image as step_40000
2020-08-19 10:18:58,413 - INFO - save model as step_40000
Exception ignored in: <bound method _DataLoaderIter.__del__ of <torch.utils.data.dataloader._DataLoaderIter object at 0x7f9809fadfd0>>
Traceback (most recent call last):
  File "/home/erdem/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 399, in __del__
    self._shutdown_workers()
  File "/home/erdem/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 378, in _shutdown_workers
    self.worker_result_queue.get()
  File "/usr/lib/python3.6/multiprocessing/queues.py", line 337, in get
    return _ForkingPickler.loads(res)
  File "/home/erdem/.local/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 151, in rebuild_storage_fd
    fd = df.detach()
  File "/usr/lib/python3.6/multiprocessing/resource_sharer.py", line 57, in detach
    with _resource_sharer.get_connection(self._id) as conn:
  File "/usr/lib/python3.6/multiprocessing/resource_sharer.py", line 87, in get_connection
    c = Client(address, authkey=process.current_process().authkey)
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 493, in Client
    answer_challenge(c, authkey)
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 737, in answer_challenge
    response = connection.recv_bytes(256)  # reject large message
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
ConnectionResetError: [Errno 104] Connection reset by peer

When I try to train again at different times, sometimes there is no error, but sometimes I get a different ConnectionResetError, and I can't understand why it happens. Since I am working with only one GPU, I set both my batch size and num_workers to 1. When I set the batch size larger than 1, I get "CUDA error: out of memory". Is the error caused by my GPU? I hope you can help me, thank you.

I am using Python 3.6 and PyTorch 0.4.1.
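For reference, one workaround I have seen suggested for this kind of worker-shutdown ConnectionResetError is to load data in the main process by setting num_workers=0. This is only a sketch with stand-in data, not SPANet's actual loader setup (in the real run you would pass the repository's own training dataset instead):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in tensors just to make the sketch runnable; with SPANet you would
# pass the repository's own training dataset object here instead.
dummy_rainy = torch.randn(8, 3, 64, 64)
dummy_clean = torch.randn(8, 3, 64, 64)
train_set = TensorDataset(dummy_rainy, dummy_clean)

train_loader = DataLoader(
    train_set,
    batch_size=1,     # small batch for a single GPU
    shuffle=True,
    num_workers=0,    # load in the main process: no worker subprocesses,
                      # so the shutdown path that raised ConnectionResetError
                      # is never taken
)

for rainy, clean in train_loader:
    pass  # training step would go here
```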

stevewongv commented 4 years ago

Yes, your GPU does not have enough memory to train this network. We used 8 Titan V GPUs with 12 GB of memory each to train SPANet.
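If only a single small GPU is available, one generic way to trade time for memory is gradient accumulation, i.e. keeping batch_size=1 in memory while stepping the optimizer less often. The following is a minimal, self-contained sketch of that idea with toy stand-ins, not SPANet's actual training loop; the real network, loss, and data loader from the repository would replace them:

```python
import torch
import torch.nn as nn

# Toy stand-ins so the sketch runs on its own.
model = nn.Conv2d(3, 3, 3, padding=1)
criterion = nn.L1Loss()
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
train_loader = [(torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64))
                for _ in range(8)]

accum_steps = 4  # emulate an effective batch of 4 while holding only batch_size=1 in memory

optimizer.zero_grad()
for step, (rainy, clean) in enumerate(train_loader):
    output = model(rainy)
    loss = criterion(output, clean) / accum_steps  # scale so accumulated grads match a larger batch
    loss.backward()                                # gradients accumulate across iterations
    if (step + 1) % accum_steps == 0:
        optimizer.step()                           # one parameter update per accum_steps mini-batches
        optimizer.zero_grad()
```

This only reduces the memory pressure coming from the batch size; if the network activations alone exceed the card's memory, a smaller training patch size would still be needed.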

erdemarpaci commented 4 years ago

Could working on Colab be a solution? Thanks for your reply.

stevewongv commented 4 years ago

I have not tried it on Colab, so you could give it a try.
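If you do try Colab, it may be worth checking which GPU you were assigned and how much memory it has before starting a run, since SPANet was trained on 12 GB cards and a smaller card will likely still need a reduced batch or patch size. A quick check along these lines:

```python
import torch

# Report the GPU Colab assigned and its total memory.
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(props.name, round(props.total_memory / 1024**3, 1), "GB")
else:
    print("No GPU assigned; enable one via Runtime -> Change runtime type")
```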