junyanz / pytorch-CycleGAN-and-pix2pix

Image-to-Image Translation in PyTorch

multiprocessing issue with nThreads>1 #23

Closed adrianalbert closed 6 years ago

adrianalbert commented 7 years ago

Hi,

Poking around in the pix2pix code, I noticed that training sometimes stops with an error that seems to be related to multiprocessing, probably the worker processes that load images in parallel. Setting nThreads=1 seems to make the error go away. Have you seen this in your experiments?

Full trace below:

Traceback (most recent call last):
  File "train.py", line 21, in <module>
    for i, data in enumerate(dataset):
  File "/usr/local/lib/python2.7/dist-packages/future/types/newobject.py", line 71, in next
    return type(self).next(self)
  File "/home/nbserver/urbanization-patterns/models/pytorch-CycleGAN-and-pix2pix/data/aligned_data_loader_csv.py", line 30, in next
    AB, labels, AB_paths = next(self.data_loader_iter)
  File "/usr/local/lib/python2.7/dist-packages/torch/utils/data/dataloader.py", line 206, in next
    idx, batch = self.data_queue.get()
  File "/usr/lib/python2.7/multiprocessing/queues.py", line 378, in get
    return recv()
  File "/usr/local/lib/python2.7/dist-packages/torch/multiprocessing/queue.py", line 22, in recv
    return pickle.loads(buf)
  File "/usr/lib/python2.7/pickle.py", line 1388, in loads
    return Unpickler(file).load()
  File "/usr/lib/python2.7/pickle.py", line 864, in load
    dispatch[key](self)
  File "/usr/lib/python2.7/pickle.py", line 1139, in load_reduce
    value = func(*args)
  File "/usr/local/lib/python2.7/dist-packages/torch/multiprocessing/reductions.py", line 68, in rebuild_storage_fd
    fd = multiprocessing.reduction.rebuild_handle(df)
  File "/usr/lib/python2.7/multiprocessing/reduction.py", line 155, in rebuild_handle
    conn = Client(address, authkey=current_process().authkey)
  File "/usr/lib/python2.7/multiprocessing/connection.py", line 175, in Client
    answer_challenge(c, authkey)
  File "/usr/lib/python2.7/multiprocessing/connection.py", line 432, in answer_challenge
    message = connection.recv_bytes(256)  # reject large message
IOError: [Errno 104] Connection reset by peer

junyanz commented 7 years ago

I haven't seen this bug before. Has it been resolved?

ufoym commented 7 years ago

If you are running inside Docker, try starting the container with docker run --ipc=host.
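
One plausible reading of the --ipc=host suggestion (an assumption, since the thread never confirms the cause) is that the data-loading workers ran out of shared memory: Docker containers get a 64 MB /dev/shm by default, and PyTorch worker processes hand batches to the main process through shared memory. A minimal sketch for checking how much shared memory the container actually has is below. The helper name shm_report is hypothetical, and it relies only on os.statvfs, so it should run on Unix under both Python 2.7 and Python 3.

```python
import os


def shm_report(path="/dev/shm"):
    """Return (total_bytes, free_bytes) for the mount at `path`, or None if it is absent.

    `shm_report` is a hypothetical helper for this thread, not part of PyTorch or Docker.
    """
    if not os.path.isdir(path):
        return None
    stats = os.statvfs(path)
    total = stats.f_frsize * stats.f_blocks   # size of the whole mount
    free = stats.f_frsize * stats.f_bavail    # space available to unprivileged users
    return total, free


if __name__ == "__main__":
    report = shm_report()
    if report is None:
        print("no /dev/shm mount found")
    else:
        total, free = report
        print("/dev/shm: %.1f MiB total, %.1f MiB free"
              % (total / 1048576.0, free / 1048576.0))
```

If the total is close to 64 MiB, the container is probably on Docker's default, and running it with --ipc=host (or a larger --shm-size) would be worth trying before falling back to nThreads=1.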