Hi,
Poking around in the Pix2Pix code, I've noticed that training sometimes stops with an error that seems related to multiprocessing, probably the workers that load images in parallel. Setting nThreads=1 seems to have made the error go away, but I'm wondering whether you've seen this in your experiments?
Full trace below:
Traceback (most recent call last):
  File "train.py", line 21, in <module>
    for i, data in enumerate(dataset):
  File "/usr/local/lib/python2.7/dist-packages/future/types/newobject.py", line 71, in next
    return type(self).__next__(self)
  File "/home/nbserver/urbanization-patterns/models/pytorch-CycleGAN-and-pix2pix/data/aligned_data_loader_csv.py", line 30, in __next__
    AB, labels, AB_paths = next(self.data_loader_iter)
  File "/usr/local/lib/python2.7/dist-packages/torch/utils/data/dataloader.py", line 206, in __next__
    idx, batch = self.data_queue.get()
  File "/usr/lib/python2.7/multiprocessing/queues.py", line 378, in get
    return recv()
  File "/usr/local/lib/python2.7/dist-packages/torch/multiprocessing/queue.py", line 22, in recv
    return pickle.loads(buf)
  File "/usr/lib/python2.7/pickle.py", line 1388, in loads
    return Unpickler(file).load()
  File "/usr/lib/python2.7/pickle.py", line 864, in load
    dispatch[key](self)
  File "/usr/lib/python2.7/pickle.py", line 1139, in load_reduce
    value = func(*args)
  File "/usr/local/lib/python2.7/dist-packages/torch/multiprocessing/reductions.py", line 68, in rebuild_storage_fd
    fd = multiprocessing.reduction.rebuild_handle(df)
  File "/usr/lib/python2.7/multiprocessing/reduction.py", line 155, in rebuild_handle
    conn = Client(address, authkey=current_process().authkey)
  File "/usr/lib/python2.7/multiprocessing/connection.py", line 175, in Client
    answer_challenge(c, authkey)
  File "/usr/lib/python2.7/multiprocessing/connection.py", line 432, in answer_challenge
    message = connection.recv_bytes(256)         # reject large message
IOError: [Errno 104] Connection reset by peer
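In case it helps, here is roughly what my workaround amounts to at the plain torch.utils.data level. This is only a minimal sketch with placeholder data, not the repo's actual loader: I'm assuming the nThreads option ends up as the DataLoader's num_workers, and that dropping it to 0 keeps all loading in the main process, so the worker queue and pickling path in the trace above are never exercised.

# Minimal sketch (placeholder data, not the repo's loader code);
# assumes nThreads is wired through to DataLoader's num_workers.
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in for the aligned AB image dataset; shapes are arbitrary.
images = torch.randn(8, 3, 256, 256)
labels = torch.zeros(8).long()
dataset = TensorDataset(images, labels)

# num_workers=0 loads everything in the main process: no worker queue
# and no pickling of tensors across processes (the code path that
# fails in the trace above). Raising it restores parallel loading.
loader = DataLoader(dataset, batch_size=1, shuffle=True, num_workers=0)

for i, (AB, batch_labels) in enumerate(loader):
    pass  # training step would go here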