rdevon / cortex

A machine learning library for PyTorch
BSD 3-Clause "New" or "Revised" License

Dataloader worker not releasing on interrupts #47

Open rdevon opened 6 years ago

rdevon commented 6 years ago

Right now, an init_fn is passed to the DataLoader to avoid flooding the terminal when you do a keyboard interrupt. PyTorch doesn't normally handle this well, so I fit in a hack that makes the workers treat SIGINT as SIG_IGN (that is, ignore it). The side effect is that the workers get terminated later, not at the moment you send the SIGINT:

Exception ignored in: <bound method _DataLoaderIter.__del__ of <torch.utils.data.dataloader._DataLoaderIter object at 0x7f9db1d30fd0>>
Traceback (most recent call last):
  File "/home/devon/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 349, in __del__
    self._shutdown_workers()
  File "/home/devon/.local/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 328, in _shutdown_workers
    self.worker_result_queue.get()
  File "/usr/lib/python3.6/multiprocessing/queues.py", line 337, in get
    return _ForkingPickler.loads(res)
  File "/home/devon/.local/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 70, in rebuild_storage_fd
    fd = df.detach()
  File "/usr/lib/python3.6/multiprocessing/resource_sharer.py", line 57, in detach
    with _resource_sharer.get_connection(self._id) as conn:
  File "/usr/lib/python3.6/multiprocessing/resource_sharer.py", line 87, in get_connection
    c = Client(address, authkey=process.current_process().authkey)
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 493, in Client
    answer_challenge(c, authkey)
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 732, in answer_challenge
    message = connection.recv_bytes(256)         # reject large message
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError:
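
For reference, the hack described above can be sketched roughly as follows. This is a minimal sketch and not necessarily the code used in cortex: `ignore_sigint` and `build_loader` are hypothetical names, and the only PyTorch API assumed is the `worker_init_fn` argument of `DataLoader`, which is called once in each worker process.

```python
import signal


def ignore_sigint(worker_id):
    """Hypothetical init function: make the calling worker ignore SIGINT.

    DataLoader calls this once per worker with the worker's index, so
    Ctrl-C no longer raises KeyboardInterrupt inside the workers.
    """
    signal.signal(signal.SIGINT, signal.SIG_IGN)


def build_loader(dataset, batch_size=32, num_workers=2):
    """Hypothetical helper that wires the init function into a DataLoader."""
    # Imported here so the sketch can be read and imported without PyTorch.
    from torch.utils.data import DataLoader

    return DataLoader(
        dataset,
        batch_size=batch_size,
        num_workers=num_workers,
        worker_init_fn=ignore_sigint,  # runs in each worker process
    )
```

With something like this in place, only the main process sees the keyboard interrupt. The workers keep running until the loader iterator shuts them down, which is consistent with the delayed termination shown in the traceback above.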

rdevon commented 6 years ago

Yes, this issue is still relevant. I'm not sure there's a way to fix it without modifying the PyTorch data loader.