junyanz / pytorch-CycleGAN-and-pix2pix

Image-to-Image Translation in PyTorch

insufficient shared memory - pix2pix #1463

Open mannam95 opened 2 years ago

mannam95 commented 2 years ago

I am getting the error below whenever I set --num_threads to anything greater than 0.

I have a 48 GB GPU. When I pass --num_threads=0 everything works, but the dataloader is slow, so the GPU sits idle waiting for data even though it has plenty of free memory.
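
(For context: the bus error below is about the /dev/shm shared-memory segment that DataLoader workers use, not GPU memory. A minimal sketch for checking how much shared memory is actually available, assuming a Linux host with /dev/shm mounted; this is not part of the repo, just a diagnostic snippet:)

import os

# Report free vs. total space in /dev/shm before launching training.
stats = os.statvfs("/dev/shm")
free_gb = stats.f_bavail * stats.f_frsize / 1024**3
total_gb = stats.f_blocks * stats.f_frsize / 1024**3
print(f"/dev/shm: {free_gb:.2f} GiB free of {total_gb:.2f} GiB")

If this reports only a few hundred MiB (common inside containers), the workers will hit the bus error as soon as they try to share image batches.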

Error

ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm). 
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm). 
Traceback (most recent call last):
  File "/X/X/miniconda3/envs/pytorch-CycleGAN-and-pix2pix/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 761, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/X/X/miniconda3/envs/pytorch-CycleGAN-and-pix2pix/lib/python3.6/multiprocessing/queues.py", line 104, in get
    if not self._poll(timeout):
  File "/X/X/miniconda3/envs/pytorch-CycleGAN-and-pix2pix/lib/python3.6/multiprocessing/connection.py", line 257, in poll
    return self._poll(timeout)
  File "/X/X/miniconda3/envs/pytorch-CycleGAN-and-pix2pix/lib/python3.6/multiprocessing/connection.py", line 414, in _poll
    r = wait([self], timeout)
  File "/X/X/miniconda3/envs/pytorch-CycleGAN-and-pix2pix/lib/python3.6/multiprocessing/connection.py", line 911, in wait
    ready = selector.select(timeout)
  File "/X/X/miniconda3/envs/pytorch-CycleGAN-and-pix2pix/lib/python3.6/selectors.py", line 376, in select
    fd_event_list = self._poll.poll(timeout)
  File "/X/X/miniconda3/envs/pytorch-CycleGAN-and-pix2pix/lib/python3.6/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 1306065) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train.py", line 44, in <module>
    for i, data in enumerate(dataset):  # inner loop within one epoch
  File "/X/X/X/git/pytorch-CycleGAN-and-pix2pix/data/__init__.py", line 90, in __iter__
    for i, data in enumerate(self.dataloader):
  File "/X/X/miniconda3/envs/pytorch-CycleGAN-and-pix2pix/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
    data = self._next_data()
  File "/X/X/miniconda3/envs/pytorch-CycleGAN-and-pix2pix/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 841, in _next_data
    idx, data = self._get_data()
  File "/X/X/miniconda3/envs/pytorch-CycleGAN-and-pix2pix/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 808, in _get_data
    success, data = self._try_get_data()
  File "/X/X/miniconda3/envs/pytorch-CycleGAN-and-pix2pix/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 774, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str))
RuntimeError: DataLoader worker (pid(s) 1306065) exited unexpectedly
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm). 
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm). 
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm). 
junyanz commented 1 year ago

It's not related to GPU memory. I am not sure how to fix it. Check out this thread.
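
A common workaround (not specific to this repo, just the usual fix for this PyTorch DataLoader error) is either to raise the shared-memory limit, e.g. the --shm-size flag when training inside Docker, or to switch PyTorch's multiprocessing sharing strategy so tensors are passed through the file system instead of /dev/shm. A minimal sketch of the latter, assuming it is added near the top of train.py before the dataset and dataloader are created:

import torch.multiprocessing as mp

# Pass tensors between workers via the file system instead of shared
# memory; avoids the /dev/shm limit at the cost of some extra file I/O.
mp.set_sharing_strategy('file_system')

Keeping --num_threads=0 also avoids the problem entirely, as noted above, at the cost of slower data loading.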