msr-fiddle / pipedream


insufficient shared memory (shm) #28

Closed ADAM-CT closed 4 years ago

ADAM-CT commented 4 years ago

Here is the error message. I was able to complete the first epoch of training, but the second epoch started failing with errors saying "This might be caused by insufficient shared memory (shm)." I can't understand why this error happened.

```
Epoch 0: 6843.771 seconds
Epoch start time: 1577064742.170, epoch end time: 1577071585.941
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 480, in _try_get_batch
    data = self.data_queue.get(timeout=timeout)
  File "/opt/conda/lib/python3.6/queue.py", line 173, in get
    self.not_empty.wait(remaining)
  File "/opt/conda/lib/python3.6/threading.py", line 299, in wait
    gotit = waiter.acquire(True, timeout)
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/_utils/signal_handling.py", line 63, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 325) is killed by signal: Bus error.
```

During handling of the above exception, another exception occurred:

```
Traceback (most recent call last):
  File "main_with_runtime.py", line 579, in <module>
    main()
  File "main_with_runtime.py", line 311, in main
    prec1 = validate(val_loader, r, epoch)
  File "main_with_runtime.py", line 453, in validate
    r.run_forward()
  File "../runtime.py", line 498, in run_forward
    self.receive_tensors_forward()
  File "../runtime.py", line 387, in receive_tensors_forward
    input = next(self.loader_iter)
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 545, in __next__
    idx, batch = self._get_batch()
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 512, in _get_batch
    success, data = self._try_get_batch()
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 488, in _try_get_batch
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str))
RuntimeError: DataLoader worker (pid(s) 325) exited unexpectedly
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
```
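For background: PyTorch DataLoader workers (num_workers > 0) pass batches to the main process through shared memory under /dev/shm, and Docker gives containers only 64 MB of /dev/shm by default, so multi-worker loading can hit this bus error once enough batches are in flight. A quick way to check from inside the container:

```sh
# Show the size and usage of the container's shared-memory mount.
# Docker's default is 64M, which multi-worker DataLoaders easily exhaust.
df -h /dev/shm
```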

deepakn94 commented 4 years ago

Not sure -- what are you trying to run?

ADAM-CT commented 4 years ago

Thanks, I solved this issue!

deepakn94 commented 4 years ago

Cool! What was the issue?

ADAM-CT commented 4 years ago

When I started Docker, I forgot the --ipc=host flag.
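For reference, a minimal sketch of the corrected invocation (the image name here is a placeholder, not the command actually used in this thread):

```sh
# --ipc=host shares the host's IPC namespace with the container,
# so the container sees the host's full /dev/shm instead of the 64 MB default.
docker run -it --ipc=host <pipedream-image> /bin/bash
```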

deepakn94 commented 4 years ago

Got it! You can also use --shm-size=16g or something similar. Going to close this!
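The --shm-size alternative keeps the container's IPC namespace isolated while enlarging its /dev/shm; again, the image name below is a placeholder:

```sh
# Give the container its own 16 GB /dev/shm without sharing the host's IPC namespace.
docker run -it --shm-size=16g <pipedream-image> /bin/bash
```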