Closed ADAM-CT closed 4 years ago
Not sure -- what are you trying to run?
Thanks, I solved this problem!
Cool! What was the issue?
When I started the Docker container, I forgot to pass the --ipc=host parameter.
Got it! You can also use --shm-size=16g or something similar.
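For anyone hitting the same thing, one quick way to see how much shared memory the container actually has is to check the shared-memory mount from inside Python. This is only a rough sketch: it assumes a Linux container where shared memory is mounted at /dev/shm, and the function name `shm_report` and the 1 GiB threshold are made up for illustration (they are not from this thread or from PyTorch).

```python
import os
import shutil


def shm_report(path="/dev/shm", min_bytes=1 << 30):
    """Return (total_bytes, free_bytes, ok) for the mount at `path`.

    `ok` is True when the mount's total size is at least `min_bytes`.
    The 1 GiB default is an arbitrary illustrative threshold.
    """
    usage = shutil.disk_usage(path)
    return usage.total, usage.free, usage.total >= min_bytes


if __name__ == "__main__" and os.path.isdir("/dev/shm"):
    total, free, ok = shm_report()
    print(f"/dev/shm: total={total / 2**20:.0f} MiB, free={free / 2**20:.0f} MiB")
    if not ok:
        # Docker's default /dev/shm is small (64 MB) unless the container
        # is started with --ipc=host or a larger --shm-size.
        print("Shared memory looks small; consider --ipc=host or --shm-size.")
```

If the reported total is only tens of megabytes, the container was most likely started without either flag, which would fit the bus errors in the DataLoader workers.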
Going to close this!
Here is the error message. I was able to complete one epoch of training, but errors started when the second epoch began: "This might be caused by insufficient shared memory (shm)." Why does this error happen?
Epoch 0: 6843.771 seconds
Epoch start time: 1577064742.170, epoch end time: 1577071585.941
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
Traceback (most recent call last):
File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 480, in _try_get_batch
data = self.data_queue.get(timeout=timeout)
File "/opt/conda/lib/python3.6/queue.py", line 173, in get
self.not_empty.wait(remaining)
File "/opt/conda/lib/python3.6/threading.py", line 299, in wait
gotit = waiter.acquire(True, timeout)
File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/_utils/signal_handling.py", line 63, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 325) is killed by signal: Bus error.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "main_with_runtime.py", line 579, in <module>
main()
File "main_with_runtime.py", line 311, in main
prec1 = validate(val_loader, r, epoch)
File "main_with_runtime.py", line 453, in validate
r.run_forward()
File "../runtime.py", line 498, in run_forward
self.receive_tensors_forward()
File "../runtime.py", line 387, in receive_tensors_forward
input = next(self.loader_iter)
File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 545, in __next__
idx, batch = self._get_batch()
File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 512, in _get_batch
success, data = self._try_get_batch()
File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 488, in _try_get_batch
raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str))
RuntimeError: DataLoader worker (pid(s) 325) exited unexpectedly
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).