hongsukchoi / TCMR_RELEASE

Official Pytorch implementation of "Beyond Static Features for Temporally Consistent 3D Human Pose and Shape from a Video", CVPR 2021

DataLoader worker (pid 2991): Bus error. #24

Open Mirandl opened 2 years ago

Mirandl commented 2 years ago

Hi, thank you for your great work! When running your code, I got this error:

```
Running TCMR on each person tracklet...
  0%|          | 0/5 [00:00<?, ?it/s]ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
  0%|          | 0/5 [00:02<?, ?it/s]
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 779, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/usr/lib/python3.6/multiprocessing/queues.py", line 104, in get
    if not self._poll(timeout):
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 257, in poll
    return self._poll(timeout)
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 414, in _poll
    r = wait([self], timeout)
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 911, in wait
    ready = selector.select(timeout)
  File "/usr/lib/python3.6/selectors.py", line 376, in select
    fd_event_list = self._poll.poll(timeout)
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 2991) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/data/meilin/TCMR/demo.py", line 377, in <module>
    main(args)
  File "/root/data/meilin/TCMR/demo.py", line 157, in main
    for i, batch in enumerate(crop_dataloader):
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 363, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 974, in _next_data
    idx, data = self._get_data()
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 941, in _get_data
    success, data = self._try_get_data()
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 792, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str))
RuntimeError: DataLoader worker (pid(s) 2991) exited unexpectedly

Process finished with exit code 1
```

It seems `num_workers` needs to be adjusted, but changing it didn't help. Could you give me some guidance on this? Thank you!
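For reference, the usual DataLoader-side workarounds for this kind of bus error are to drop `num_workers` to 0 or to switch PyTorch's tensor sharing strategy away from `/dev/shm`. A minimal sketch is below; the dataset construction is a placeholder, not the actual code in demo.py (only the `crop_dataloader` name comes from the traceback above).

```python
# Sketch of common workarounds when DataLoader workers die with a shm bus error.
import torch
import torch.multiprocessing as mp
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset standing in for the cropped-frame dataset demo.py builds.
crop_dataset = TensorDataset(torch.zeros(16, 3, 224, 224))

# Option 1: no worker processes at all (slower, but no shared memory is needed).
crop_dataloader = DataLoader(crop_dataset, batch_size=4, num_workers=0)

# Option 2: keep workers, but share tensors through the filesystem instead of /dev/shm.
mp.set_sharing_strategy('file_system')
crop_dataloader = DataLoader(crop_dataset, batch_size=4, num_workers=4)

for i, batch in enumerate(crop_dataloader):
    pass  # run inference on the batch here
```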

hongsukchoi commented 2 years ago

The shared memory error message usually indicates that RAM (CPU memory) is insufficient. I remember the experiment normally took around 50GB.

Try increasing RAM (add more or bigger RAM modules), or, though not recommended, create swap memory on disk.

Mirandl commented 2 years ago

Hi, thank you very much for your timely reply.

I have tried this, but it still gives the same error. My memory is 61GB and shared memory is 64MB. I use 21 CPUs and 1 GPU. Should I keep increasing them?
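Since the numbers above matter here, a quick way to read both totals from Python (standard library only, Linux paths assumed) is sketched below; with a 64MB `/dev/shm`, each worker has very little room to pass batches back to the main process even though overall RAM is large.

```python
# Quick check of total RAM and /dev/shm capacity (Linux only).
import os

def read_meminfo_kb(key):
    # /proc/meminfo lines look like "MemTotal:       65834312 kB"
    with open('/proc/meminfo') as f:
        for line in f:
            if line.startswith(key + ':'):
                return int(line.split()[1])
    return None

mem_total_gb = read_meminfo_kb('MemTotal') / (1024 ** 2)

shm = os.statvfs('/dev/shm')
shm_total_gb = shm.f_blocks * shm.f_frsize / (1024 ** 3)

print(f'RAM total:      {mem_total_gb:.1f} GB')
print(f'/dev/shm total: {shm_total_gb:.2f} GB')  # ~0.06 GB would match the 64MB mentioned above
```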

hongsukchoi commented 2 years ago

First check the exact required memory with htop. I guess at least 128GB is safe!
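If watching htop by hand is inconvenient, one rough alternative is to log the peak RSS of the main process from inside demo.py; a standard-library sketch (Linux semantics, where `ru_maxrss` is in kilobytes) is below. Note it does not count the DataLoader worker processes.

```python
# Rough peak-memory logging for the main process (Linux: ru_maxrss is in kilobytes).
import resource

def log_peak_rss(tag=''):
    peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print(f'[{tag}] peak RSS of main process: {peak_kb / (1024 ** 2):.1f} GB')

# e.g. call log_peak_rss('after TCMR inference') at the end of main() in demo.py
```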