nianticlabs / simplerecon

[ECCV 2022] SimpleRecon: 3D Reconstruction Without 3D Convolutions

ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm). #3

Closed pablovela5620 closed 1 year ago

pablovela5620 commented 1 year ago

I'm trying to run the test script on the 7Scenes dataset; I've tried both the standard and fast_cost_volume versions. This is the command I'm running (after following the preprocessing steps for 7Scenes as well as tuple generation):

CUDA_VISIBLE_DEVICES=0 python test.py \
    --name HERO_MODEL \
    --output_base_path OUTPUT_PATH \
    --config_file configs/models/hero_model.yaml \
    --load_weights_from_checkpoint weights/hero_model.ckpt \
    --data_config configs/data/7scenes_default.yaml \
    --num_workers 8 \
    --fast_cost_volume \
    --batch_size 2;

I'm using a machine with 3 A6000s (in a VS Code devcontainer), so the shared memory aspect seems weird, considering I have >40 GB of VRAM.


This is the exact error I get:

################################################################################
##################### 7Scenes Dataset, number of scans: 13 #####################
################################################################################

INFO - 2022-09-07 19:20:02,603 - helpers - Loading pretrained weights from url (https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-effv2-weights/tf_efficientnetv2_s_21ft1k-d7dafa41.pth)
################################################################################
########################## Using FeatureVolumeManager ##########################
 Number of source views:      7
 Using all metadata.
 Number of channels:          [202, 128, 128, 1]
################################################################################

################################################################################
########################## Using FeatureVolumeManager ##########################
 Number of source views:      7
 Using all metadata.
 Number of channels:          [202, 128, 128, 1]
################################################################################

################################################################################
######################## Using FastFeatureVolumeManager ########################
 Number of source views:      7
 Using all metadata.
 Number of channels:          [202, 128, 128, 1]
################################################################################
  0%|          | 0/13 [00:00<?, ?it/s]
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
  0%|          | 0/42 [00:00<?, ?it/s]
  0%|          | 0/13 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/opt/conda/envs/simplerecon/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 990, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/opt/conda/envs/simplerecon/lib/python3.9/multiprocessing/queues.py", line 113, in get
    if not self._poll(timeout):
  File "/opt/conda/envs/simplerecon/lib/python3.9/multiprocessing/connection.py", line 262, in poll
    return self._poll(timeout)
  File "/opt/conda/envs/simplerecon/lib/python3.9/multiprocessing/connection.py", line 429, in _poll
    r = wait([self], timeout)
  File "/opt/conda/envs/simplerecon/lib/python3.9/multiprocessing/connection.py", line 936, in wait
    ready = selector.select(timeout)
  File "/opt/conda/envs/simplerecon/lib/python3.9/selectors.py", line 416, in select
    fd_event_list = self._selector.poll(timeout)
  File "/opt/conda/envs/simplerecon/lib/python3.9/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 32106) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/volume/test.py", line 479, in <module>
    main(opts)
  File "/volume/test.py", line 263, in main
    for batch_ind, batch in enumerate(tqdm(dataloader)):
  File "/opt/conda/envs/simplerecon/lib/python3.9/site-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "/opt/conda/envs/simplerecon/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
    data = self._next_data()
  File "/opt/conda/envs/simplerecon/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1186, in _next_data
    idx, data = self._get_data()
  File "/opt/conda/envs/simplerecon/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1152, in _get_data
    success, data = self._try_get_data()
  File "/opt/conda/envs/simplerecon/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1003, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 32106) exited unexpectedly
pablovela5620 commented 1 year ago

Looks like setting --num_workers 0 fixes the issue for me, though I still don't fully understand why.
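
For reference, the workaround is just the same command as above with the worker processes disabled (everything else unchanged):

CUDA_VISIBLE_DEVICES=0 python test.py \
    --name HERO_MODEL \
    --output_base_path OUTPUT_PATH \
    --config_file configs/models/hero_model.yaml \
    --load_weights_from_checkpoint weights/hero_model.ckpt \
    --data_config configs/data/7scenes_default.yaml \
    --num_workers 0 \
    --fast_cost_volume \
    --batch_size 2;

With --num_workers 0 the data loading happens in the main process, so no tensors are passed between processes through /dev/shm, which would explain why the bus error goes away.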

mohammed-amr commented 1 year ago

Sounds like this is a shared memory limitation on your system.

Fortunately this should be an easy fix if you increase your system's shared memory limit: https://discuss.pytorch.org/t/training-crashes-due-to-insufficient-shared-memory-shm-nn-dataparallel/26396

If you're using Docker, then you should check https://github.com/pytorch/pytorch#docker-image, and in particular the --shm-size flag.
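
For example, something along these lines (the size and image name below are just placeholders; pick a value that fits your dataset and worker count):

# Check how much shared memory the container currently has.
df -h /dev/shm

# Relaunch the container with a larger /dev/shm.
docker run --gpus all --shm-size=8g -it <your-image> bash

If you're launching through a VS Code devcontainer, you can pass the same flag via "runArgs" in devcontainer.json, e.g. "runArgs": ["--gpus", "all", "--shm-size=8g"].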

pablovela5620 commented 1 year ago

Yep, looks like this was the issue! Thanks for the help.

mohammed-amr commented 1 year ago

Great! Welcome!