open-mmlab / mmdetection3d

OpenMMLab's next-generation platform for general 3D object detection.
https://mmdetection3d.readthedocs.io/en/latest/
Apache License 2.0

[Bug] BEVFusion LiDAR-Camera training: torch.distributed.elastic.multiprocessing.errors.ChildFailedError #2677

Open shingszelam opened 1 year ago

shingszelam commented 1 year ago

Prerequisite

Task

I'm using the official example scripts/configs for the officially supported tasks/models/datasets.

Branch

main branch https://github.com/open-mmlab/mmdetection3d

Environment

Ubuntu 22.04, CUDA 11.8, GCC 11.3

Reproduces the problem - code sample

Hello, I am using the BEVFusion project in mmdetection3d. Initially, I was able to train with LiDAR-only data successfully. However, when I attempted to train the LiDAR-camera fusion model, training failed with the following message:

raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
tools/train.py FAILED
------------------------------------------------------------
Failures:
------------------------------------------------------------

The preceding error is:

RuntimeError: /tmp/mmcv/mmcv/ops/csrc/pytorch/cuda/sparse_indice.cu 126
cuda execution failed with error 2
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 10975) of binary: /home/dl/anaconda3/envs/openmmlab/bin/python

I followed a solution on CSDN and reduced the batch size to the minimum value of 1, but the error persists. How can I resolve this issue?

Reproduces the problem - command or script

bash tools/dist_train.sh projects/BEVFusion/configs/bevfusion_lidar-cam_voxel0075_second_secfpn_8xb4-cyclic-20e_nus-3d.py 1 --cfg-options load_from=/home/dl/csl/mmdetection3d/work_dirs/bevfusion_lidar_voxel0075_second_secfpn_8xb4-cyclic-20e_nus-3d/epoch_20.pth model.img_backbone.init_cfg.checkpoint=/home/dl/csl/mmdetection3d/swint-nuimages-pretrained.pth --amp

Reproduces the problem - error message

The same error as above: torch.distributed.elastic.multiprocessing.errors.ChildFailedError, raised after RuntimeError: /tmp/mmcv/mmcv/ops/csrc/pytorch/cuda/sparse_indice.cu 126 cuda execution failed with error 2.

Additional information

No response
shingszelam commented 1 year ago

I have resolved the previous issue; I also changed the GPU count in dist_train.sh to 1, which allowed it to run successfully.

However, a new problem has arisen. When running fusion training, I encounter a CUDA out-of-memory error. My device is an RTX 3090 with 24 GB of VRAM. The original paper used distributed training on 8 RTX 3090 GPUs with a batch size of 4 per GPU. I have already reduced the batch size to 1 and I am using the nuScenes-mini dataset, but I still get the following error:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.38 GiB (GPU 0; 23.68 GiB total capacity; 12.82 GiB already allocated; 1.87 GiB free; 20.02 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I would like to know how to resolve this issue.
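Not a confirmed fix, but the OOM message itself suggests trying max_split_size_mb to reduce allocator fragmentation. A minimal sketch of how that option can be set, assuming it is done before any CUDA context is created (the 128 MiB value is only an illustration, not a recommendation from this thread):

```python
# Sketch only: apply the allocator option suggested by the OOM message.
# It must be set before the CUDA allocator is initialised, e.g. at the
# very top of tools/train.py or in the shell before launching training.
import os

os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

import torch  # imported after setting the env var so the allocator sees it
```

Lowering the batch size further is the other lever discussed in this thread; with mmengine-style configs it should also be possible to override it from the command line, e.g. `--cfg-options train_dataloader.batch_size=1`, which matches what the reporter already did in the config.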

FrozVolca commented 1 year ago

Same environment, same devices, same error

Mingqj commented 8 months ago

I solved this error by changing batch_size from 4 to 2 in train_dataloader in the bevfusion_lidar_voxel0075_second_secfpn_8xb4-cyclic-20e_nus-3d.py config file.
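For reference, a sketch of what that edit looks like in the config. Only batch_size is the change being described; the surrounding keys are assumed from the usual mmdetection3d dataloader layout and are shown only for context:

```python
# bevfusion_lidar_voxel0075_second_secfpn_8xb4-cyclic-20e_nus-3d.py (sketch)
train_dataloader = dict(
    batch_size=2,   # lowered from 4 so training fits on a single 24 GB RTX 3090
    num_workers=4,  # assumed/unchanged, shown only for context
    # dataset=dict(...), sampler=dict(...)  # unchanged
)
```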

mook0126 commented 4 months ago

I am also seeing the same issue.