shingszelam opened this issue 1 year ago
I have resolved the previous issue, and I also changed the GPU count in dist_train.sh to 1, which allowed training to run successfully.
However, a new problem has arisen. When running fusion training, I hit a CUDA out-of-memory error. My device is an RTX 3090 with 24 GB of VRAM. The original paper used distributed training on 8 RTX 3090 GPUs with a batch size of 4 per GPU. I have already reduced the batch size to 1 for training, and I am using the nuScenes-mini dataset. Despite this, I still encounter the following error:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.38 GiB (GPU 0; 23.68 GiB total capacity; 12.82 GiB already allocated; 1.87 GiB free; 20.02 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
I would like to know how to resolve this issue.
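The error message itself suggests one mitigation: reserved memory (20.02 GiB) far exceeds allocated memory (12.82 GiB), which points to allocator fragmentation, and setting max_split_size_mb can help. A minimal sketch of how to do that from Python, assuming you launch through tools/train.py (the value 128 is illustrative, not taken from this thread):

```python
import os

# Must run before the first CUDA allocation, so place it at the very top
# of tools/train.py (or export PYTORCH_CUDA_ALLOC_CONF in the shell).
# max_split_size_mb caps how large a cached block the allocator will
# split, trading some allocation speed for less fragmentation.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")
```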
Same environment, same devices, same error
I solved this error. I changed batch_size from 4 to 2 in train_dataloader in the bevfusion_lidar_voxel0075_second_secfpn_8xb4-cyclic-20e_nus-3d.py config file.
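For reference, a minimal sketch of that change, assuming the standard mmengine config layout; everything except batch_size is abbreviated or illustrative:

```python
# In bevfusion_lidar_voxel0075_second_secfpn_8xb4-cyclic-20e_nus-3d.py
train_dataloader = dict(
    batch_size=2,   # was 4; halved so the model fits on one 24 GB RTX 3090
    num_workers=4,  # illustrative; keep whatever your config already has
    dataset=dict(
        # ... dataset settings unchanged ...
    ),
)
```

Note that the config name encodes the original schedule (8xb4 = 8 GPUs × batch 4), so lowering batch_size also lowers the effective batch size, and the learning rate may need a matching adjustment.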
I also encountered the same issue.
Prerequisite
Task
I'm using the official example scripts/configs for the officially supported tasks/models/datasets.
Branch
main branch https://github.com/open-mmlab/mmdetection3d
Environment
Ubuntu 22.04, CUDA 11.8, GCC 11.3
Reproduces the problem - code sample
Hello, I am using the BEVFusion project in mmdetection3d. Initially, I was able to train successfully with LiDAR-only data. However, when I attempted to train the LiDAR-camera fusion model, training failed with the following message: raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
tools/train.py FAILED
Failures: