[Bug] OOM when training BEVFusion (lidar + camera); distributed training does not seem to work properly

Prerequisite

[X] I have searched Issues and Discussions but cannot get the expected help.
[x] I have read the FAQ documentation but cannot get the expected help.
[X] The bug has not been fixed in the latest version (dev-1.x) or latest version (dev-1.0).

Task

I'm using the official example scripts/configs for the officially supported tasks/models/datasets.

Branch

1.x branch https://github.com/open-mmlab/mmdetection3d/tree/dev-1.x

Environment

:219: RuntimeWarning: scipy._lib.messagestream.MessageStream size changed, may indicate binary incompatibility. Expected 56 from C header, got 64 from PyObject

sys.platform: linux
Python: 3.8.5 (default, Sep 4 2020, 07:30:14) [GCC 7.3.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15: Tesla V100-SXM3-32GB
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.1, V11.1.105
GCC: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
PyTorch: 1.10.1+cu111
PyTorch compiling details: PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.2.3 (Git Hash 7336ca9f055cf1bfa13efb658fe15dc9b41f0740)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX512
  - CUDA Runtime 11.1
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86
  - CuDNN 8.0.5
  - Magma 2.5.2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.1, CUDNN_VERSION=8.0.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.10.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,
TorchVision: 0.11.2+cu111
OpenCV: 4.10.0
MMEngine: 0.10.5
MMDetection: 3.3.0
MMDetection3D: 1.4.0+962f093
spconv2.0: True

System environment:
    sys.platform: linux
    Python: 3.8.5 (default, Sep 4 2020, 07:30:14) [GCC 7.3.0]
    CUDA available: True
    MUSA available: False
    numpy_random_seed: 1686915582
    GPU 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15: Tesla V100-SXM3-32GB
    CUDA_HOME: /usr/local/cuda
    NVCC: Cuda compilation tools, release 11.1, V11.1.105
    GCC: gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
    PyTorch: 1.10.1+cu111
    PyTorch compiling details: PyTorch built with:
      - GCC 7.3
      - C++ Version: 201402
      - Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
      - Intel(R) MKL-DNN v2.2.3 (Git Hash 7336ca9f055cf1bfa13efb658fe15dc9b41f0740)
      - OpenMP 201511 (a.k.a. OpenMP 4.5)
      - LAPACK is enabled (usually provided by MKL)
      - NNPACK is enabled
      - CPU capability usage: AVX512
      - CUDA Runtime 11.1
      - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=co
      - CuDNN 8.0.5
      - Magma 2.5.2
      - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.1, CUDNN_VERSION=8.0.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-eroverflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.10.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, U
    TorchVision: 0.11.2+cu111
    OpenCV: 4.10.0
    MMEngine: 0.10.5

Runtime environment:
    cudnn_benchmark: False
    mp_cfg: {'mp_start_method': 'fork', 'opencv_num_threads': 0}
    dist_cfg: {'backend': 'nccl'}
    seed: 1686915582
    Distributed launcher: pytorch
    Distributed training: True
    GPU number: 8

Reproduces the problem - code sample

xx

Reproduces the problem - command or script

bash tools/dist_train.sh projects/BEVFusion/configs/bevfusion_lidar-cam_voxel0075_second_secfpn_8xb4-cyclic-20e_nus-3d.py 8 --cfg-options load_from=work_dirs/bevfusion_lidar_voxel0075_second_secfpn_8xb4-cyclic-20e_nus-3d/epoch_20.pth model.img_backbone.init_cfg.checkpoint=./swint-nuimages-pretrained.pth

Reproduces the problem - error message

  File "tools/train.py", line 145, in <module>
    main()
  File "tools/train.py", line 141, in main
    runner.train()
  File "/opt/conda/lib/python3.8/site-packages/mmengine/runner/runner.py", line 1789, in train
    model = self.train_loop.run()  # type: ignore
  File "/opt/conda/lib/python3.8/site-packages/mmengine/runner/loops.py", line 98, in run
    self.run_epoch()
  File "/opt/conda/lib/python3.8/site-packages/mmengine/runner/loops.py", line 115, in run_epoch
    self.run_iter(idx, data_batch)
  File "/opt/conda/lib/python3.8/site-packages/mmengine/runner/loops.py", line 131, in run_iter
    outputs = self.runner.model.train_step(
  File "/opt/conda/lib/python3.8/site-packages/mmengine/model/wrappers/distributed.py", line 121, in train_step
    losses = self._run_forward(data, mode='loss')
  File "/opt/conda/lib/python3.8/site-packages/mmengine/model/wrappers/distributed.py", line 161, in _run_forward
    results = self(**data, mode=mode)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 886, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mfc/user/1628848/pycharm/study/mmdetection3d/mmdet3d/models/detectors/base.py", line 75, in forward
    return self.loss(inputs, data_samples, **kwargs)
  File "/mfc/user/1628848/pycharm/study/mmdetection3d/projects/BEVFusion/bevfusion/bevfusion.py", line 292, in loss
    feats = self.extract_feat(batch_inputs_dict, batch_input_metas)
  File "/mfc/user/1628848/pycharm/study/mmdetection3d/projects/BEVFusion/bevfusion/bevfusion.py", line 268, in extract_feat
    img_feature = self.extract_img_feat(imgs, deepcopy(points),
  File "/mfc/user/1628848/pycharm/study/mmdetection3d/projects/BEVFusion/bevfusion/bevfusion.py", line 156, in extract_img_feat
    x = self.view_transform(
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mfc/user/1628848/pycharm/study/mmdetection3d/projects/BEVFusion/bevfusion/depth_lss.py", line 424, in forward
    x = super().forward(*args, **kwargs)
  File "/mfc/user/1628848/pycharm/study/mmdetection3d/projects/BEVFusion/bevfusion/depth_lss.py", line 330, in forward
    x = self.bev_pool(geom, x)
  File "/mfc/user/1628848/pycharm/study/mmdetection3d/projects/BEVFusion/bevfusion/depth_lss.py", line 140, in bev_pool
    x = x[kept]
RuntimeError: CUDA out of memory. Tried to allocate 2.11 GiB (GPU 0; 31.73 GiB total capacity; 17.32 GiB already allocated; 990.94 MiB free; 18.10 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Additional information

I suspect that distributed training is not working properly. I am using the nuScenes dataset.

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A    451731      C   /opt/conda/bin/python                        2216MiB |
|    0   N/A  N/A    451732      C   /opt/conda/bin/python                        1246MiB |
|    0   N/A  N/A    451736      C   /opt/conda/bin/python                        1104MiB |
|    0   N/A  N/A    451737      C   /opt/conda/bin/python                        1120MiB |
|    0   N/A  N/A    451739      C   /opt/conda/bin/python                        1184MiB |
|    0   N/A  N/A    451741      C   /opt/conda/bin/python                        1222MiB |
|    0   N/A  N/A    451743      C   /opt/conda/bin/python                        1176MiB |
|    0   N/A  N/A    451745      C   /opt/conda/bin/python                        1114MiB |
|    1   N/A  N/A    451732      C   /opt/conda/bin/python                        1768MiB |
|    2   N/A  N/A    451736      C   /opt/conda/bin/python                        1768MiB |
|    3   N/A  N/A    451737      C   /opt/conda/bin/python                        1768MiB |
|    4   N/A  N/A    451739      C   /opt/conda/bin/python                        1768MiB |
|    5   N/A  N/A    451741      C   /opt/conda/bin/python                        1768MiB |
|    6   N/A  N/A    451743      C   /opt/conda/bin/python                        1768MiB |
|    7   N/A  N/A    451745      C   /opt/conda/bin/python                        1648MiB |
+-----------------------------------------------------------------------------------------+

The OOM appears to happen because memory usage concentrates on GPU 0: each of the other seven processes holds memory on GPU 0 in addition to its own GPU. What is causing this?
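
To make the observation about GPU 0 concrete, here is a minimal sketch that parses the process rows of the nvidia-smi output above and lists the PIDs that hold memory on GPU 0 while also running on another GPU. The helper names (`gpus_per_pid`, `stray_contexts`) and the regular expression are my own, written only for this listing; it assumes every row has `N/A` in the GI/CI columns, as it does here.

```python
import re
from collections import defaultdict

# Process rows copied from the nvidia-smi output in this report.
ROWS = """
| 0 N/A N/A 451731 C /opt/conda/bin/python 2216MiB |
| 0 N/A N/A 451732 C /opt/conda/bin/python 1246MiB |
| 0 N/A N/A 451736 C /opt/conda/bin/python 1104MiB |
| 0 N/A N/A 451737 C /opt/conda/bin/python 1120MiB |
| 0 N/A N/A 451739 C /opt/conda/bin/python 1184MiB |
| 0 N/A N/A 451741 C /opt/conda/bin/python 1222MiB |
| 0 N/A N/A 451743 C /opt/conda/bin/python 1176MiB |
| 0 N/A N/A 451745 C /opt/conda/bin/python 1114MiB |
| 1 N/A N/A 451732 C /opt/conda/bin/python 1768MiB |
| 2 N/A N/A 451736 C /opt/conda/bin/python 1768MiB |
| 3 N/A N/A 451737 C /opt/conda/bin/python 1768MiB |
| 4 N/A N/A 451739 C /opt/conda/bin/python 1768MiB |
| 5 N/A N/A 451741 C /opt/conda/bin/python 1768MiB |
| 6 N/A N/A 451743 C /opt/conda/bin/python 1768MiB |
| 7 N/A N/A 451745 C /opt/conda/bin/python 1648MiB |
"""

# Hypothetical pattern for one process row: GPU index, PID, memory in MiB.
ROW_RE = re.compile(r"\|\s*(\d+)\s+N/A\s+N/A\s+(\d+)\s+C\s+\S+\s+(\d+)MiB\s*\|")


def gpus_per_pid(text):
    """Map each PID to {gpu_index: memory_mib} for every GPU it appears on."""
    usage = defaultdict(dict)
    for gpu, pid, mib in ROW_RE.findall(text):
        usage[int(pid)][int(gpu)] = int(mib)
    return dict(usage)


def stray_contexts(usage, gpu=0):
    """Return {pid: mib_on_gpu} for PIDs that use `gpu` and at least one other GPU."""
    return {
        pid: gpus[gpu]
        for pid, gpus in usage.items()
        if gpu in gpus and len(gpus) > 1
    }


usage = gpus_per_pid(ROWS)
stray = stray_contexts(usage)
print("PIDs on GPU 0 and another GPU:", sorted(stray))
print("Extra memory on GPU 0 (MiB):", sum(stray.values()))
```

For this table it reports seven PIDs (every process except 451731) holding 8166 MiB on GPU 0 in total. A pattern like this can appear when something is allocated on the default CUDA device before each rank has selected its own GPU, but that is only a guess and is not confirmed by anything in this report.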
