Have you ever monitored your machine's memory usage? This could potentially be caused by running out of memory.
Hi, @chenshi3
It does not seem to be an out-of-memory problem. My machine has 256 GB of memory in total, and although I have not monitored it closely, the maximum memory usage when running with `torch.multiprocessing` is around 133 GB, which is only about half of the total.
And each GPU has about 24 GB of memory, of which about 10 GB is occupied per GPU.
@chenshi3 And I have a question: is it necessary to use `torch.multiprocessing` to start multiprocessing?
I see that in many DDP best-practice examples, the `mp.set_start_method()` call is optional, and not all of them use it.
And in my case, setting `mp.set_start_method('spawn')` indeed makes the program run about 2x faster.
Is there any reason for this?
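For context, the pattern I see in most DDP examples looks like the minimal sketch below (my own reconstruction, not OpenPCDet code): under `torchrun` one process per GPU already exists, so no `mp.set_start_method()` call appears; as far as I understand, the start method then mainly affects how `DataLoader` worker processes are created.

```python
# Minimal DDP sketch, assuming a torchrun launch (not OpenPCDet code):
# torchrun spawns one process per GPU itself, so no mp.set_start_method()
# call is needed here; LOCAL_RANK is set by torchrun.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")
    model = DDP(torch.nn.Linear(8, 8).cuda(), device_ids=[local_rank])
    # ... training loop goes here ...
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```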
I'm sorry, but I'm not an expert on this issue and am unable to provide valuable comments.
Hi @chenshi3 , could you please provide an example command of torchrun? I am using the pytorch launcher for the first time and it would be really helpful to have the command. Thanks again~!
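For reference, this is the shape I have pieced together so far (the config path is just a placeholder, and I am not certain every flag is right):

```bash
# Hypothetical example, assuming OpenPCDet's tools/train.py and 2 GPUs
# on one machine; substitute your own --cfg_file.
cd tools
torchrun --standalone --nproc_per_node=2 train.py \
    --launcher pytorch \
    --cfg_file cfgs/kitti_models/pv_rcnn.yaml
```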
Hi, @sshaoshuai @chenshi3 @jihanyang @yukang2017 @djiajunustc @Gus-Guo @Cedarch @acivgin1
I am stuck at an NCCL problem. If I use multiple GPUs on a single machine, it hits an NCCL communication error:
[Rank 1] Watchdog caught collective operation timeout:
Location: The place where the program gets stuck and raises the timeout error is random, but most of the time it is at `dist.all_reduce()` or some `all_gather` call. It looks like an NCCL communication problem.
Special Features: When the program runs normally, the utilization of the GPUs varies over time. But when the program is stuck on this problem, the utilization of all used GPUs stays at 100% until the timeout error is raised.
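For what it is worth, this symptom (all GPUs pinned at 100% until the watchdog fires) is exactly what a desynchronized collective looks like. A minimal standalone sketch, not from OpenPCDet, that reproduces it when launched with `torchrun --nproc_per_node=2`:

```python
# Desynchronized-collective sketch: rank 0 enters dist.all_reduce() but
# rank 1 never joins, so rank 0's NCCL kernel spins at 100% GPU until the
# watchdog raises "collective operation timeout".
import os
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    t = torch.ones(1, device="cuda")
    if dist.get_rank() == 0:
        dist.all_reduce(t)  # blocks: the other rank skipped this collective
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```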
Some Trials: I have done some trials to figure out what is happening and to save the developers time locating the problem. The most meaningful finding is the following:
In pcdet/utils/common_utils.py, we have the code:
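```python
# The two lines in question, in init_dist_pytorch() (reconstructed from
# memory since the snippet did not survive the paste; the exact code may
# differ slightly across versions). Here `mp` is torch.multiprocessing.
if mp.get_start_method(allow_none=True) is None:
    mp.set_start_method('spawn')
```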
If I delete these two lines of code, the program runs normally without any NCCL timeout error for several hours, whereas if I use `torch.multiprocessing`, the NCCL timeout error is generally hit within 1 hour.
And if I DISABLE `torch.multiprocessing`, the ETA of the program becomes about 2x higher, similar to running with a single GPU (this may be related to the number of GPUs, as I am using 2), and only about 43 GB of memory is used, instead of the roughly 133 GB used when `torch.multiprocessing` is ENABLED.
In conclusion, the problem seems related to the mixed use of both `torchrun` and `torch.multiprocessing`. The mixed use causes higher memory usage and higher speed, but hits a random NCCL `collective operations timeout` error within a short time. Using only `torchrun` seems to get rid of the NCCL timeout error, but it runs slower and uses much less memory.
Below are some detailed logs about this issue.
Environment:
Log: The complete log: openpcdet_2gpu_mixed_torchrun_mp_log.txt
The most important parts are shown below.
Strange Behavior Capture:
The utilization of both GPUs is stuck at 100%; at the beginning the program ran normally and the utilization varied. GPU 1 seems to be sending and receiving something, since its TX and RX use large bandwidth, but GPU 0 shows almost no activity, with much lower bandwidth for both TX and RX. And both processes are using 100% CPU.