open-mmlab / OpenPCDet

OpenPCDet Toolbox for LiDAR-based 3D Object Detection.
Apache License 2.0

NaN or Inf found in input tensor in NUSCENES #1005

Closed: JianyyuWang closed this issue 2 years ago

JianyyuWang commented 2 years ago

2022-06-19 11:16:02,958 INFO **Start logging**
2022-06-19 11:16:02,958 INFO CUDA_VISIBLE_DEVICES=0, 1
2022-06-19 11:16:02,958 INFO cfg_file cfgs/nuscenes_models/cbgs_pv_rcnn_multihead_no_velo.yaml
2022-06-19 11:16:02,958 INFO batch_size 4
2022-06-19 11:16:02,958 INFO epochs 20
2022-06-19 11:16:02,958 INFO workers 4
2022-06-19 11:16:02,958 INFO extra_tag default
2022-06-19 11:16:02,958 INFO ckpt None
2022-06-19 11:16:02,958 INFO pretrained_model None
2022-06-19 11:16:02,958 INFO launcher none
2022-06-19 11:16:02,958 INFO tcp_port 18888
2022-06-19 11:16:02,958 INFO sync_bn False
2022-06-19 11:16:02,958 INFO fix_random_seed False
2022-06-19 11:16:02,958 INFO ckpt_save_interval 1
2022-06-19 11:16:02,958 INFO local_rank 0
2022-06-19 11:16:02,958 INFO max_ckpt_save_num 30
2022-06-19 11:16:02,958 INFO merge_all_iters_to_one_epoch False
2022-06-19 11:16:02,958 INFO set_cfgs None
2022-06-19 11:16:02,958 INFO max_waiting_mins 0
2022-06-19 11:16:02,958 INFO start_epoch 0
2022-06-19 11:16:02,958 INFO num_epochs_to_eval 0
2022-06-19 11:16:02,958 INFO save_to_file False
2022-06-19 11:16:02,958 INFO cfg.ROOT_DIR: /root/workspace/OpenPCDet
2022-06-19 11:16:02,958 INFO cfg.LOCAL_RANK: 0
2022-06-19 11:16:02,958 INFO cfg.CLASS_NAMES: ['car', 'truck', 'construction_vehicle', 'bus', 'trailer', 'barrier', 'motorcycle', 'bicycle', 'pedestrian', 'traffic_cone']
2022-06-19 11:16:02,958 INFO
cfg.DATA_CONFIG = edict()
2022-06-19 11:16:02,958 INFO cfg.DATA_CONFIG.DATASET: NuScenesDataset
2022-06-19 11:16:02,958 INFO cfg.DATA_CONFIG.DATA_PATH: ../data/nuscenes
2022-06-19 11:16:02,958 INFO cfg.DATA_CONFIG.VERSION: v1.0-trainval
2022-06-19 11:16:02,958 INFO cfg.DATA_CONFIG.MAX_SWEEPS: 10
2022-06-19 11:16:02,958 INFO cfg.DATA_CONFIG.PRED_VELOCITY: False
2022-06-19 11:16:02,958 INFO cfg.DATA_CONFIG.SET_NAN_VELOCITY_TO_ZEROS: True
2022-06-19 11:16:02,958 INFO cfg.DATA_CONFIG.FILTER_MIN_POINTS_IN_GT: 1
2022-06-19 11:16:02,958 INFO
cfg.DATA_CONFIG.DATA_SPLIT = edict()
2022-06-19 11:16:02,958 INFO cfg.DATA_CONFIG.DATA_SPLIT.train: train
2022-06-19 11:16:02,958 INFO cfg.DATA_CONFIG.DATA_SPLIT.test: val
2022-06-19 11:16:02,959 INFO
cfg.DATA_CONFIG.INFO_PATH = edict()
2022-06-19 11:16:02,959 INFO cfg.DATA_CONFIG.INFO_PATH.train: ['nuscenes_infos_10sweeps_train.pkl']
2022-06-19 11:16:02,959 INFO cfg.DATA_CONFIG.INFO_PATH.test: ['nuscenes_infos_10sweeps_val.pkl']
2022-06-19 11:16:02,959 INFO cfg.DATA_CONFIG.POINT_CLOUD_RANGE: [-51.2, -51.2, -5.0, 51.2, 51.2, 3.0]
2022-06-19 11:16:02,959 INFO cfg.DATA_CONFIG.BALANCED_RESAMPLING: True
2022-06-19 11:16:02,959 INFO
cfg.DATA_CONFIG.DATA_AUGMENTOR = edict()
2022-06-19 11:16:02,959 INFO cfg.DATA_CONFIG.DATA_AUGMENTOR.DISABLE_AUG_LIST: ['placeholder']
2022-06-19 11:16:02,959 INFO cfg.DATA_CONFIG.DATA_AUGMENTOR.AUG_CONFIG_LIST: [{'NAME': 'gt_sampling', 'DB_INFO_PATH': ['nuscenes_dbinfos_10sweeps_withvelo.pkl'], 'PREPARE': {'filter_by_min_points': ['car:5', 'truck:5', 'construction_vehicle:5', 'bus:5', 'trailer:5', 'barrier:5', 'motorcycle:5', 'bicycle:5', 'pedestrian:5', 'traffic_cone:5']}, 'SAMPLE_GROUPS': ['car:2', 'truck:3', 'construction_vehicle:7', 'bus:4', 'trailer:6', 'barrier:2', 'motorcycle:6', 'bicycle:6', 'pedestrian:2', 'traffic_cone:2'], 'NUM_POINT_FEATURES': 5, 'DATABASE_WITH_FAKELIDAR': False, 'REMOVE_EXTRA_WIDTH': [0.0, 0.0, 0.0], 'LIMIT_WHOLE_SCENE': True}, {'NAME': 'random_world_flip', 'ALONG_AXIS_LIST': ['x', 'y']}, {'NAME': 'random_world_rotation', 'WORLD_ROT_ANGLE': [-0.3925, 0.3925]}, {'NAME': 'random_world_scaling', 'WORLD_SCALE_RANGE': [0.95, 1.05]}]
2022-06-19 11:16:02,959 INFO
cfg.DATA_CONFIG.POINT_FEATURE_ENCODING = edict()
2022-06-19 11:16:02,959 INFO cfg.DATA_CONFIG.POINT_FEATURE_ENCODING.encoding_type: absolute_coordinates_encoding
2022-06-19 11:16:02,959 INFO cfg.DATA_CONFIG.POINT_FEATURE_ENCODING.used_feature_list: ['x', 'y', 'z', 'intensity', 'timestamp']
2022-06-19 11:16:02,959 INFO cfg.DATA_CONFIG.POINT_FEATURE_ENCODING.src_feature_list: ['x', 'y', 'z', 'intensity', 'timestamp']
2022-06-19 11:16:02,959 INFO cfg.DATA_CONFIG.DATA_PROCESSOR: [{'NAME': 'mask_points_and_boxes_outside_range', 'REMOVE_OUTSIDE_BOXES': True}, {'NAME': 'shuffle_points', 'SHUFFLE_ENABLED': {'train': True, 'test': True}}, {'NAME': 'transform_points_to_voxels', 'VOXEL_SIZE': [0.1, 0.1, 0.2], 'MAX_POINTS_PER_VOXEL': 10, 'MAX_NUMBER_OF_VOXELS': {'train': 60000, 'test': 60000}}]
2022-06-19 11:16:02,959 INFO cfg.DATA_CONFIG._BASE_CONFIG_: cfgs/dataset_configs/nuscenes_dataset_no_velocity.yaml
2022-06-19 11:16:02,959 INFO
cfg.MODEL = edict()
2022-06-19 11:16:02,959 INFO cfg.MODEL.NAME: PVRCNN
2022-06-19 11:16:02,959 INFO
cfg.MODEL.VFE = edict()
2022-06-19 11:16:02,959 INFO cfg.MODEL.VFE.NAME: MeanVFE
2022-06-19 11:16:02,959 INFO
cfg.MODEL.BACKBONE_3D = edict()
2022-06-19 11:16:02,959 INFO cfg.MODEL.BACKBONE_3D.NAME: VoxelBackBone8x
2022-06-19 11:16:02,959 INFO
cfg.MODEL.MAP_TO_BEV = edict()
2022-06-19 11:16:02,959 INFO cfg.MODEL.MAP_TO_BEV.NAME: HeightCompression
2022-06-19 11:16:02,959 INFO cfg.MODEL.MAP_TO_BEV.NUM_BEV_FEATURES: 256
2022-06-19 11:16:02,959 INFO
cfg.MODEL.BACKBONE_2D = edict()
2022-06-19 11:16:02,959 INFO cfg.MODEL.BACKBONE_2D.NAME: BaseBEVBackbone
2022-06-19 11:16:02,959 INFO cfg.MODEL.BACKBONE_2D.LAYER_NUMS: [5, 5]
2022-06-19 11:16:02,959 INFO cfg.MODEL.BACKBONE_2D.LAYER_STRIDES: [1, 2]
2022-06-19 11:16:02,959 INFO cfg.MODEL.BACKBONE_2D.NUM_FILTERS: [128, 256]
2022-06-19 11:16:02,959 INFO cfg.MODEL.BACKBONE_2D.UPSAMPLE_STRIDES: [1, 2]
2022-06-19 11:16:02,959 INFO cfg.MODEL.BACKBONE_2D.NUM_UPSAMPLE_FILTERS: [256, 256]
2022-06-19 11:16:02,959 INFO
cfg.MODEL.DENSE_HEAD = edict()
2022-06-19 11:16:02,959 INFO cfg.MODEL.DENSE_HEAD.NAME: AnchorHeadMulti
2022-06-19 11:16:02,959 INFO cfg.MODEL.DENSE_HEAD.CLASS_AGNOSTIC: False
2022-06-19 11:16:02,959 INFO cfg.MODEL.DENSE_HEAD.DIR_OFFSET: 0.78539
2022-06-19 11:16:02,959 INFO cfg.MODEL.DENSE_HEAD.DIR_LIMIT_OFFSET: 0.0
2022-06-19 11:16:02,959 INFO cfg.MODEL.DENSE_HEAD.NUM_DIR_BINS: 2
2022-06-19 11:16:02,959 INFO cfg.MODEL.DENSE_HEAD.USE_MULTIHEAD: True
2022-06-19 11:16:02,959 INFO cfg.MODEL.DENSE_HEAD.SEPARATE_MULTIHEAD: True
2022-06-19 11:16:02,959 INFO cfg.MODEL.DENSE_HEAD.ANCHOR_GENERATOR_CONFIG: [{'class_name': 'car', 'anchor_sizes': [[4.63, 1.97, 1.74]], 'anchor_rotations': [0, 1.57], 'anchor_bottom_heights': [-0.95], 'align_center': False, 'feature_map_stride': 8, 'matched_threshold': 0.6, 'unmatched_threshold': 0.45}, {'class_name': 'truck', 'anchor_sizes': [[6.93, 2.51, 2.84]], 'anchor_rotations': [0, 1.57], 'anchor_bottom_heights': [-0.6], 'align_center': False, 'feature_map_stride': 8, 'matched_threshold': 0.55, 'unmatched_threshold': 0.4}, {'class_name': 'construction_vehicle', 'anchor_sizes': [[6.37, 2.85, 3.19]], 'anchor_rotations': [0, 1.57], 'anchor_bottom_heights': [-0.225], 'align_center': False, 'feature_map_stride': 8, 'matched_threshold': 0.5, 'unmatched_threshold': 0.35}, {'class_name': 'bus', 'anchor_sizes': [[10.5, 2.94, 3.47]], 'anchor_rotations': [0, 1.57], 'anchor_bottom_heights': [-0.085], 'align_center': False, 'feature_map_stride': 8, 'matched_threshold': 0.55, 'unmatched_threshold': 0.4}, {'class_name': 'trailer', 'anchor_sizes': [[12.29, 2.9, 3.87]], 'anchor_rotations': [0, 1.57], 'anchor_bottom_heights': [0.115], 'align_center': False, 'feature_map_stride': 8, 'matched_threshold': 0.5, 'unmatched_threshold': 0.35}, {'class_name': 'barrier', 'anchor_sizes': [[0.5, 2.53, 0.98]], 'anchor_rotations': [0, 1.57], 'anchor_bottom_heights': [-1.33], 'align_center': False, 'feature_map_stride': 8, 'matched_threshold': 0.55, 'unmatched_threshold': 0.4}, {'class_name': 'motorcycle', 'anchor_sizes': [[2.11, 0.77, 1.47]], 'anchor_rotations': [0, 1.57], 'anchor_bottom_heights': [-1.085], 'align_center': False, 'feature_map_stride': 8, 'matched_threshold': 0.5, 'unmatched_threshold': 0.3}, {'class_name': 'bicycle', 'anchor_sizes': [[1.7, 0.6, 1.28]], 'anchor_rotations': [0, 1.57], 'anchor_bottom_heights': [-1.18], 'align_center': False, 'feature_map_stride': 8, 'matched_threshold': 0.5, 'unmatched_threshold': 0.35}, {'class_name': 'pedestrian', 'anchor_sizes': [[0.73, 0.67, 1.77]], 'anchor_rotations': [0, 1.57], 'anchor_bottom_heights': [-0.935], 'align_center': False, 'feature_map_stride': 8, 'matched_threshold': 0.6, 'unmatched_threshold': 0.4}, {'class_name': 'traffic_cone', 'anchor_sizes': [[0.41, 0.41, 1.07]], 'anchor_rotations': [0, 1.57], 'anchor_bottom_heights': [-1.285], 'align_center': False, 'feature_map_stride': 8, 'matched_threshold': 0.6, 'unmatched_threshold': 0.4}]
2022-06-19 11:16:02,959 INFO cfg.MODEL.DENSE_HEAD.SHARED_CONV_NUM_FILTER: 64
2022-06-19 11:16:02,959 INFO cfg.MODEL.DENSE_HEAD.RPN_HEAD_CFGS: [{'HEAD_CLS_NAME': ['car']}, {'HEAD_CLS_NAME': ['truck', 'construction_vehicle']}, {'HEAD_CLS_NAME': ['bus', 'trailer']}, {'HEAD_CLS_NAME': ['barrier']}, {'HEAD_CLS_NAME': ['motorcycle', 'bicycle']}, {'HEAD_CLS_NAME': ['pedestrian', 'traffic_cone']}]
2022-06-19 11:16:02,959 INFO
cfg.MODEL.DENSE_HEAD.SEPARATE_REG_CONFIG = edict()
2022-06-19 11:16:02,959 INFO cfg.MODEL.DENSE_HEAD.SEPARATE_REG_CONFIG.NUM_MIDDLE_CONV: 1
2022-06-19 11:16:02,959 INFO cfg.MODEL.DENSE_HEAD.SEPARATE_REG_CONFIG.NUM_MIDDLE_FILTER: 64
2022-06-19 11:16:02,959 INFO cfg.MODEL.DENSE_HEAD.SEPARATE_REG_CONFIG.REG_LIST: ['reg:2', 'height:1', 'size:3', 'angle:2']
2022-06-19 11:16:02,959 INFO
cfg.MODEL.DENSE_HEAD.TARGET_ASSIGNER_CONFIG = edict()
2022-06-19 11:16:02,959 INFO cfg.MODEL.DENSE_HEAD.TARGET_ASSIGNER_CONFIG.NAME: AxisAlignedTargetAssigner
2022-06-19 11:16:02,960 INFO cfg.MODEL.DENSE_HEAD.TARGET_ASSIGNER_CONFIG.POS_FRACTION: -1.0
2022-06-19 11:16:02,960 INFO cfg.MODEL.DENSE_HEAD.TARGET_ASSIGNER_CONFIG.SAMPLE_SIZE: 512
2022-06-19 11:16:02,960 INFO cfg.MODEL.DENSE_HEAD.TARGET_ASSIGNER_CONFIG.NORM_BY_NUM_EXAMPLES: False
2022-06-19 11:16:02,960 INFO cfg.MODEL.DENSE_HEAD.TARGET_ASSIGNER_CONFIG.MATCH_HEIGHT: False
2022-06-19 11:16:02,960 INFO cfg.MODEL.DENSE_HEAD.TARGET_ASSIGNER_CONFIG.BOX_CODER: ResidualCoder
2022-06-19 11:16:02,960 INFO
cfg.MODEL.DENSE_HEAD.TARGET_ASSIGNER_CONFIG.BOX_CODER_CONFIG = edict()
2022-06-19 11:16:02,960 INFO cfg.MODEL.DENSE_HEAD.TARGET_ASSIGNER_CONFIG.BOX_CODER_CONFIG.code_size: 7
2022-06-19 11:16:02,960 INFO cfg.MODEL.DENSE_HEAD.TARGET_ASSIGNER_CONFIG.BOX_CODER_CONFIG.encode_angle_by_sincos: True
2022-06-19 11:16:02,960 INFO
cfg.MODEL.DENSE_HEAD.LOSS_CONFIG = edict()
2022-06-19 11:16:02,960 INFO cfg.MODEL.DENSE_HEAD.LOSS_CONFIG.REG_LOSS_TYPE: WeightedL1Loss
2022-06-19 11:16:02,960 INFO
cfg.MODEL.DENSE_HEAD.LOSS_CONFIG.LOSS_WEIGHTS = edict()
2022-06-19 11:16:02,960 INFO cfg.MODEL.DENSE_HEAD.LOSS_CONFIG.LOSS_WEIGHTS.pos_cls_weight: 1.0
2022-06-19 11:16:02,960 INFO cfg.MODEL.DENSE_HEAD.LOSS_CONFIG.LOSS_WEIGHTS.neg_cls_weight: 2.0
2022-06-19 11:16:02,960 INFO cfg.MODEL.DENSE_HEAD.LOSS_CONFIG.LOSS_WEIGHTS.cls_weight: 1.0
2022-06-19 11:16:02,960 INFO cfg.MODEL.DENSE_HEAD.LOSS_CONFIG.LOSS_WEIGHTS.loc_weight: 0.25
2022-06-19 11:16:02,960 INFO cfg.MODEL.DENSE_HEAD.LOSS_CONFIG.LOSS_WEIGHTS.dir_weight: 0.2
2022-06-19 11:16:02,960 INFO cfg.MODEL.DENSE_HEAD.LOSS_CONFIG.LOSS_WEIGHTS.code_weights: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
2022-06-19 11:16:02,960 INFO
cfg.MODEL.PFE = edict()
2022-06-19 11:16:02,960 INFO cfg.MODEL.PFE.NAME: VoxelSetAbstraction
2022-06-19 11:16:02,960 INFO cfg.MODEL.PFE.POINT_SOURCE: raw_points
2022-06-19 11:16:02,960 INFO cfg.MODEL.PFE.NUM_KEYPOINTS: 4096
2022-06-19 11:16:02,960 INFO cfg.MODEL.PFE.NUM_OUTPUT_FEATURES: 128
2022-06-19 11:16:02,960 INFO cfg.MODEL.PFE.SAMPLE_METHOD: FPS
2022-06-19 11:16:02,960 INFO cfg.MODEL.PFE.FEATURES_SOURCE: ['bev', 'x_conv3', 'x_conv4', 'raw_points']
2022-06-19 11:16:02,960 INFO
cfg.MODEL.PFE.SA_LAYER = edict()
2022-06-19 11:16:02,960 INFO
cfg.MODEL.PFE.SA_LAYER.raw_points = edict()
2022-06-19 11:16:02,960 INFO cfg.MODEL.PFE.SA_LAYER.raw_points.MLPS: [[16, 16], [16, 16]]
2022-06-19 11:16:02,960 INFO cfg.MODEL.PFE.SA_LAYER.raw_points.POOL_RADIUS: [0.4, 0.8]
2022-06-19 11:16:02,960 INFO cfg.MODEL.PFE.SA_LAYER.raw_points.NSAMPLE: [16, 16]
2022-06-19 11:16:02,960 INFO
cfg.MODEL.PFE.SA_LAYER.x_conv1 = edict()
2022-06-19 11:16:02,960 INFO cfg.MODEL.PFE.SA_LAYER.x_conv1.DOWNSAMPLE_FACTOR: 1
2022-06-19 11:16:02,960 INFO cfg.MODEL.PFE.SA_LAYER.x_conv1.MLPS: [[16, 16], [16, 16]]
2022-06-19 11:16:02,960 INFO cfg.MODEL.PFE.SA_LAYER.x_conv1.POOL_RADIUS: [0.4, 0.8]
2022-06-19 11:16:02,960 INFO cfg.MODEL.PFE.SA_LAYER.x_conv1.NSAMPLE: [16, 16]
2022-06-19 11:16:02,960 INFO
cfg.MODEL.PFE.SA_LAYER.x_conv2 = edict()
2022-06-19 11:16:02,960 INFO cfg.MODEL.PFE.SA_LAYER.x_conv2.DOWNSAMPLE_FACTOR: 2
2022-06-19 11:16:02,960 INFO cfg.MODEL.PFE.SA_LAYER.x_conv2.MLPS: [[32, 32], [32, 32]]
2022-06-19 11:16:02,960 INFO cfg.MODEL.PFE.SA_LAYER.x_conv2.POOL_RADIUS: [0.8, 1.2]
2022-06-19 11:16:02,960 INFO cfg.MODEL.PFE.SA_LAYER.x_conv2.NSAMPLE: [16, 32]
2022-06-19 11:16:02,960 INFO
cfg.MODEL.PFE.SA_LAYER.x_conv3 = edict()
2022-06-19 11:16:02,960 INFO cfg.MODEL.PFE.SA_LAYER.x_conv3.DOWNSAMPLE_FACTOR: 4
2022-06-19 11:16:02,960 INFO cfg.MODEL.PFE.SA_LAYER.x_conv3.MLPS: [[64, 64], [64, 64]]
2022-06-19 11:16:02,960 INFO cfg.MODEL.PFE.SA_LAYER.x_conv3.POOL_RADIUS: [1.2, 2.4]
2022-06-19 11:16:02,960 INFO cfg.MODEL.PFE.SA_LAYER.x_conv3.NSAMPLE: [16, 32]
2022-06-19 11:16:02,960 INFO
cfg.MODEL.PFE.SA_LAYER.x_conv4 = edict()
2022-06-19 11:16:02,960 INFO cfg.MODEL.PFE.SA_LAYER.x_conv4.DOWNSAMPLE_FACTOR: 8
2022-06-19 11:16:02,960 INFO cfg.MODEL.PFE.SA_LAYER.x_conv4.MLPS: [[64, 64], [64, 64]]
2022-06-19 11:16:02,960 INFO cfg.MODEL.PFE.SA_LAYER.x_conv4.POOL_RADIUS: [2.4, 4.8]
2022-06-19 11:16:02,960 INFO cfg.MODEL.PFE.SA_LAYER.x_conv4.NSAMPLE: [16, 32]
2022-06-19 11:16:02,960 INFO
cfg.MODEL.POINT_HEAD = edict()
2022-06-19 11:16:02,960 INFO cfg.MODEL.POINT_HEAD.NAME: PointHeadSimple
2022-06-19 11:16:02,961 INFO cfg.MODEL.POINT_HEAD.CLS_FC: [256, 256]
2022-06-19 11:16:02,961 INFO cfg.MODEL.POINT_HEAD.CLASS_AGNOSTIC: True
2022-06-19 11:16:02,961 INFO cfg.MODEL.POINT_HEAD.USE_POINT_FEATURES_BEFORE_FUSION: True
2022-06-19 11:16:02,961 INFO
cfg.MODEL.POINT_HEAD.TARGET_CONFIG = edict()
2022-06-19 11:16:02,961 INFO cfg.MODEL.POINT_HEAD.TARGET_CONFIG.GT_EXTRA_WIDTH: [0.2, 0.2, 0.2]
2022-06-19 11:16:02,961 INFO
cfg.MODEL.POINT_HEAD.LOSS_CONFIG = edict()
2022-06-19 11:16:02,961 INFO cfg.MODEL.POINT_HEAD.LOSS_CONFIG.LOSS_REG: smooth-l1
2022-06-19 11:16:02,961 INFO
cfg.MODEL.POINT_HEAD.LOSS_CONFIG.LOSS_WEIGHTS = edict()
2022-06-19 11:16:02,961 INFO cfg.MODEL.POINT_HEAD.LOSS_CONFIG.LOSS_WEIGHTS.point_cls_weight: 1.0
2022-06-19 11:16:02,961 INFO
cfg.MODEL.ROI_HEAD = edict()
2022-06-19 11:16:02,961 INFO cfg.MODEL.ROI_HEAD.NAME: PVRCNNHead_NUSCENES_v1
2022-06-19 11:16:02,961 INFO cfg.MODEL.ROI_HEAD.CLASS_AGNOSTIC: True
2022-06-19 11:16:02,961 INFO cfg.MODEL.ROI_HEAD.SHARED_FC: [256, 256]
2022-06-19 11:16:02,961 INFO cfg.MODEL.ROI_HEAD.CLS_FC: [256, 256]
2022-06-19 11:16:02,961 INFO cfg.MODEL.ROI_HEAD.REG_FC: [256, 256]
2022-06-19 11:16:02,961 INFO cfg.MODEL.ROI_HEAD.DP_RATIO: 0.3
2022-06-19 11:16:02,961 INFO cfg.MODEL.ROI_HEAD.SEPARATE_MULTIHEAD: True
2022-06-19 11:16:02,961 INFO
cfg.MODEL.ROI_HEAD.NMS_CONFIG = edict()
2022-06-19 11:16:02,961 INFO
cfg.MODEL.ROI_HEAD.NMS_CONFIG.TRAIN = edict()
2022-06-19 11:16:02,961 INFO cfg.MODEL.ROI_HEAD.NMS_CONFIG.TRAIN.NMS_TYPE: nms_gpu
2022-06-19 11:16:02,961 INFO cfg.MODEL.ROI_HEAD.NMS_CONFIG.TRAIN.MULTI_CLASSES_NMS: False
2022-06-19 11:16:02,961 INFO cfg.MODEL.ROI_HEAD.NMS_CONFIG.TRAIN.NMS_PRE_MAXSIZE: 9000
2022-06-19 11:16:02,961 INFO cfg.MODEL.ROI_HEAD.NMS_CONFIG.TRAIN.NMS_POST_MAXSIZE: 128
2022-06-19 11:16:02,961 INFO cfg.MODEL.ROI_HEAD.NMS_CONFIG.TRAIN.NUM_HEAD: 6
2022-06-19 11:16:02,961 INFO cfg.MODEL.ROI_HEAD.NMS_CONFIG.TRAIN.NMS_THRESH: 0.8
2022-06-19 11:16:02,961 INFO
cfg.MODEL.ROI_HEAD.NMS_CONFIG.TEST = edict()
2022-06-19 11:16:02,961 INFO cfg.MODEL.ROI_HEAD.NMS_CONFIG.TEST.NMS_TYPE: nms_gpu
2022-06-19 11:16:02,961 INFO cfg.MODEL.ROI_HEAD.NMS_CONFIG.TEST.MULTI_CLASSES_NMS: False
2022-06-19 11:16:02,961 INFO cfg.MODEL.ROI_HEAD.NMS_CONFIG.TEST.NMS_PRE_MAXSIZE: 4096
2022-06-19 11:16:02,961 INFO cfg.MODEL.ROI_HEAD.NMS_CONFIG.TEST.NMS_POST_MAXSIZE: 300
2022-06-19 11:16:02,961 INFO cfg.MODEL.ROI_HEAD.NMS_CONFIG.TEST.NMS_THRESH: 0.85
2022-06-19 11:16:02,961 INFO
cfg.MODEL.ROI_HEAD.ROI_GRID_POOL = edict()
2022-06-19 11:16:02,961 INFO cfg.MODEL.ROI_HEAD.ROI_GRID_POOL.GRID_SIZE: 6
2022-06-19 11:16:02,961 INFO cfg.MODEL.ROI_HEAD.ROI_GRID_POOL.MLPS: [[64, 64], [64, 64]]
2022-06-19 11:16:02,961 INFO cfg.MODEL.ROI_HEAD.ROI_GRID_POOL.POOL_RADIUS: [0.8, 1.6]
2022-06-19 11:16:02,961 INFO cfg.MODEL.ROI_HEAD.ROI_GRID_POOL.NSAMPLE: [16, 16]
2022-06-19 11:16:02,961 INFO cfg.MODEL.ROI_HEAD.ROI_GRID_POOL.POOL_METHOD: max_pool
2022-06-19 11:16:02,961 INFO
cfg.MODEL.ROI_HEAD.TARGET_CONFIG = edict()
2022-06-19 11:16:02,961 INFO cfg.MODEL.ROI_HEAD.TARGET_CONFIG.BOX_CODER: ResidualCoder
2022-06-19 11:16:02,961 INFO cfg.MODEL.ROI_HEAD.TARGET_CONFIG.ROI_PER_IMAGE: 128
2022-06-19 11:16:02,961 INFO cfg.MODEL.ROI_HEAD.TARGET_CONFIG.FG_RATIO: 0.5
2022-06-19 11:16:02,961 INFO cfg.MODEL.ROI_HEAD.TARGET_CONFIG.SAMPLE_ROI_BY_EACH_CLASS: True
2022-06-19 11:16:02,961 INFO cfg.MODEL.ROI_HEAD.TARGET_CONFIG.CLS_SCORE_TYPE: roi_iou
2022-06-19 11:16:02,961 INFO cfg.MODEL.ROI_HEAD.TARGET_CONFIG.CLS_FG_THRESH: 0.75
2022-06-19 11:16:02,961 INFO cfg.MODEL.ROI_HEAD.TARGET_CONFIG.CLS_BG_THRESH: 0.25
2022-06-19 11:16:02,961 INFO cfg.MODEL.ROI_HEAD.TARGET_CONFIG.CLS_BG_THRESH_LO: 0.1
2022-06-19 11:16:02,961 INFO cfg.MODEL.ROI_HEAD.TARGET_CONFIG.HARD_BG_RATIO: 0.8
2022-06-19 11:16:02,961 INFO cfg.MODEL.ROI_HEAD.TARGET_CONFIG.REG_FG_THRESH: 0.55
2022-06-19 11:16:02,961 INFO
cfg.MODEL.ROI_HEAD.LOSS_CONFIG = edict()
2022-06-19 11:16:02,961 INFO cfg.MODEL.ROI_HEAD.LOSS_CONFIG.CLS_LOSS: BinaryCrossEntropy
2022-06-19 11:16:02,961 INFO cfg.MODEL.ROI_HEAD.LOSS_CONFIG.REG_LOSS: smooth-l1
2022-06-19 11:16:02,961 INFO cfg.MODEL.ROI_HEAD.LOSS_CONFIG.CORNER_LOSS_REGULARIZATION: True
2022-06-19 11:16:02,961 INFO
cfg.MODEL.ROI_HEAD.LOSS_CONFIG.LOSS_WEIGHTS = edict()
2022-06-19 11:16:02,962 INFO cfg.MODEL.ROI_HEAD.LOSS_CONFIG.LOSS_WEIGHTS.rcnn_cls_weight: 1.0
2022-06-19 11:16:02,962 INFO cfg.MODEL.ROI_HEAD.LOSS_CONFIG.LOSS_WEIGHTS.rcnn_reg_weight: 1.0
2022-06-19 11:16:02,962 INFO cfg.MODEL.ROI_HEAD.LOSS_CONFIG.LOSS_WEIGHTS.rcnn_corner_weight: 1.0
2022-06-19 11:16:02,962 INFO cfg.MODEL.ROI_HEAD.LOSS_CONFIG.LOSS_WEIGHTS.code_weights: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
2022-06-19 11:16:02,962 INFO
cfg.MODEL.POST_PROCESSING = edict()
2022-06-19 11:16:02,962 INFO cfg.MODEL.POST_PROCESSING.RECALL_THRESH_LIST: [0.3, 0.5, 0.7]
2022-06-19 11:16:02,962 INFO cfg.MODEL.POST_PROCESSING.SCORE_THRESH: 0.1
2022-06-19 11:16:02,962 INFO cfg.MODEL.POST_PROCESSING.OUTPUT_RAW_SCORE: False
2022-06-19 11:16:02,962 INFO cfg.MODEL.POST_PROCESSING.EVAL_METRIC: kitti
2022-06-19 11:16:02,962 INFO
cfg.MODEL.POST_PROCESSING.NMS_CONFIG = edict()
2022-06-19 11:16:02,962 INFO cfg.MODEL.POST_PROCESSING.NMS_CONFIG.MULTI_CLASSES_NMS: True
2022-06-19 11:16:02,962 INFO cfg.MODEL.POST_PROCESSING.NMS_CONFIG.NMS_TYPE: nms_gpu
2022-06-19 11:16:02,962 INFO cfg.MODEL.POST_PROCESSING.NMS_CONFIG.NMS_THRESH: 0.2
2022-06-19 11:16:02,962 INFO cfg.MODEL.POST_PROCESSING.NMS_CONFIG.NMS_PRE_MAXSIZE: 1000
2022-06-19 11:16:02,962 INFO cfg.MODEL.POST_PROCESSING.NMS_CONFIG.NMS_POST_MAXSIZE: 83
2022-06-19 11:16:02,962 INFO
cfg.OPTIMIZATION = edict()
2022-06-19 11:16:02,962 INFO cfg.OPTIMIZATION.BATCH_SIZE_PER_GPU: 4
2022-06-19 11:16:02,962 INFO cfg.OPTIMIZATION.NUM_EPOCHS: 20
2022-06-19 11:16:02,962 INFO cfg.OPTIMIZATION.OPTIMIZER: adam_onecycle
2022-06-19 11:16:02,962 INFO cfg.OPTIMIZATION.LR: 0.003
2022-06-19 11:16:02,962 INFO cfg.OPTIMIZATION.WEIGHT_DECAY: 0.01
2022-06-19 11:16:02,962 INFO cfg.OPTIMIZATION.MOMENTUM: 0.9
2022-06-19 11:16:02,962 INFO cfg.OPTIMIZATION.MOMS: [0.95, 0.85]
2022-06-19 11:16:02,962 INFO cfg.OPTIMIZATION.PCT_START: 0.4
2022-06-19 11:16:02,962 INFO cfg.OPTIMIZATION.DIV_FACTOR: 10
2022-06-19 11:16:02,962 INFO cfg.OPTIMIZATION.DECAY_STEP_LIST: [35, 45]
2022-06-19 11:16:02,962 INFO cfg.OPTIMIZATION.LR_DECAY: 0.1
2022-06-19 11:16:02,962 INFO cfg.OPTIMIZATION.LR_CLIP: 1e-07
2022-06-19 11:16:02,962 INFO cfg.OPTIMIZATION.LR_WARMUP: False
2022-06-19 11:16:02,962 INFO cfg.OPTIMIZATION.WARMUP_EPOCH: 1
2022-06-19 11:16:02,962 INFO cfg.OPTIMIZATION.GRAD_NORM_CLIP: 10
2022-06-19 11:16:02,962 INFO cfg.TAG: cbgs_pv_rcnn_multihead_no_velo
2022-06-19 11:16:02,962 INFO cfg.EXP_GROUP_PATH: nuscenes_models
2022-06-19 11:16:04,884 INFO Database filter by min points car: 339949 => 294532
2022-06-19 11:16:04,892 INFO Database filter by min points truck: 65262 => 60344
2022-06-19 11:16:04,894 INFO Database filter by min points construction_vehicle: 11050 => 10589
2022-06-19 11:16:04,895 INFO Database filter by min points bus: 12286 => 11619
2022-06-19 11:16:04,896 INFO Database filter by min points trailer: 19202 => 17934
2022-06-19 11:16:04,905 INFO Database filter by min points barrier: 107507 => 101993
2022-06-19 11:16:04,907 INFO Database filter by min points motorcycle: 8846 => 8055
2022-06-19 11:16:04,908 INFO Database filter by min points bicycle: 8185 => 7531
2022-06-19 11:16:04,921 INFO Database filter by min points pedestrian: 161928 => 148520
2022-06-19 11:16:04,927 INFO Database filter by min points traffic_cone: 62964 => 55504
2022-06-19 11:16:05,000 INFO Loading
NuScenes dataset 2022-06-19 11:16:06,442 INFO Total samples for NuScenes dataset: 28130 2022-06-19 11:16:06,652 INFO Total samples after balanced resampling: 123580 2022-06-19 11:16:13,495 INFO PVRCNN( (vfe): MeanVFE() (backbone_3d): VoxelBackBone8x( (conv_input): SparseSequential( (0): SubMConv3d(5, 16, kernel_size=[3, 3, 3], stride=[1, 1, 1], padding=[1, 1, 1], dilation=[1, 1, 1], output_padding=[0, 0, 0], bias=False, algo=ConvAlgo.MaskImplicitGemm) (1): BatchNorm1d(16, eps=0.001, momentum=0.01, affine=True, track_running_stats=True) (2): ReLU() ) (conv1): SparseSequential( (0): SparseSequential( (0): SubMConv3d(16, 16, kernel_size=[3, 3, 3], stride=[1, 1, 1], padding=[0, 0, 0], dilation=[1, 1, 1], output_padding=[0, 0, 0], bias=False, algo=ConvAlgo.MaskImplicitGemm) (1): BatchNorm1d(16, eps=0.001, momentum=0.01, affine=True, track_running_stats=True) (2): ReLU() ) ) (conv2): SparseSequential( (0): SparseSequential( (0): SparseConv3d(16, 32, kernel_size=[3, 3, 3], stride=[2, 2, 2], padding=[1, 1, 1], dilation=[1, 1, 1], output_padding=[0, 0, 0], bias=False, algo=ConvAlgo.MaskImplicitGemm) (1): BatchNorm1d(32, eps=0.001, momentum=0.01, affine=True, track_running_stats=True) (2): ReLU() ) (1): SparseSequential( (0): SubMConv3d(32, 32, kernel_size=[3, 3, 3], stride=[1, 1, 1], padding=[0, 0, 0], dilation=[1, 1, 1], output_padding=[0, 0, 0], bias=False, algo=ConvAlgo.MaskImplicitGemm) (1): BatchNorm1d(32, eps=0.001, momentum=0.01, affine=True, track_running_stats=True) (2): ReLU() ) (2): SparseSequential( (0): SubMConv3d(32, 32, kernel_size=[3, 3, 3], stride=[1, 1, 1], padding=[0, 0, 0], dilation=[1, 1, 1], output_padding=[0, 0, 0], bias=False, algo=ConvAlgo.MaskImplicitGemm) (1): BatchNorm1d(32, eps=0.001, momentum=0.01, affine=True, track_running_stats=True) (2): ReLU() ) ) (conv3): SparseSequential( (0): SparseSequential( (0): SparseConv3d(32, 64, kernel_size=[3, 3, 3], stride=[2, 2, 2], padding=[1, 1, 1], dilation=[1, 1, 1], output_padding=[0, 0, 0], bias=False, 
algo=ConvAlgo.MaskImplicitGemm) (1): BatchNorm1d(64, eps=0.001, momentum=0.01, affine=True, track_running_stats=True) (2): ReLU() ) (1): SparseSequential( (0): SubMConv3d(64, 64, kernel_size=[3, 3, 3], stride=[1, 1, 1], padding=[0, 0, 0], dilation=[1, 1, 1], output_padding=[0, 0, 0], bias=False, algo=ConvAlgo.MaskImplicitGemm) (1): BatchNorm1d(64, eps=0.001, momentum=0.01, affine=True, track_running_stats=True) (2): ReLU() ) (2): SparseSequential( (0): SubMConv3d(64, 64, kernel_size=[3, 3, 3], stride=[1, 1, 1], padding=[0, 0, 0], dilation=[1, 1, 1], output_padding=[0, 0, 0], bias=False, algo=ConvAlgo.MaskImplicitGemm) (1): BatchNorm1d(64, eps=0.001, momentum=0.01, affine=True, track_running_stats=True) (2): ReLU() ) ) (conv4): SparseSequential( (0): SparseSequential( (0): SparseConv3d(64, 64, kernel_size=[3, 3, 3], stride=[2, 2, 2], padding=[0, 1, 1], dilation=[1, 1, 1], output_padding=[0, 0, 0], bias=False, algo=ConvAlgo.MaskImplicitGemm) (1): BatchNorm1d(64, eps=0.001, momentum=0.01, affine=True, track_running_stats=True) (2): ReLU() ) (1): SparseSequential( (0): SubMConv3d(64, 64, kernel_size=[3, 3, 3], stride=[1, 1, 1], padding=[0, 0, 0], dilation=[1, 1, 1], output_padding=[0, 0, 0], bias=False, algo=ConvAlgo.MaskImplicitGemm) (1): BatchNorm1d(64, eps=0.001, momentum=0.01, affine=True, track_running_stats=True) (2): ReLU() ) (2): SparseSequential( (0): SubMConv3d(64, 64, kernel_size=[3, 3, 3], stride=[1, 1, 1], padding=[0, 0, 0], dilation=[1, 1, 1], output_padding=[0, 0, 0], bias=False, algo=ConvAlgo.MaskImplicitGemm) (1): BatchNorm1d(64, eps=0.001, momentum=0.01, affine=True, track_running_stats=True) (2): ReLU() ) ) (conv_out): SparseSequential( (0): SparseConv3d(64, 128, kernel_size=[3, 1, 1], stride=[2, 1, 1], padding=[0, 0, 0], dilation=[1, 1, 1], output_padding=[0, 0, 0], bias=False, algo=ConvAlgo.MaskImplicitGemm) (1): BatchNorm1d(128, eps=0.001, momentum=0.01, affine=True, track_running_stats=True) (2): ReLU() ) ) (map_to_bev_module): 
HeightCompression() (pfe): VoxelSetAbstraction( (SA_layers): ModuleList( (0): StackSAModuleMSG( (groupers): ModuleList( (0): QueryAndGroup() (1): QueryAndGroup() ) (mlps): ModuleList( (0): Sequential( (0): Conv2d(67, 64, kernel_size=(1, 1), stride=(1, 1), bias=False) (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (2): ReLU() (3): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False) (4): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (5): ReLU() ) (1): Sequential( (0): Conv2d(67, 64, kernel_size=(1, 1), stride=(1, 1), bias=False) (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (2): ReLU() (3): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False) (4): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (5): ReLU() ) ) ) (1): StackSAModuleMSG( (groupers): ModuleList( (0): QueryAndGroup() (1): QueryAndGroup() ) (mlps): ModuleList( (0): Sequential( (0): Conv2d(67, 64, kernel_size=(1, 1), stride=(1, 1), bias=False) (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (2): ReLU() (3): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False) (4): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (5): ReLU() ) (1): Sequential( (0): Conv2d(67, 64, kernel_size=(1, 1), stride=(1, 1), bias=False) (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (2): ReLU() (3): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False) (4): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (5): ReLU() ) ) ) ) (SA_rawpoints): StackSAModuleMSG( (groupers): ModuleList( (0): QueryAndGroup() (1): QueryAndGroup() ) (mlps): ModuleList( (0): Sequential( (0): Conv2d(5, 16, kernel_size=(1, 1), stride=(1, 1), bias=False) (1): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (2): ReLU() (3): 
Conv2d(16, 16, kernel_size=(1, 1), stride=(1, 1), bias=False) (4): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (5): ReLU() ) (1): Sequential( (0): Conv2d(5, 16, kernel_size=(1, 1), stride=(1, 1), bias=False) (1): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (2): ReLU() (3): Conv2d(16, 16, kernel_size=(1, 1), stride=(1, 1), bias=False) (4): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (5): ReLU() ) ) ) (vsa_point_feature_fusion): Sequential( (0): Linear(in_features=544, out_features=128, bias=False) (1): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (2): ReLU() ) ) (backbone_2d): BaseBEVBackbone( (blocks): ModuleList( (0): Sequential( (0): ZeroPad2d(padding=(1, 1, 1, 1), value=0.0) (1): Conv2d(256, 128, kernel_size=(3, 3), stride=(1, 1), bias=False) (2): BatchNorm2d(128, eps=0.001, momentum=0.01, affine=True, track_running_stats=True) (3): ReLU() (4): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (5): BatchNorm2d(128, eps=0.001, momentum=0.01, affine=True, track_running_stats=True) (6): ReLU() (7): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (8): BatchNorm2d(128, eps=0.001, momentum=0.01, affine=True, track_running_stats=True) (9): ReLU() (10): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (11): BatchNorm2d(128, eps=0.001, momentum=0.01, affine=True, track_running_stats=True) (12): ReLU() (13): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (14): BatchNorm2d(128, eps=0.001, momentum=0.01, affine=True, track_running_stats=True) (15): ReLU() (16): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (17): BatchNorm2d(128, eps=0.001, momentum=0.01, affine=True, track_running_stats=True) (18): ReLU() ) (1): Sequential( (0): ZeroPad2d(padding=(1, 1, 1, 1), value=0.0) (1): 
Conv2d(128, 256, kernel_size=(3, 3), stride=(2, 2), bias=False) (2): BatchNorm2d(256, eps=0.001, momentum=0.01, affine=True, track_running_stats=True) (3): ReLU() (4): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (5): BatchNorm2d(256, eps=0.001, momentum=0.01, affine=True, track_running_stats=True) (6): ReLU() (7): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (8): BatchNorm2d(256, eps=0.001, momentum=0.01, affine=True, track_running_stats=True) (9): ReLU() (10): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (11): BatchNorm2d(256, eps=0.001, momentum=0.01, affine=True, track_running_stats=True) (12): ReLU() (13): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (14): BatchNorm2d(256, eps=0.001, momentum=0.01, affine=True, track_running_stats=True) (15): ReLU() (16): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (17): BatchNorm2d(256, eps=0.001, momentum=0.01, affine=True, track_running_stats=True) (18): ReLU() ) ) (deblocks): ModuleList( (0): Sequential( (0): ConvTranspose2d(128, 256, kernel_size=(1, 1), stride=(1, 1), bias=False) (1): BatchNorm2d(256, eps=0.001, momentum=0.01, affine=True, track_running_stats=True) (2): ReLU() ) (1): Sequential( (0): ConvTranspose2d(256, 256, kernel_size=(2, 2), stride=(2, 2), bias=False) (1): BatchNorm2d(256, eps=0.001, momentum=0.01, affine=True, track_running_stats=True) (2): ReLU() ) ) ) (dense_head): AnchorHeadMulti( (cls_loss_func): SigmoidFocalClassificationLoss() (reg_loss_func): WeightedL1Loss() (dir_loss_func): WeightedCrossEntropyLoss() (shared_conv): Sequential( (0): Conv2d(512, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (1): BatchNorm2d(64, eps=0.001, momentum=0.01, affine=True, track_running_stats=True) (2): ReLU() ) (rpn_heads): ModuleList( (0): SingleHead( (blocks): ModuleList() (deblocks): ModuleList() (conv_box): ModuleDict( 
(conv_reg): Sequential( (0): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (2): ReLU() (3): Conv2d(64, 4, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) ) (conv_height): Sequential( (0): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (2): ReLU() (3): Conv2d(64, 2, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) ) (conv_size): Sequential( (0): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (2): ReLU() (3): Conv2d(64, 6, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) ) (conv_angle): Sequential( (0): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (2): ReLU() (3): Conv2d(64, 4, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) ) ) (conv_cls): Sequential( (0): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (2): ReLU() (3): Conv2d(64, 2, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) ) ) (1): SingleHead( (blocks): ModuleList() (deblocks): ModuleList() (conv_box): ModuleDict( (conv_reg): Sequential( (0): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (2): ReLU() (3): Conv2d(64, 8, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) ) (conv_height): Sequential( (0): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (2): ReLU() (3): Conv2d(64, 4, kernel_size=(3, 
3), stride=(1, 1), padding=(1, 1)) ) (conv_size): Sequential( (0): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (2): ReLU() (3): Conv2d(64, 12, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) ) (conv_angle): Sequential( (0): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (2): ReLU() (3): Conv2d(64, 8, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) ) ) (conv_cls): Sequential( (0): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (2): ReLU() (3): Conv2d(64, 8, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) ) ) (2): SingleHead( (blocks): ModuleList() (deblocks): ModuleList() (conv_box): ModuleDict( (conv_reg): Sequential( (0): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (2): ReLU() (3): Conv2d(64, 8, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) ) (conv_height): Sequential( (0): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (2): ReLU() (3): Conv2d(64, 4, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) ) (conv_size): Sequential( (0): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (2): ReLU() (3): Conv2d(64, 12, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) ) (conv_angle): Sequential( (0): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (2): 
ReLU() (3): Conv2d(64, 8, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) ) ) (conv_cls): Sequential( (0): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (2): ReLU() (3): Conv2d(64, 8, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) ) ) (3): SingleHead( (blocks): ModuleList() (deblocks): ModuleList() (conv_box): ModuleDict( (conv_reg): Sequential( (0): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (2): ReLU() (3): Conv2d(64, 4, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) ) (conv_height): Sequential( (0): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (2): ReLU() (3): Conv2d(64, 2, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) ) (conv_size): Sequential( (0): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (2): ReLU() (3): Conv2d(64, 6, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) ) (conv_angle): Sequential( (0): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (2): ReLU() (3): Conv2d(64, 4, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) ) ) (conv_cls): Sequential( (0): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (2): ReLU() (3): Conv2d(64, 2, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) ) ) (4): SingleHead( (blocks): ModuleList() (deblocks): ModuleList() (conv_box): ModuleDict( (conv_reg): Sequential( (0): Conv2d(64, 64, kernel_size=(3, 3), 
stride=(1, 1), padding=(1, 1), bias=False) (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (2): ReLU() (3): Conv2d(64, 8, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) ) (conv_height): Sequential( (0): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (2): ReLU() (3): Conv2d(64, 4, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) ) (conv_size): Sequential( (0): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (2): ReLU() (3): Conv2d(64, 12, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) ) (conv_angle): Sequential( (0): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (2): ReLU() (3): Conv2d(64, 8, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) ) ) (conv_cls): Sequential( (0): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (2): ReLU() (3): Conv2d(64, 8, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) ) ) (5): SingleHead( (blocks): ModuleList() (deblocks): ModuleList() (conv_box): ModuleDict( (conv_reg): Sequential( (0): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (2): ReLU() (3): Conv2d(64, 8, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) ) (conv_height): Sequential( (0): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (2): ReLU() (3): Conv2d(64, 4, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) ) (conv_size): Sequential( 
(0): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (2): ReLU() (3): Conv2d(64, 12, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) ) (conv_angle): Sequential( (0): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (2): ReLU() (3): Conv2d(64, 8, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) ) ) (conv_cls): Sequential( (0): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False) (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (2): ReLU() (3): Conv2d(64, 8, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) ) ) ) ) (point_head): PointHeadSimple( (cls_loss_func): SigmoidFocalClassificationLoss() (cls_layers): Sequential( (0): Linear(in_features=544, out_features=256, bias=False) (1): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (2): ReLU() (3): Linear(in_features=256, out_features=256, bias=False) (4): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (5): ReLU() (6): Linear(in_features=256, out_features=1, bias=True) ) ) (roi_head): PVRCNNHead_NUSCENES_v1( (proposal_target_layer): ProposalTargetLayer() (reg_loss_func): WeightedSmoothL1Loss() (roi_grid_pool_layer): StackSAModuleMSG( (groupers): ModuleList( (0): QueryAndGroup() (1): QueryAndGroup() ) (mlps): ModuleList( (0): Sequential( (0): Conv2d(131, 64, kernel_size=(1, 1), stride=(1, 1), bias=False) (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (2): ReLU() (3): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False) (4): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (5): ReLU() ) (1): Sequential( (0): Conv2d(131, 64, kernel_size=(1, 1), stride=(1, 1), bias=False) (1): 
BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (2): ReLU() (3): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False) (4): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (5): ReLU() ) ) ) (shared_fc_layer): Sequential( (0): Conv1d(27648, 256, kernel_size=(1,), stride=(1,), bias=False) (1): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (2): ReLU() (3): Dropout(p=0.3, inplace=False) (4): Conv1d(256, 256, kernel_size=(1,), stride=(1,), bias=False) (5): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (6): ReLU() ) (cls_layers): Sequential( (0): Conv1d(256, 256, kernel_size=(1,), stride=(1,), bias=False) (1): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (2): ReLU() (3): Dropout(p=0.3, inplace=False) (4): Conv1d(256, 256, kernel_size=(1,), stride=(1,), bias=False) (5): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (6): ReLU() (7): Conv1d(256, 1, kernel_size=(1,), stride=(1,)) ) (reg_layers): Sequential( (0): Conv1d(256, 256, kernel_size=(1,), stride=(1,), bias=False) (1): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (2): ReLU() (3): Dropout(p=0.3, inplace=False) (4): Conv1d(256, 256, kernel_size=(1,), stride=(1,), bias=False) (5): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (6): ReLU() (7): Conv1d(256, 7, kernel_size=(1,), stride=(1,)) ) ) ) 2022-06-19 11:16:13,498 INFO **Start training nuscenes_models/cbgs_pv_rcnn_multihead_no_velo(default)** epochs: 0%| | 0/20 [00:00<?, ?it/s] train: 0%| | 0/30895 [00:00<?, ?it/s] train: 0%| | 1/30895 [00:23<203:49:59, 23.75s/it] epochs: 0%| | 0/20 [00:24<?, ?it/s, loss=11.8, lr=0.0003, d_time=3.47(3.47), f train: 0%| | 2/30895 [00:29<111:07:03, 12.95s/it, total_it=1] epochs: 0%| | 0/20 [00:29<?, ?it/s, loss=nan, lr=0.0003, d_time=0.00(1.74), f_NaN or Inf 
found in input tensor.
NaN or Inf found in input tensor.
NaN or Inf found in input tensor.
maxoverlaps:(min=nan, max=nan)
ERROR: FG=0, BG=0
epochs: 0%| | 0/20 [00:33<?, ?it/s, loss=nan, lr=0.0003, d_time=0.00(1.74), f
Traceback (most recent call last):
  File "/root/workspace/OpenPCDet/tools/train.py", line 211, in <module>
    main()
  File "/root/workspace/OpenPCDet/tools/train.py", line 163, in main
    train_model(
  File "/root/workspace/OpenPCDet/tools/train_utils/train_utils.py", line 110, in train_model
    accumulated_iter = train_one_epoch(
  File "/root/workspace/OpenPCDet/tools/train_utils/train_utils.py", line 46, in train_one_epoch
    loss, tb_dict, disp_dict = model_func(model, batch)
  File "/root/workspace/OpenPCDet/tools/../pcdet/models/__init__.py", line 43, in model_func
    ret_dict, tb_dict, disp_dict = model(batch_dict)
  File "/root/anaconda3/envs/OpenPCDet/lib/python3.9/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/workspace/OpenPCDet/tools/../pcdet/models/detectors/pv_rcnn.py", line 11, in forward
    batch_dict = cur_module(batch_dict)
  File "/root/anaconda3/envs/OpenPCDet/lib/python3.9/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/workspace/OpenPCDet/tools/../pcdet/models/roi_heads/pvrcnn_head_for_nuscenes_v1.py", line 238, in forward
  File "/root/workspace/OpenPCDet/tools/../pcdet/models/roi_heads/roi_head_template.py", line 135, in assign_targets
  File "/root/workspace/OpenPCDet/tools/../pcdet/models/roi_heads/target_assigner/proposal_target_layer.py", line 37, in forward
  File "/root/workspace/OpenPCDet/tools/../pcdet/models/roi_heads/target_assigner/proposal_target_layer.py", line 138, in sample_rois_for_rcnn
  File "/root/workspace/OpenPCDet/tools/../pcdet/models/roi_heads/target_assigner/proposal_target_layer.py", line 190, in subsample_rois
NotImplementedError

While training on the nuScenes dataset I hit this error, and I don't know how to fix it. I tried changing spconv from 2.1.21 to 2.1.1 and to 2.0.2, but the problem persists. My environment: pcdet 0.5.2, pytorch==1.8.0, CUDA 11.1.
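A note on the final NotImplementedError: it looks like a symptom rather than the root cause. The log prints maxoverlaps:(min=nan, max=nan) and FG=0, BG=0 just before it, and every comparison against NaN is False, so no ROI can be classed as foreground or background. The sketch below illustrates this with plain Python floats. split_fg_bg and its thresholds are hypothetical stand-ins, not the code in proposal_target_layer.py.

```python
import math


def split_fg_bg(max_overlaps, fg_thresh=0.55, bg_thresh=0.1):
    """Partition ROI indices by IoU.

    Simplified illustration: the threshold values are assumptions,
    not values read from the nuScenes config.
    """
    fg = [i for i, iou in enumerate(max_overlaps) if iou >= fg_thresh]
    bg = [i for i, iou in enumerate(max_overlaps) if iou < bg_thresh]
    return fg, bg


healthy = [0.8, 0.05, 0.3]
broken = [math.nan, math.nan, math.nan]

# With finite IoUs some ROIs land in each group.
print(split_fg_bg(healthy))
# With NaN IoUs both comparisons are False, so both groups are empty.
print(split_fg_bg(broken))
```

If that reading is right, the place to look is wherever the NaN first enters (inputs, box predictions, or the loss), not the sampling code itself.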

github-actions[bot] commented 2 years ago

This issue is stale because it has been open for 30 days with no activity.

github-actions[bot] commented 2 years ago

This issue was closed because it has been inactive for 14 days since being marked as stale.

Camellia-hz commented 1 year ago

@JianyyuWang Hi, I have the same problem. Have you found a solution yet?

OuYaozhong commented 1 year ago

Hi, @JianyyuWang @zbaishancha @sshaoshuai @jihanyang @chenshi3, I ran into this problem too.

  1. The original learning rate in tools/cfgs/nuscenes_models/cbgs_voxel0075_res3d_centerpoint.yaml is 0.001.

I reduced it by 100x to 0.00001, but the NaN problem still occurs.

The loss becomes NaN, and then TensorBoard reports "NaN or Inf found in input tensor".

I have tested and confirmed that the problem is unrelated to whether multiple GPUs are used.

I cannot understand what causes this problem, or why reducing the learning rate by 100x does not fix it.

  2. Meanwhile, the program often hangs at random, with the message "collective operations have failed with timeout" and 100% GPU utilisation.

It seems to hang most often at the dist.all_gather() operation.

Could anybody help fix these two problems?
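One possible reason a 100x smaller learning rate makes no difference: once a gradient is already NaN, the step size no longer matters, because NaN times any number is still NaN. The sketch below shows this with a plain SGD update on Python floats. sgd_step is a hypothetical helper for illustration only, and the optimizer actually used in the config may differ.

```python
import math


def sgd_step(weight, grad, lr):
    """One plain SGD update: w <- w - lr * grad (illustrative only)."""
    return weight - lr * grad


# A NaN gradient poisons the weight for every learning rate tried.
for lr in (1e-3, 1e-5, 1e-12):
    print(lr, sgd_step(0.5, math.nan, lr))
```

So a smaller learning rate can only help when the NaN comes from divergence over many steps. It cannot help when a single batch or a single loss term produces NaN directly.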

chenshi3 commented 1 year ago

It appears that the loss becomes NaN as soon as the training procedure begins. You might need to try debugging the code on a single GPU and monitor the loss calculation.
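As a rough sketch of what monitoring the loss calculation could look like: the training loop already receives a tb_dict of logged scalars from model_func, so one option is to check which entries are non-finite on the first bad iteration. nonfinite_terms and the key names below are hypothetical, and real entries may be tensors that need converting with .item() first.

```python
import math


def nonfinite_terms(tb_dict):
    """Return the names of logged scalars that are NaN or +/-Inf."""
    return [
        name
        for name, value in tb_dict.items()
        if isinstance(value, (int, float)) and not math.isfinite(value)
    ]


# Hypothetical key names, chosen only for the example.
tb_dict = {"cls_loss": 1.9, "reg_loss": float("nan"), "total": float("nan")}
print(nonfinite_terms(tb_dict))
```

Knowing which term goes bad first narrows the search to one loss function and the targets it consumes.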

OuYaozhong commented 1 year ago

> It appears that the loss becomes NaN as soon as the training procedure begins. You might need to try debugging the code on a single GPU and monitor the loss calculation.

Hi, @chenshi3

In my case, that is not what happens.

  1. I pulled the code from GitHub without modification, apart from some bug fixes needed to make it run on my machine and environment.
  2. It is true that the NaN occurs right at the beginning about 50% of the time. But sometimes training runs normally up to 75% of the first epoch and only then turns to NaN loss.
  3. I also tried a single GPU, and it produced the same NaN loss.

I am using OpenPCDet with the nuScenes dataset and CenterPoint.

Furthermore, I have tried the original CenterPoint released by its authors. The same problem seems to occur with the default settings.

My environment:

I run on 2 x RTX 3090, using torch.distributed.run to launch Distributed Data Parallel training with the command

scripts/dist_train.sh 2 --cfg_file ./cfgs/nuscenes_models/cbgs_voxel0075_res3d_centerpoint.yaml --autoscale-lr --find_unused_parameters --stop-if-nan

scripts/dist_train.sh 2 means using 2 GPUs. --autoscale-lr means scaling the learning rate with the number of GPUs: actual lr = cfg.lr * number of GPUs. The other flags are self-explanatory.
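For concreteness, the scaling rule described for --autoscale-lr works out as follows. autoscaled_lr is a hypothetical helper that only restates the formula above; it is not a function from the repository.

```python
def autoscaled_lr(base_lr, num_gpus):
    """Linear scaling: actual lr = configured lr * number of GPUs."""
    return base_lr * num_gpus


# Default config value and the 100x reduced value, both on 2 GPUs.
print(autoscaled_lr(0.001, 2))
print(autoscaled_lr(0.00001, 2))
```

With two GPUs the effective rate is only doubled, so the reduced setting still ends up 100x below the default.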

$ nvidia-smi
Mon Jul 17 11:47:10 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3090        Off | 00000000:67:00.0 Off |                  N/A |
| 61%   54C    P2             142W / 350W |  10617MiB / 24576MiB |    100%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090        Off | 00000000:68:00.0 Off |                  N/A |
| 79%   59C    P2             154W / 350W |  10006MiB / 24576MiB |    100%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1644      G   /usr/lib/xorg/Xorg                            4MiB |
|    0   N/A  N/A    875026      C   ........./.conda/envs/pcdet/bin/python    10600MiB |
|    1   N/A  N/A      1644      G   /usr/lib/xorg/Xorg                           14MiB |
|    1   N/A  N/A    875027      C   ........./.conda/envs/pcdet/bin/python     9978MiB |
+---------------------------------------------------------------------------------------+
$ conda list | grep torch
ffmpeg                    4.3                  hf484d3e_0    pytorch
pytorch                   2.0.1           py3.11_cuda11.8_cudnn8.7.0_0    pytorch
pytorch-cuda              11.8                 h7e8668a_5    pytorch
pytorch-mutex             1.0                        cuda    pytorch
torchaudio                2.0.2               py311_cu118    pytorch
torchtriton               2.0.0                     py311    pytorch
torchvision               0.15.2              py311_cu118    pytorch
$ echo $CUDA_HOME
/usr/local/cuda-11.8/

----------------------------------------------------------------------
By the way, if the program runs normally and no NaN loss occurs, another problem appears: it may get stuck in a collective operation and eventually fail with a timeout error, as shown below.

2023-07-17 11:35:17,322   INFO  Train:    2/20 ( 10%) [10551/15448 ( 68%)]  Loss: 15.54 (15.9)  LR: 3.895e-06  Time cost: 1:45:26/51:06 [1:48:45/49:13:35]  Acc_iter 26000       Data time: 0.02(0.02)  Forward time: 0.65(0.61)  Batch time: 0.67(0.63)
2023-07-17 11:35:48,365   INFO  Train:    2/20 ( 10%) [10601/15448 ( 69%)]  Loss: 15.26 (15.9)  LR: 3.902e-06  Time cost: 1:45:57/50:35 [1:49:16/49:12:56]  Acc_iter 26050       Data time: 0.03(0.02)  Forward time: 0.61(0.61)  Batch time: 0.65(0.63)
[E ProcessGroupNCCL.cpp:828] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=122360, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1807840 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=122359, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1807806 milliseconds before timing out.
AI-3090:875027:875055 [0] NCCL INFO comm 0x22eecff0 rank 1 nranks 2 cudaDev 1 busId 68000 - Abort COMPLETE
Traceback (most recent call last):                         
  File "/home/$USER/tmp/OpenPCDet/tools/train.py", line 239, in <module>
    main()
  File "/home/$USER/tmp/OpenPCDet/tools/train.py", line 183, in main
    train_model(
  File "/home/$USER/tmp/OpenPCDet/tools/train_utils/train_utils.py", line 183, in train_model
    accumulated_iter = train_one_epoch(
                       ^^^^^^^^^^^^^^^^
  File "/home/$USER/tmp/OpenPCDet/tools/train_utils/train_utils.py", line 71, in train_one_epoch
    avg_data_time = commu_utils.average_reduce_value(cur_data_time)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/$USER/tmp/OpenPCDet/tools/../pcdet/utils/commu_utils.py", line 144, in average_reduce_value
    data_list = all_gather(data)
                ^^^^^^^^^^^^^^^^
  File "/home/$USER/tmp/OpenPCDet/tools/../pcdet/utils/commu_utils.py", line 77, in all_gather
    dist.all_gather(size_list, local_size)
  File "/home/$USER/.conda/envs/pcdet/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1451, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/$USER/.conda/envs/pcdet/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2448, in all_gather
    work = default_pg.allgather([tensor_list], [tensor])
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: NCCL communicator was aborted on rank 1.  Original reason for failure was: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=122360, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1807840 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 875026 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 1 (pid: 875027) of binary: /home/$USER/.conda/envs/pcdet/bin/python
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/$USER/.conda/envs/pcdet/lib/python3.11/site-packages/torch/distributed/run.py", line 798, in <module>
    main()
  File "/home/$USER/.conda/envs/pcdet/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/$USER/.conda/envs/pcdet/lib/python3.11/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/home/$USER/.conda/envs/pcdet/lib/python3.11/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/$USER/.conda/envs/pcdet/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/$USER/.conda/envs/pcdet/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
=======================================================
train.py FAILED
-------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-07-17_12:06:32
  host      : AI-3090
  rank      : 1 (local_rank: 1)
  exitcode  : -6 (pid: 875027)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 875027
=======================================================
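One possible reading of this log, not confirmed anywhere in the thread: the two ranks time out at different sequence numbers (122360 on rank 1, 122359 on rank 0), which would be consistent with the ranks having issued a different number of collective calls. A collective only completes when every rank joins it, so a rank that skips one leaves the other waiting until the watchdog fires. The sketch below is a standard-library analogy using threading.Barrier in place of NCCL. run_ranks is hypothetical and no torch.distributed call is involved.

```python
import threading


def run_ranks(skip_rank=None, timeout=0.5):
    """Simulate two 'ranks' meeting at a collective; return each rank's outcome."""
    barrier = threading.Barrier(2)
    results = {}

    def worker(rank):
        if rank == skip_rank:
            # This rank takes a different code path and never joins.
            results[rank] = "left early"
            return
        try:
            barrier.wait(timeout=timeout)
            results[rank] = "ok"
        except threading.BrokenBarrierError:
            # Raised when the wait times out because a peer never arrived.
            results[rank] = "timed out"

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


print(run_ranks())             # both ranks arrive
print(run_ranks(skip_rank=1))  # rank 0 waits, then times out
```

If something similar happens here, it would be worth checking whether any per-rank branch (for example an early exit or a skipped batch on one GPU) can bypass a reduce or gather call that the other rank still makes.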
OuYaozhong commented 1 year ago

Hi, @chenshi3, I captured some debug output for this NaN issue using torch.autograd.set_detect_anomaly(True).

I am using DDP with the command torchrun --nnodes=1 --nproc-per-node=2 train.py --launcher pytorch --cfg_file ./cfgs/nuscenes_models/cbgs_voxel0075_res3d_centerpoint.yaml --backend nccl --stop-if-nan

With 2 x RTX-3090, the error was thrown by GPU 0.

/home/me/.conda/envs/pcdet/lib/python3.11/site-packages/torch/autograd/__init__.py:200: UserWarning: Error detected in ConvolutionBackward0. Traceback of forward call that caused the error:
  File "/home/me/tmp/OpenPCDet/tools/train.py", line 246, in <module>
    main()
  File "/home/me/.conda/envs/pcdet/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/me/tmp/OpenPCDet/tools/train.py", line 190, in main
    train_model(
  File "/home/me/tmp/OpenPCDet/tools/train_utils/train_utils.py", line 225, in train_model
    accumulated_iter = train_one_epoch(
  File "/home/me/tmp/OpenPCDet/tools/train_utils/train_utils.py", line 81, in train_one_epoch
    loss, tb_dict, disp_dict = model_func(model, batch)
  File "/home/me/tmp/OpenPCDet/tools/../pcdet/models/__init__.py", line 44, in model_func
    ret_dict, tb_dict, disp_dict = model(batch_dict)
  File "/home/me/.conda/envs/pcdet/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/me/.conda/envs/pcdet/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 1156, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/home/me/.conda/envs/pcdet/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 1110, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])  # type: ignore[index]
  File "/home/me/.conda/envs/pcdet/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/me/tmp/OpenPCDet/tools/../pcdet/models/detectors/centerpoint.py", line 14, in forward
    batch_dict = cur_module(batch_dict)
  File "/home/me/.conda/envs/pcdet/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/me/tmp/OpenPCDet/tools/../pcdet/models/dense_heads/center_head.py", line 391, in forward
    pred_dicts.append(head(x))
  File "/home/me/.conda/envs/pcdet/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/me/tmp/OpenPCDet/tools/../pcdet/models/dense_heads/center_head.py", line 44, in forward
    ret_dict[cur_name] = self.__getattr__(cur_name)(x)
  File "/home/me/.conda/envs/pcdet/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/me/.conda/envs/pcdet/lib/python3.11/site-packages/torch/nn/modules/container.py", line 217, in forward
    input = module(input)
  File "/home/me/.conda/envs/pcdet/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/me/.conda/envs/pcdet/lib/python3.11/site-packages/torch/nn/modules/conv.py", line 463, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/me/.conda/envs/pcdet/lib/python3.11/site-packages/torch/nn/modules/conv.py", line 459, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
 (Triggered internally at /opt/conda/conda-bld/pytorch_1682343995622/work/torch/csrc/autograd/python_anomaly_mode.cpp:114.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass

It shows that the problem occurs in the dense_head. Since I have not gone through the code and just want to reproduce the results posted in README.md, I need help understanding what happens at this point in the code to produce the NaN output.
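The anomaly trace names the backward pass of a convolution in center_head.py, but it does not show whether the forward values were already non-finite at that point. Since the detector's forward loop runs `batch_dict = cur_module(batch_dict)` for each module in turn, one way to narrow it down is to check batch_dict after every stage and stop at the first one that produces NaN or Inf. The sketch below uses plain Python lists as stand-ins for tensors. first_bad_stage, backbone, and dense_head are hypothetical, and a real check would use torch.isfinite on the tensors instead.

```python
import math


def first_bad_stage(stages, batch_dict):
    """Run stages in order; return (stage_name, key) of the first non-finite value."""
    for name, stage in stages:
        batch_dict = stage(batch_dict)
        for key, values in batch_dict.items():
            if any(not math.isfinite(v) for v in values):
                return name, key
    return None


def backbone(d):
    # Toy stage: produces finite features.
    return {**d, "features": [v * 2.0 for v in d["points"]]}


def dense_head(d):
    # Toy stage: deliberately overflows to inf to mimic a head that blows up.
    return {**d, "heatmap": [v * 1e308 for v in d["features"]]}


stages = [("backbone", backbone), ("dense_head", dense_head)]
print(first_bad_stage(stages, {"points": [1.0, 3.0]}))
```

If every forward output turns out to be finite, the NaN is more likely introduced in the loss or its targets and only surfaces in the convolution's backward pass.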