robin2002 commented 5 years ago

Hi @traveller59 ,

Thanks for this wonderful implementation. I was trying to train a model using two GPUs, but it did not work (index out of bounds error).
script: CUDA_VISIBLE_DEVICES=0,1 python ./pytorch/train.py train --config_path=./configs/car.fhd.config --model_dir=/data/second/model --multi_gpu=True

Error message:
Traceback (most recent call last):
  File "./pytorch/train.py", line 509, in <module>
    train(config_path=config_path, model_dir=model_dir, multi_gpu=multi_gpu)
  File "./pytorch/train.py", line 369, in train
    raise e
  File "./pytorch/train.py", line 257, in train
    ret_dict = net_parallel(example_torch)
  File "/home/robin2002/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/robin2002/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 143, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/robin2002/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 153, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/robin2002/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 83, in parallel_apply
    raise output
  File "/home/robin2002/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 59, in _worker
    output = module(*input, **kwargs)
  File "/home/robin2002/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/media/ssd_2tb/myGit/second.pytorch/second/pytorch/models/voxelnet.py", line 252, in forward
    voxel_list.append(voxels[i, :num_voxel])
IndexError: index 6 is out of bounds for dimension 0 with size 6
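The statement that fails in the traceback above, `voxel_list.append(voxels[i, :num_voxel])`, indexes the batch dimension of `voxels` with a counter that comes from another sequence. One way an error of this kind can arise (an assumption, not a confirmed diagnosis of this issue) is that the sequence of per-sample voxel counts is longer than the batch dimension of `voxels`. The sketch below uses plain nested Python lists in place of tensors, and `split_padded_voxels` is a hypothetical helper that is not part of second.pytorch. It only illustrates a length check that reports such a mismatch explicitly.

```python
def split_padded_voxels(voxels, num_voxels):
    """Return the valid (unpadded) voxels of each sample.

    voxels: list of per-sample lists, padded to a common length.
    num_voxels: number of valid voxels in each sample.
    """
    # Fail early with a clear message if the two inputs disagree on batch size.
    if len(num_voxels) != len(voxels):
        raise ValueError(
            f"got {len(num_voxels)} voxel counts for a batch of {len(voxels)} samples"
        )
    return [voxels[i][:n] for i, n in enumerate(num_voxels)]


# Two samples padded to length 3; the first has 2 valid voxels, the second has 1.
batch = [[1, 2, 0], [3, 0, 0]]
print(split_padded_voxels(batch, [2, 1]))  # [[1, 2], [3]]
```

Passing three counts for the two-sample batch raises the `ValueError` instead of failing later inside the loop with an index error.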

traveller59 commented 5 years ago

Could you provide the log.txt in model_dir? I can't test multi-gpu code, so I need your help to fix this kind of bug.

robin2002 commented 5 years ago

I checked the log.txt file. It contains only the model configuration, so I am pasting the full output I get when I run the training script:

(base) robin2002@robin2002-desktop:/media/ssd_2tb/myGit/second.pytorch/second$ CUDA_VISIBLE_DEVICES=0,1 python ./pytorch/train.py train --config_path=./configs/all.lite.nu.anchor_revised.config --model_dir=/media/ssd_2tb/myGit/second.pytorch/second/model/20190412_nu_all --multi_gpu=True
[ 41 2000 2000]
num_trainable parameters: 39
False _amp_stash
MULTI-GPU: use 2 gpu
feature_map_size [1, 250, 250]
feature_map_size [1, 250, 250]
WORKER 0 seed: 1555059006
WORKER 1 seed: 1555059007
WORKER 2 seed: 1555059008
model: {
second: {
voxel_generator {
  point_cloud_range : [-50, -50.0, -4, 50, 50, 2]
  voxel_size : [0.05, 0.05, 0.15]
  max_number_of_points_per_voxel : 1
}

voxel_feature_extractor: {
  module_class_name: "SimpleVoxelRadius"
  num_filters: [16]
  with_distance: false
  num_input_features: 3
}
middle_feature_extractor: {
  module_class_name: "SpMiddleFHDLite"
  # num_filters_down1: [] # protobuf don't support empty list.
  # num_filters_down2: []
  downsample_factor: 8
  num_input_features: 2
}
rpn: {
  module_class_name: "RPNV2"
  layer_nums: [5]
  layer_strides: [1]
  num_filters: [128]
  upsample_strides: [1]
  num_upsample_filters: [128]
  use_groupnorm: false
  num_groups: 32
  num_input_features: 128
}
loss: {
  classification_loss: {
    weighted_sigmoid_focal: {
      alpha: 0.25
      gamma: 2.0
      anchorwise_output: true
    }
  }
  localization_loss: {
    weighted_smooth_l1: {
      sigma: 3.0
      code_weight: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
    }
  }
  classification_weight: 1.0
  localization_weight: 2.0
}
num_point_features: 3 # model's num point feature should be independent of dataset
# Outputs
use_sigmoid_score: true
encode_background_as_zeros: true
encode_rad_error_by_sin: true

use_direction_classifier: true # this can help for orientation benchmark
direction_loss_weight: 0.2 # enough.

# Loss
pos_class_weight: 1.0
neg_class_weight: 1.0

loss_norm_type: NormByNumPositives
# Postprocess
post_center_limit_range: [-50, -50, -4.0, 50, 50, 1.0]
use_rotate_nms: true
use_multi_class_nms: false
nms_pre_max_size: 1000
nms_post_max_size: 100
nms_score_threshold: 0.05 # 0.4 in submit, but 0.3 can get better hard performance
nms_iou_threshold: 0.5

box_coder: {
  ground_box3d_coder: {
    linear_dim: false
    encode_angle_vector: false
  }
}
target_assigner: {
  anchor_generators: {
    anchor_generator_range: {
      sizes: [1.95017717, 4.60718145, 1.72270761] # wlh
      anchor_ranges: [-50, -50.0, -0.93897414, 50, 50, -0.93897414]
      rotations: [0, 1.57] # DON'T modify this unless you are very familiar with my code.
      matched_threshold : 0.6
      unmatched_threshold : 0.45
      class_name: "car"
    }
  }
  anchor_generators: {
    anchor_generator_range: {
      sizes: [0.66344886, 0.7256437, 1.75748069] # wlh
      anchor_ranges: [-50, -50.0, -0.73911038, 50, 50, -0.73911038]
      rotations: [0, 1.57] # DON'T modify this unless you are very familiar with my code.
      matched_threshold : 0.35
      unmatched_threshold : 0.2
      class_name: "pedestrian"
    }
  }
  anchor_generators: {
    anchor_generator_range: {
      sizes: [0.60058911, 1.68452161, 1.27192197] # wlh
      anchor_ranges: [-50, -50.0, -1.03743013, 50, 50, -1.03743013]
      rotations: [0, 1.57] # DON'T modify this unless you are very familiar with my code.
      matched_threshold : 0.35
      unmatched_threshold : 0.2
      class_name: "bicycle"
    }
  }
  anchor_generators: {
    anchor_generator_range: {
      sizes: [0.76279481, 2.09973778, 1.44403034] # wlh
      anchor_ranges: [-50, -50.0, -0.99194854, 50, 50, -0.99194854]
      rotations: [0, 1.57] # DON'T modify this unless you are very familiar with my code.
      matched_threshold : 0.35
      unmatched_threshold : 0.2
      class_name: "motorcycle"
    }
  }
  anchor_generators: {
    anchor_generator_range: {
      sizes: [2.4560939, 6.73778078, 2.73004906] # wlh
      anchor_ranges: [-50, -50.0, -0.37937912, 50, 50, -0.37937912]
      rotations: [0, 1.57] # DON'T modify this unless you are very familiar with my code.
      matched_threshold : 0.6
      unmatched_threshold : 0.45
      class_name: "truck"
    }
  }
  anchor_generators: {
    anchor_generator_range: {
      sizes: [2.94046906, 11.1885991, 3.47030982] # wlh
      anchor_ranges: [-50, -50.0, -0.0715754, 50, 50, -0.0715754]
      rotations: [0, 1.57] # DON'T modify this unless you are very familiar with my code.
      matched_threshold : 0.6
      unmatched_threshold : 0.45
      class_name: "bus"
    }
  }

  sample_positive_fraction : -1
  sample_size : 512
  region_similarity_calculator: {
    nearest_iou_similarity: {
    }
  }
}

} }

train_input_reader: {
dataset: {
dataset_class_name: "NuScenesDataset"
kitti_info_path: "/media/ssd_2tb/dataset/nuscene/infos_train.pkl"
kitti_root_path: "/media/ssd_2tb/dataset/nuscene"

kitti_info_path: "/media/yy/960evo/datasets/nuscene/v1.0-mini/infos_train.pkl"

# kitti_root_path: "/media/yy/960evo/datasets/nuscene/v1.0-mini"

}

batch_size: 6
preprocess: {
max_number_of_voxels: 63000
shuffle_points: false
num_workers: 3
groundtruth_localization_noise_std: [0, 0, 0]
groundtruth_rotation_uniform_noise: [0, 0]

# groundtruth_localization_noise_std: [0.25, 0.25, 0.25]
# groundtruth_rotation_uniform_noise: [-0.3141592654, 0.3141592654]
# groundtruth_rotation_uniform_noise: [-0.78539816, 0.78539816]
global_rotation_uniform_noise: [-1.57, 1.57]
global_scaling_uniform_noise: [0.95, 1.05]
global_random_rotation_range_per_object: [0, 0] # pi/4 ~ 3pi/4
global_translate_noise_std: [0.2, 0.2, 0.2]
anchor_area_threshold: -1
remove_points_after_sample: true
groundtruth_points_drop_percentage: 0.0
groundtruth_drop_max_keep_points: 15
remove_unknown_examples: false
remove_environment: false
database_sampler {
  # leave this empty to disable database_sampler, nuscenes don't need sample
  # because 1. the number of ground-truth is enough. 2. sweeps don't support 
  # sample.
}

} }

train_config: {
optimizer: {
  adam_optimizer: {
    learning_rate: {
      one_cycle: {
        lr_max: 3e-3
        moms: [0.95, 0.85]
        div_factor: 10.0
        pct_start: 0.4
      }
    }
    weight_decay: 0.01
  }
  fixed_weight_decay: true
  use_moving_average: false
}
steps: 234450 # 4689 * 50 (28130 // 6 + 1)
steps_per_eval: 9378 # 4689 * 2

steps_per_eval: 500 # 4689 * 2

save_checkpoints_secs : 1800 # half hour
save_summary_steps : 10
enable_mixed_precision: false
loss_scale_factor: 8.0
clear_metrics_every_epoch: true
}

eval_input_reader: {
  batch_size: 6
  dataset: {
    dataset_class_name: "NuScenesDataset"
    kitti_info_path: "/media/ssd_2tb/dataset/nuscene/infos_val.pkl"
    kitti_root_path: "/media/ssd_2tb/dataset/nuscene"
  }
  preprocess: {
    max_number_of_voxels: 80000
    shuffle_points: false
    num_workers: 3
    anchor_area_threshold: -1
    remove_environment: false
  }
}

Traceback (most recent call last):
  File "./pytorch/train.py", line 505, in <module>
    fire.Fire()
  File "/home/robin2002/anaconda3/lib/python3.6/site-packages/fire/core.py", line 127, in Fire
    component_trace = _Fire(component, args, context, name)
  File "/home/robin2002/anaconda3/lib/python3.6/site-packages/fire/core.py", line 366, in _Fire
    component, remaining_args)
  File "/home/robin2002/anaconda3/lib/python3.6/site-packages/fire/core.py", line 542, in _CallCallable
    result = fn(*varargs, **kwargs)
  File "./pytorch/train.py", line 369, in train
    raise e
  File "./pytorch/train.py", line 257, in train
    ret_dict = net_parallel(example_torch)
  File "/home/robin2002/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/robin2002/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 143, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/robin2002/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 153, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/robin2002/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 83, in parallel_apply
    raise output
  File "/home/robin2002/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 59, in _worker
    output = module(*input, **kwargs)
  File "/home/robin2002/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/media/ssd_2tb/myGit/second.pytorch/second/pytorch/models/voxelnet.py", line 252, in forward
    voxel_list.append(voxels[i, :num_voxel])
IndexError: index 6 is out of bounds for dimension 0 with size 6
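For reference, the step counts in the `train_config` printed above appear to follow from the dataset size and the batch size. The sketch below redoes that arithmetic with the numbers from the config comments. The variable names are illustrative, and reading the two multipliers as "50 epochs" and "evaluate every 2 epochs" is an assumption about what the comments mean.

```python
# Numbers taken from the pasted config; the names are illustrative only.
num_train_samples = 28130   # from the "(28130 // 6 + 1)" comment
batch_size = 6              # train_input_reader batch_size
num_epochs = 50             # assumed meaning of the factor 50
eval_every_epochs = 2       # assumed meaning of the factor 2

# One extra step covers the final, partially filled batch.
steps_per_epoch = num_train_samples // batch_size + 1
total_steps = steps_per_epoch * num_epochs
steps_per_eval = steps_per_epoch * eval_every_epochs

print(steps_per_epoch, total_steps, steps_per_eval)  # 4689 234450 9378
```

These values match `steps: 234450` and `steps_per_eval: 9378` in the config, so changing the batch size or dataset would require recomputing both.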

traveller59 commented 5 years ago

Try the newest code. I don't know whether this bug has been fixed.

robin2002 commented 5 years ago

I updated the source code, but another error occurred.

(base) robin2002@robin2002-desktop:/media/ssd_2tb/myGit/second.pytorch/second$ CUDA_VISIBLE_DEVICES=0,1 python ./pytorch/train.py train --config_path=./configs/all.lite.nu.anchor_revised.config --model_dir=/media/ssd_2tb/myGit/second.pytorch/second/model/20190412_nu_all --multi_gpu=True
[ 41 2000 2000]
num_trainable parameters: 39
False _amp_stash
MULTI-GPU: use 2 gpu
feature_map_size [1, 250, 250]
feature_map_size [1, 250, 250]
WORKER 0 seed: 1555069348
WORKER 1 seed: 1555069350
WORKER 2 seed: 1555069351
model: {
second: {
voxel_generator {
  point_cloud_range : [-50, -50.0, -4, 50, 50, 2]
  voxel_size : [0.05, 0.05, 0.15]
  max_number_of_points_per_voxel : 1
}

voxel_feature_extractor: {
  module_class_name: "SimpleVoxelRadius"
  num_filters: [16]
  with_distance: false
  num_input_features: 3
}
middle_feature_extractor: {
  module_class_name: "SpMiddleFHDLite"
  # num_filters_down1: [] # protobuf don't support empty list.
  # num_filters_down2: []
  downsample_factor: 8
  num_input_features: 2
}
rpn: {
  module_class_name: "RPNV2"
  layer_nums: [5]
  layer_strides: [1]
  num_filters: [128]
  upsample_strides: [1]
  num_upsample_filters: [128]
  use_groupnorm: false
  num_groups: 32
  num_input_features: 128
}
loss: {
  classification_loss: {
    weighted_sigmoid_focal: {
      alpha: 0.25
      gamma: 2.0
      anchorwise_output: true
    }
  }
  localization_loss: {
    weighted_smooth_l1: {
      sigma: 3.0
      code_weight: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
    }
  }
  classification_weight: 1.0
  localization_weight: 2.0
}
num_point_features: 3 # model's num point feature should be independent of dataset
# Outputs
use_sigmoid_score: true
encode_background_as_zeros: true
encode_rad_error_by_sin: true

use_direction_classifier: true # this can help for orientation benchmark
direction_loss_weight: 0.2 # enough.

# Loss
pos_class_weight: 1.0
neg_class_weight: 1.0

loss_norm_type: NormByNumPositives
# Postprocess
post_center_limit_range: [-50, -50, -4.0, 50, 50, 1.0]
use_rotate_nms: true
use_multi_class_nms: false
nms_pre_max_size: 1000
nms_post_max_size: 100
nms_score_threshold: 0.05 # 0.4 in submit, but 0.3 can get better hard performance
nms_iou_threshold: 0.5

box_coder: {
  ground_box3d_coder: {
    linear_dim: false
    encode_angle_vector: false
  }
}
target_assigner: {
  anchor_generators: {
    anchor_generator_range: {
      sizes: [1.95017717, 4.60718145, 1.72270761] # wlh
      anchor_ranges: [-50, -50.0, -0.93897414, 50, 50, -0.93897414]
      rotations: [0, 1.57] # DON'T modify this unless you are very familiar with my code.
      matched_threshold : 0.6
      unmatched_threshold : 0.45
      class_name: "car"
    }
  }
  anchor_generators: {
    anchor_generator_range: {
      sizes: [0.66344886, 0.7256437, 1.75748069] # wlh
      anchor_ranges: [-50, -50.0, -0.73911038, 50, 50, -0.73911038]
      rotations: [0, 1.57] # DON'T modify this unless you are very familiar with my code.
      matched_threshold : 0.35
      unmatched_threshold : 0.2
      class_name: "pedestrian"
    }
  }
  anchor_generators: {
    anchor_generator_range: {
      sizes: [0.60058911, 1.68452161, 1.27192197] # wlh
      anchor_ranges: [-50, -50.0, -1.03743013, 50, 50, -1.03743013]
      rotations: [0, 1.57] # DON'T modify this unless you are very familiar with my code.
      matched_threshold : 0.35
      unmatched_threshold : 0.2
      class_name: "bicycle"
    }
  }
  anchor_generators: {
    anchor_generator_range: {
      sizes: [0.76279481, 2.09973778, 1.44403034] # wlh
      anchor_ranges: [-50, -50.0, -0.99194854, 50, 50, -0.99194854]
      rotations: [0, 1.57] # DON'T modify this unless you are very familiar with my code.
      matched_threshold : 0.35
      unmatched_threshold : 0.2
      class_name: "motorcycle"
    }
  }
  anchor_generators: {
    anchor_generator_range: {
      sizes: [2.4560939, 6.73778078, 2.73004906] # wlh
      anchor_ranges: [-50, -50.0, -0.37937912, 50, 50, -0.37937912]
      rotations: [0, 1.57] # DON'T modify this unless you are very familiar with my code.
      matched_threshold : 0.6
      unmatched_threshold : 0.45
      class_name: "truck"
    }
  }
  anchor_generators: {
    anchor_generator_range: {
      sizes: [2.94046906, 11.1885991, 3.47030982] # wlh
      anchor_ranges: [-50, -50.0, -0.0715754, 50, 50, -0.0715754]
      rotations: [0, 1.57] # DON'T modify this unless you are very familiar with my code.
      matched_threshold : 0.6
      unmatched_threshold : 0.45
      class_name: "bus"
    }
  }

  sample_positive_fraction : -1
  sample_size : 512
  region_similarity_calculator: {
    nearest_iou_similarity: {
    }
  }
}

} }

train_input_reader: {
dataset: {
dataset_class_name: "NuScenesDataset"
kitti_info_path: "/media/ssd_2tb/dataset/nuscene/infos_train.pkl"
kitti_root_path: "/media/ssd_2tb/dataset/nuscene"

kitti_info_path: "/media/yy/960evo/datasets/nuscene/v1.0-mini/infos_train.pkl"

# kitti_root_path: "/media/yy/960evo/datasets/nuscene/v1.0-mini"

}

batch_size: 6
preprocess: {
max_number_of_voxels: 63000
shuffle_points: false
num_workers: 3
groundtruth_localization_noise_std: [0, 0, 0]
groundtruth_rotation_uniform_noise: [0, 0]

# groundtruth_localization_noise_std: [0.25, 0.25, 0.25]
# groundtruth_rotation_uniform_noise: [-0.3141592654, 0.3141592654]
# groundtruth_rotation_uniform_noise: [-0.78539816, 0.78539816]
global_rotation_uniform_noise: [-1.57, 1.57]
global_scaling_uniform_noise: [0.95, 1.05]
global_random_rotation_range_per_object: [0, 0] # pi/4 ~ 3pi/4
global_translate_noise_std: [0.2, 0.2, 0.2]
anchor_area_threshold: -1
remove_points_after_sample: true
groundtruth_points_drop_percentage: 0.0
groundtruth_drop_max_keep_points: 15
remove_unknown_examples: false
remove_environment: false
database_sampler {
  # leave this empty to disable database_sampler, nuscenes don't need sample
  # because 1. the number of ground-truth is enough. 2. sweeps don't support 
  # sample.
}

} }

train_config: {
optimizer: {
  adam_optimizer: {
    learning_rate: {
      one_cycle: {
        lr_max: 3e-3
        moms: [0.95, 0.85]
        div_factor: 10.0
        pct_start: 0.4
      }
    }
    weight_decay: 0.01
  }
  fixed_weight_decay: true
  use_moving_average: false
}
steps: 234450 # 4689 * 50 (28130 // 6 + 1)
steps_per_eval: 9378 # 4689 * 2

steps_per_eval: 500 # 4689 * 2

save_checkpoints_secs : 1800 # half hour
save_summary_steps : 10
enable_mixed_precision: false
loss_scale_factor: 8.0
clear_metrics_every_epoch: true
}

eval_input_reader: {
  batch_size: 6
  dataset: {
    dataset_class_name: "NuScenesDataset"
    kitti_info_path: "/media/ssd_2tb/dataset/nuscene/infos_val.pkl"
    kitti_root_path: "/media/ssd_2tb/dataset/nuscene"
  }
  preprocess: {
    max_number_of_voxels: 80000
    shuffle_points: false
    num_workers: 3
    anchor_area_threshold: -1
    remove_environment: false
  }
}

THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1550802451070/work/aten/src/THC/generic/THCTensorMath.cu line=14 error=77 : an illegal memory access was encountered
Traceback (most recent call last):
  File "./pytorch/train.py", line 259, in train
    ret_dict = net_parallel(example_torch)
  File "/home/robin2002/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/robin2002/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 143, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/robin2002/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 153, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/robin2002/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 83, in parallel_apply
    raise output
  File "/home/robin2002/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 59, in _worker
    output = module(*input, **kwargs)
  File "/home/robin2002/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/media/ssd_2tb/myGit/second.pytorch/second/pytorch/models/voxelnet.py", line 274, in forward
    voxel_features, coors, batch_size_dev)
  File "/home/robin2002/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/media/ssd_2tb/myGit/second.pytorch/second/pytorch/models/middle.py", line 1021, in forward
    ret = self.middle_conv(ret)
  File "/home/robin2002/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/robin2002/anaconda3/lib/python3.6/site-packages/spconv/modules.py", line 123, in forward
    input = module(input)
  File "/home/robin2002/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/robin2002/anaconda3/lib/python3.6/site-packages/spconv/conv.py", line 155, in forward
    self.stride, self.padding, self.dilation, self.output_padding, self.subm, self.transposed, grid=input.grid)
  File "/home/robin2002/anaconda3/lib/python3.6/site-packages/spconv/ops.py", line 89, in get_indice_pairs
    stride, padding, dilation, out_padding, int(subm), int(transpose))
RuntimeError: cuda runtime error (77) : an illegal memory access was encountered at /opt/conda/conda-bld/pytorch_1550802451070/work/aten/src/THC/generic/THCTensorMath.cu:14

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./pytorch/train.py", line 507, in <module>
    fire.Fire()
  File "/home/robin2002/anaconda3/lib/python3.6/site-packages/fire/core.py", line 127, in Fire
    component_trace = _Fire(component, args, context, name)
  File "/home/robin2002/anaconda3/lib/python3.6/site-packages/fire/core.py", line 366, in _Fire
    component, remaining_args)
  File "/home/robin2002/anaconda3/lib/python3.6/site-packages/fire/core.py", line 542, in _CallCallable
    result = fn(*varargs, **kwargs)
  File "./pytorch/train.py", line 370, in train
    net.get_global_step())
  File "/media/ssd_2tb/myGit/second.pytorch/second/pytorch/models/voxelnet.py", line 235, in get_global_step
    return int(self.global_step.cpu().numpy()[0])
RuntimeError: CUDA error: an illegal memory access was encountered

traveller59 commented 5 years ago

Could you change some code in test_conv.py in spconv and run it?

change "cuda:0" to "cuda:1" in main()
remove import sparseconvnet as scn
change unittest.main() to main()
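The three edits above can also be applied mechanically. This is a minimal sketch, assuming test_conv.py contains those strings literally. `patch_test_conv` is a hypothetical helper, not part of spconv, and the sample text is made up for illustration. Unlike the instruction, it replaces "cuda:0" on every line rather than only inside main().

```python
def patch_test_conv(source: str) -> str:
    """Apply the three requested edits to the text of test_conv.py."""
    patched_lines = []
    for line in source.splitlines():
        # Remove the sparseconvnet import entirely.
        if line.strip() == "import sparseconvnet as scn":
            continue
        # Run on the second GPU (simplification: applied to every line).
        line = line.replace("cuda:0", "cuda:1")
        # Call main() directly instead of the unittest runner.
        line = line.replace("unittest.main()", "main()")
        patched_lines.append(line)
    return "\n".join(patched_lines) + "\n"


# Made-up stand-in for the real file contents.
sample = '''import unittest
import sparseconvnet as scn

def main():
    device = torch.device("cuda:0")

if __name__ == "__main__":
    unittest.main()
'''
print(patch_test_conv(sample))
```

In practice the helper would be fed the real file contents and the result written back, after checking the diff by hand.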

robin2002 commented 5 years ago

As you recommended, I revised test_conv.py in spconv, then recompiled and reinstalled spconv. After that, I got the error message below:

(base) robin2002@robin2002-desktop:/media/ssd_2tb/myGit/second.pytorch/second$ CUDA_VISIBLE_DEVICES=0,1 python ./pytorch/train.py train --config_path=./configs/all.lite.nu.anchor_revised.config --model_dir=/media/ssd_2tb/myGit/second.pytorch/second/model/20190412_nu_all --multi_gpu=True
[ 41 2000 2000]
num_trainable parameters: 39
False _amp_stash
MULTI-GPU: use 2 gpu
feature_map_size [1, 250, 250]
feature_map_size [1, 250, 250]
WORKER 0 seed: 1555119356
WORKER 1 seed: 1555119357
WORKER 2 seed: 1555119358
model: {
second: {
voxel_generator {
  point_cloud_range : [-50, -50.0, -4, 50, 50, 2]
  voxel_size : [0.05, 0.05, 0.15]
  max_number_of_points_per_voxel : 1
}

voxel_feature_extractor: {
  module_class_name: "SimpleVoxelRadius"
  num_filters: [16]
  with_distance: false
  num_input_features: 3
}
middle_feature_extractor: {
  module_class_name: "SpMiddleFHDLite"
  # num_filters_down1: [] # protobuf don't support empty list.
  # num_filters_down2: []
  downsample_factor: 8
  num_input_features: 2
}
rpn: {
  module_class_name: "RPNV2"
  layer_nums: [5]
  layer_strides: [1]
  num_filters: [128]
  upsample_strides: [1]
  num_upsample_filters: [128]
  use_groupnorm: false
  num_groups: 32
  num_input_features: 128
}
loss: {
  classification_loss: {
    weighted_sigmoid_focal: {
      alpha: 0.25
      gamma: 2.0
      anchorwise_output: true
    }
  }
  localization_loss: {
    weighted_smooth_l1: {
      sigma: 3.0
      code_weight: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
    }
  }
  classification_weight: 1.0
  localization_weight: 2.0
}
num_point_features: 3 # model's num point feature should be independent of dataset
# Outputs
use_sigmoid_score: true
encode_background_as_zeros: true
encode_rad_error_by_sin: true

use_direction_classifier: true # this can help for orientation benchmark
direction_loss_weight: 0.2 # enough.

# Loss
pos_class_weight: 1.0
neg_class_weight: 1.0

loss_norm_type: NormByNumPositives
# Postprocess
post_center_limit_range: [-50, -50, -4.0, 50, 50, 1.0]
use_rotate_nms: true
use_multi_class_nms: false
nms_pre_max_size: 1000
nms_post_max_size: 100
nms_score_threshold: 0.05 # 0.4 in submit, but 0.3 can get better hard performance
nms_iou_threshold: 0.5

box_coder: {
  ground_box3d_coder: {
    linear_dim: false
    encode_angle_vector: false
  }
}
target_assigner: {
  anchor_generators: {
    anchor_generator_range: {
      sizes: [1.95017717, 4.60718145, 1.72270761] # wlh
      anchor_ranges: [-50, -50.0, -0.93897414, 50, 50, -0.93897414]
      rotations: [0, 1.57] # DON'T modify this unless you are very familiar with my code.
      matched_threshold : 0.6
      unmatched_threshold : 0.45
      class_name: "car"
    }
  }
  anchor_generators: {
    anchor_generator_range: {
      sizes: [0.66344886, 0.7256437, 1.75748069] # wlh
      anchor_ranges: [-50, -50.0, -0.73911038, 50, 50, -0.73911038]
      rotations: [0, 1.57] # DON'T modify this unless you are very familiar with my code.
      matched_threshold : 0.35
      unmatched_threshold : 0.2
      class_name: "pedestrian"
    }
  }
  anchor_generators: {
    anchor_generator_range: {
      sizes: [0.60058911, 1.68452161, 1.27192197] # wlh
      anchor_ranges: [-50, -50.0, -1.03743013, 50, 50, -1.03743013]
      rotations: [0, 1.57] # DON'T modify this unless you are very familiar with my code.
      matched_threshold : 0.35
      unmatched_threshold : 0.2
      class_name: "bicycle"
    }
  }
  anchor_generators: {
    anchor_generator_range: {
      sizes: [0.76279481, 2.09973778, 1.44403034] # wlh
      anchor_ranges: [-50, -50.0, -0.99194854, 50, 50, -0.99194854]
      rotations: [0, 1.57] # DON'T modify this unless you are very familiar with my code.
      matched_threshold : 0.35
      unmatched_threshold : 0.2
      class_name: "motorcycle"
    }
  }
  anchor_generators: {
    anchor_generator_range: {
      sizes: [2.4560939, 6.73778078, 2.73004906] # wlh
      anchor_ranges: [-50, -50.0, -0.37937912, 50, 50, -0.37937912]
      rotations: [0, 1.57] # DON'T modify this unless you are very familiar with my code.
      matched_threshold : 0.6
      unmatched_threshold : 0.45
      class_name: "truck"
    }
  }
  anchor_generators: {
    anchor_generator_range: {
      sizes: [2.94046906, 11.1885991, 3.47030982] # wlh
      anchor_ranges: [-50, -50.0, -0.0715754, 50, 50, -0.0715754]
      rotations: [0, 1.57] # DON'T modify this unless you are very familiar with my code.
      matched_threshold : 0.6
      unmatched_threshold : 0.45
      class_name: "bus"
    }
  }

  sample_positive_fraction : -1
  sample_size : 512
  region_similarity_calculator: {
    nearest_iou_similarity: {
    }
  }
}

} }

train_input_reader: {
dataset: {
dataset_class_name: "NuScenesDataset"
kitti_info_path: "/media/ssd_2tb/dataset/nuscene/infos_train.pkl"
kitti_root_path: "/media/ssd_2tb/dataset/nuscene"

kitti_info_path: "/media/yy/960evo/datasets/nuscene/v1.0-mini/infos_train.pkl"

# kitti_root_path: "/media/yy/960evo/datasets/nuscene/v1.0-mini"

}

batch_size: 6
preprocess: {
max_number_of_voxels: 63000
shuffle_points: false
num_workers: 3
groundtruth_localization_noise_std: [0, 0, 0]
groundtruth_rotation_uniform_noise: [0, 0]

# groundtruth_localization_noise_std: [0.25, 0.25, 0.25]
# groundtruth_rotation_uniform_noise: [-0.3141592654, 0.3141592654]
# groundtruth_rotation_uniform_noise: [-0.78539816, 0.78539816]
global_rotation_uniform_noise: [-1.57, 1.57]
global_scaling_uniform_noise: [0.95, 1.05]
global_random_rotation_range_per_object: [0, 0] # pi/4 ~ 3pi/4
global_translate_noise_std: [0.2, 0.2, 0.2]
anchor_area_threshold: -1
remove_points_after_sample: true
groundtruth_points_drop_percentage: 0.0
groundtruth_drop_max_keep_points: 15
remove_unknown_examples: false
remove_environment: false
database_sampler {
  # leave this empty to disable database_sampler, nuscenes don't need sample
  # because 1. the number of ground-truth is enough. 2. sweeps don't support 
  # sample.
}

} }

train_config: {
optimizer: {
  adam_optimizer: {
    learning_rate: {
      one_cycle: {
        lr_max: 3e-3
        moms: [0.95, 0.85]
        div_factor: 10.0
        pct_start: 0.4
      }
    }
    weight_decay: 0.01
  }
  fixed_weight_decay: true
  use_moving_average: false
}
steps: 234450 # 4689 * 50 (28130 // 6 + 1)
steps_per_eval: 9378 # 4689 * 2

steps_per_eval: 500 # 4689 * 2

save_checkpoints_secs : 1800 # half hour
save_summary_steps : 10
enable_mixed_precision: false
loss_scale_factor: 8.0
clear_metrics_every_epoch: true
}

eval_input_reader: {
  batch_size: 6
  dataset: {
    dataset_class_name: "NuScenesDataset"
    kitti_info_path: "/media/ssd_2tb/dataset/nuscene/infos_val.pkl"
    kitti_root_path: "/media/ssd_2tb/dataset/nuscene"
  }
  preprocess: {
    max_number_of_voxels: 80000
    shuffle_points: false
    num_workers: 3
    anchor_area_threshold: -1
    remove_environment: false
  }
}

Traceback (most recent call last):
  File "./pytorch/train.py", line 259, in train
    ret_dict = net_parallel(example_torch)
  File "/home/robin2002/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/robin2002/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 143, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/robin2002/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 153, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/robin2002/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 83, in parallel_apply
    raise output
  File "/home/robin2002/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 59, in _worker
    output = module(*input, **kwargs)
  File "/home/robin2002/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/media/ssd_2tb/myGit/second.pytorch/second/pytorch/models/voxelnet.py", line 274, in forward
    voxel_features, coors, batch_size_dev)
  File "/home/robin2002/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/media/ssd_2tb/myGit/second.pytorch/second/pytorch/models/middle.py", line 1021, in forward
    ret = self.middle_conv(ret)
  File "/home/robin2002/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/robin2002/anaconda3/lib/python3.6/site-packages/spconv/modules.py", line 123, in forward
    input = module(input)
  File "/home/robin2002/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/robin2002/anaconda3/lib/python3.6/site-packages/spconv/conv.py", line 155, in forward
    self.stride, self.padding, self.dilation, self.output_padding, self.subm, self.transposed, grid=input.grid)
  File "/home/robin2002/anaconda3/lib/python3.6/site-packages/spconv/ops.py", line 89, in get_indice_pairs
    stride, padding, dilation, out_padding, int(subm), int(transpose))
RuntimeError: CUDA error: an illegal memory access was encountered (free_blocks at /opt/conda/conda-bld/pytorch_1550802451070/work/aten/src/THC/THCCachingAllocator.cpp:439)
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&) + 0x6d (0x7f9cf1d6969d in /home/robin2002/anaconda3/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: + 0xdae1f5 (0x7f9c8e02e1f5 in /home/robin2002/anaconda3/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so)
frame #2: + 0xdae5dd (0x7f9c8e02e5dd in /home/robin2002/anaconda3/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so)
frame #3: at::native::empty_cuda(c10::ArrayRef, at::TensorOptions const&) + 0x228 (0x7f9c8f9a87a8 in /home/robin2002/anaconda3/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so)
frame #4: at::CUDAIntType::empty(c10::ArrayRef, at::TensorOptions const&) const + 0x66 (0x7f9c8dfb0b36 in /home/robin2002/anaconda3/lib/python3.6/site-packages/torch/lib/libcaffe2_gpu.so)
frame #5: at::native::full(c10::ArrayRef, c10::Scalar, at::TensorOptions const&) + 0x69 (0x7f9ce2ca7d29 in /home/robin2002/anaconda3/lib/python3.6/site-packages/torch/lib/libcaffe2.so)
frame #6: torch::full(c10::ArrayRef, c10::Scalar, at::TensorOptions const&) + 0x22b (0x7f9c4c43e77b in /home/robin2002/anaconda3/lib/python3.6/site-packages/spconv/libspconv.so)
frame #7: std::vector<at::Tensor, std::allocator > spconv::getIndicePair<3u>(at::Tensor, long, std::vector<long, std::allocator >, std::vector<long, std::allocator >, std::vector<long, std::allocator >, std::vector<long, std::allocator >, std::vector<long, std::allocator >, std::vector<long, std::allocator >, std::vector<long,
std::allocator >, long, long) + 0x24f (0x7f9c4c4409ff in /home/robin2002/anaconda3/lib/python3.6/site-packages/spconv/libspconv.so) frame #8: void torch::jit::detail::callOperatorWithTuple<std::vector<at::Tensor, std::allocator > ( const)(at::Tensor, long, std::vector<long, std::allocator >, std::vector<long, std::allocator >, std::vector<long, std::allocator >, std::vector<long, std::allocator >, std::vector<long, std::allocator >, std::vector<long, std::allocator >, std::vector<long, std::allocator >, long, long), at::Tensor, long, std::vector<long, std::allocator >, std::vector<long, std::allocator >, std::vector<long, std::allocator >, std::vector<long, std::allocator >, std::vector<long, std::allocator >, std::vector<long, std::allocator >, std::vector<long, std::allocator >, long, long, 0ul, 1ul, 2ul, 3ul, 4ul, 5ul, 6ul, 7ul, 8ul, 9ul, 10ul>(c10::FunctionSchema const&, std::vector<at::Tensor, std::allocator > ( const&&)(at::Tensor, long, std::vector<long, std::allocator >, std::vector<long, std::allocator >, std::vector<long, std::allocator >, std::vector<long, std::allocator >, std::vector<long, std::allocator >, std::vector<long, std::allocator >, std::vector<long, std::allocator >, long, long), std::vector<c10::IValue, std::allocator >&, std::tuple<at::Tensor, long, std::vector<long, std::allocator >, std::vector<long, std::allocator >, std::vector<long, std::allocator >, std::vector<long, std::allocator >, std::vector<long, std::allocator >, std::vector<long, std::allocator >, std::vector<long, std::allocator >, long, long>&, torch::Indices<0ul, 1ul, 2ul, 3ul, 4ul, 5ul, 6ul, 7ul, 8ul, 9ul, 10ul>) + 0x5aa (0x7f9c4c44d5aa in /home/robin2002/anaconda3/lib/python3.6/site-packages/spconv/libspconv.so) frame #9: std::_Function_handler<int (std::vector<c10::IValue, std::allocator >&), torch::jit::createOperator<std::vector<at::Tensor, std::allocator > ()(at::Tensor, long, std::vector<long, std::allocator >, std::vector<long, std::allocator >, std::vector<long, 
std::allocator >, std::vector<long, std::allocator >, std::vector<long, std::allocator >, std::vector<long, std::allocator >, std::vector<long, std::allocator >, long, long)>(std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, std::vector<at::Tensor, std::allocator > (&&)(at::Tensor, long, std::vector<long, std::allocator >, std::vector<long, std::allocator >, std::vector<long, std::allocator >, std::vector<long, std::allocator >, std::vector<long, std::allocator >, std::vector<long, std::allocator >, std::vector<long, std::allocator >, long, long))::{lambda(std::vector<c10::IValue, std::allocator >&)#1}>::_M_invoke(std::_Any_data const&, std::vector<c10::IValue, std::allocator >&) + 0x127 (0x7f9c4c44d9d7 in /home/robin2002/anaconda3/lib/python3.6/site-packages/spconv/libspconv.so) frame #10: + 0x33c532 (0x7f9ce600e532 in /home/robin2002/anaconda3/lib/python3.6/site-packages/torch/lib/libtorch_python.so) frame #11: + 0x33c785 (0x7f9ce600e785 in /home/robin2002/anaconda3/lib/python3.6/site-packages/torch/lib/libtorch_python.so) frame #12: + 0xfd760 (0x7f9ce5dcf760 in /home/robin2002/anaconda3/lib/python3.6/site-packages/torch/lib/libtorch_python.so)

During handling of the above exception, another exception occurred: Traceback (most recent call last): File "./pytorch/train.py", line 507, in fire.Fire() File "/home/robin2002/anaconda3/lib/python3.6/site-packages/fire/core.py", line 127, in Fire component_trace = _Fire(component, args, context, name) File "/home/robin2002/anaconda3/lib/python3.6/site-packages/fire/core.py", line 366, in _Fire component, remaining_args) File "/home/robin2002/anaconda3/lib/python3.6/site-packages/fire/core.py", line 542, in _CallCallable result = fn(*varargs, **kwargs) File "./pytorch/train.py", line 370, in train net.get_global_step()) File "/media/ssd_2tb/myGit/second.pytorch/second/pytorch/models/voxelnet.py", line 235, in get_global_step return int(self.global_step.cpu().numpy()[0]) RuntimeError: CUDA error: an illegal memory access was encountered
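
For anyone reproducing this: CUDA kernels run asynchronously, so the Python frame named in an "illegal memory access" traceback may not be where the fault occurred. The second traceback above, where the same error resurfaces in `get_global_step`, is consistent with that. A minimal sketch, assuming the variable is set before anything initializes CUDA (for example at the very top of `train.py`, before `import torch`):

```python
import os

# Assumption: this runs before torch (or any other CUDA library) is imported.
# CUDA_LAUNCH_BLOCKING=1 makes kernel launches synchronous, so the error is
# reported at the call that caused it rather than at a later CUDA call.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

print("CUDA_LAUNCH_BLOCKING =", os.environ["CUDA_LAUNCH_BLOCKING"])
```

The same effect can be had by prefixing the training command with `CUDA_LAUNCH_BLOCKING=1`. Training is slower in this mode, so it is only meant for locating the failing call.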

traveller59 commented 5 years ago

Run "test_conv.py"; do not reinstall spconv and rerun training. I want to know whether this is a spconv problem.

robin2002 commented 5 years ago

I'm sorry, I misunderstood. This is the result.

(base) robin2002@robin2002-desktop:~/projects/spconv/test$ python test_conv.py Traceback (most recent call last): File "test_conv.py", line 615, in main() File "test_conv.py", line 604, in main out = net(features_t, indices_t, bs) File "/home/robin2002/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in call result = self.forward(*input, kwargs) File "test_conv.py", line 56, in forward return self.net(x)# .dense() File "/home/robin2002/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in call result = self.forward(*input, *kwargs) File "/home/robin2002/anaconda3/lib/python3.6/site-packages/spconv/modules.py", line 123, in forward input = module(input) File "/home/robin2002/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in call result = self.forward(input, kwargs) File "/home/robin2002/anaconda3/lib/python3.6/site-packages/spconv/conv.py", line 155, in forward self.stride, self.padding, self.dilation, self.output_padding, self.subm, self.transposed, grid=input.grid) File "/home/robin2002/anaconda3/lib/python3.6/site-packages/spconv/ops.py", line 89, in get_indice_pairs stride, padding, dilation, out_padding, int(subm), int(transpose)) RuntimeError: /home/robin2002/projects/spconv/include/tensorview/helper_launch.h 17 N > 0 assert faild. CUDA kernel launch blocks must be positive, but got N= 0

traveller59 commented 5 years ago

You can try PointPillars with multi-GPU first. This problem requires C++ debugging to find out which function has the problem, and I don't have a multi-GPU environment to do this. If you are interested in debugging the C++ code, you can add some print statements to find out which getBlocks call in CreateConvIndicePairFunctorP1 and CreateConvIndicePairFunctorP2 throws that error.
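
The assertion in the `test_conv.py` output ("CUDA kernel launch blocks must be positive, but got N= 0") suggests that some launch-size helper received an element count of zero. The sketch below is a hypothetical Python rendering of that kind of check, not spconv's actual code: the name `get_blocks`, the 512 threads per block, and the `describe_launch` helper are all assumptions made for illustration. It shows the sort of per-call print that would reveal which launch receives `N = 0`.

```python
def get_blocks(n: int, threads_per_block: int = 512) -> int:
    """Hypothetical launch-size helper: number of blocks needed to cover n items."""
    if n <= 0:
        # Mirrors the wording of the assertion seen in the log above.
        raise ValueError(f"CUDA kernel launch blocks must be positive, but got N= {n}")
    # Ceiling division: round up so every item is covered by some thread.
    return (n + threads_per_block - 1) // threads_per_block


def describe_launch(name: str, n: int) -> str:
    """Report the launch size for a named call, or the failure if n is invalid."""
    try:
        return f"{name}: N={n}, blocks={get_blocks(n)}"
    except ValueError as err:
        return f"{name}: N={n}, FAILED ({err})"


# Illustrative call sites only; the labels and counts are made up.
for name, n in [("P1", 1000), ("P2", 0)]:
    print(describe_launch(name, n))
```

In the real C++ code the equivalent would be a print of the element count immediately before each `getBlocks` call, so the log shows which functor is handed an empty input.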

zhixinwang commented 5 years ago

The bug is from spconv. It seems spconv cannot work in a multi-GPU setting. https://github.com/traveller59/spconv/issues/35

robin2002 commented 5 years ago

As you recommended, I tried to train PointPillars with multi-GPU, but it did not work.

(base) robin2002@robin2002-desktop:/media/ssd_2tb/myGit/second.pytorch/second$ CUDA_VISIBLE_DEVICES=0,1 python ./pytorch/train.py train --config_path=./configs/nuscenes/all.pp.config --model_dir=/media/ssd_2tb/myGit/second.pytorch/second/model/20190425_pp_nu --multi_gpu=Truenum parameters: 66 False _amp_stash MULTI-GPU: use 2 gpu feature_map_size [1, 248, 248] feature_map_size [1, 248, 248] model: { second: { voxel_generator { point_cloud_range : [-49.6, -49.6, -5, 49.6, 49.6, 3] voxel_size : [0.2, 0.2, 8] max_number_of_points_per_voxel : 40 } voxel_feature_extractor: { module_class_name: "PillarFeatureNet" num_filters: [64] with_distance: false num_input_features: 4 } middle_feature_extractor: { module_class_name: "PointPillarsScatter" downsample_factor: 1 num_input_features: 64 } rpn: { module_class_name: "RPNV2" layer_nums: [3, 5, 5] layer_strides: [2, 2, 2] num_filters: [64, 128, 256] upsample_strides: [1, 2, 4] num_upsample_filters: [128, 128, 128] use_groupnorm: false num_groups: 32 num_input_features: 64 } loss: { classification_loss: { weighted_sigmoid_focal: { alpha: 0.25 gamma: 2.0 anchorwise_output: true } } localization_loss: { weighted_smooth_l1: { sigma: 3.0 code_weight: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0] } } classification_weight: 1.0 localization_weight: 2.0 } num_point_features: 4 # model's num point feature should be independent of dataset

Outputs

use_sigmoid_score: true
encode_background_as_zeros: true
encode_rad_error_by_sin: true

use_direction_classifier: true
direction_loss_weight: 0.2

# Loss
pos_class_weight: 1.0
neg_class_weight: 1.0

loss_norm_type: NormByNumPositives
# Postprocess
post_center_limit_range: [-59.6, -59.6, -6, 59.6, 59.6, 4]
use_rotate_nms: false
use_multi_class_nms: false
nms_pre_max_size: 1000
nms_post_max_size: 300
nms_score_threshold: 0.05
nms_iou_threshold: 0.5

box_coder: {
  ground_box3d_coder: {
    linear_dim: false
    encode_angle_vector: false
  }
}
target_assigner: {
  anchor_generators: {
    anchor_generator_range: {
      sizes: [1.95017717, 4.60718145, 1.72270761] # wlh
      anchor_ranges: [-49.6, -49.6, -0.93897414, 49.6, 49.6, -0.93897414]
      rotations: [0, 1.57] # DON'T modify this unless you are very familiar with my code.
      matched_threshold : 0.6
      unmatched_threshold : 0.45
      class_name: "car"
    }
    region_similarity_calculator: {
      nearest_iou_similarity: {
      }
    }
  }
  anchor_generators: {
    anchor_generator_range: {
      sizes: [0.60058911, 1.68452161, 1.27192197] # wlh
      anchor_ranges: [-49.6, -49.6, -1.03743013, 49.6, 49.6, -1.03743013]
      rotations: [0, 1.57] # DON'T modify this unless you are very familiar with my code.
      matched_threshold : 0.4
      unmatched_threshold : 0.2
      class_name: "bicycle"
    }
    region_similarity_calculator: {
      nearest_iou_similarity: {
      }
    }

  }
  anchor_generators: {
    anchor_generator_range: {
      sizes: [2.94046906, 11.1885991, 3.47030982] # wlh
      anchor_ranges: [-49.6, -49.6, -0.0715754, 49.6, 49.6, -0.0715754]
      rotations: [0, 1.57] # DON'T modify this unless you are very familiar with my code.
      matched_threshold : 0.7
      unmatched_threshold : 0.4
      class_name: "bus"
    }
    region_similarity_calculator: {
      nearest_iou_similarity: {
      }
    }

  }
  anchor_generators: {
    anchor_generator_range: {
      sizes: [2.73050468, 6.38352896, 3.13312415] # wlh
      anchor_ranges: [-49.6, -49.6, -0.08168083, 49.6, 49.6, -0.08168083]
      rotations: [0, 1.57] # DON'T modify this unless you are very familiar with my code.
      matched_threshold : 0.6
      unmatched_threshold : 0.45
      class_name: "construction_vehicle"
    }
    region_similarity_calculator: {
      nearest_iou_similarity: {
      }
    }

  }
  anchor_generators: {
    anchor_generator_range: {
      sizes: [0.76279481, 2.09973778, 1.44403034] # wlh
      anchor_ranges: [-49.6, -49.6, -0.99194854, 49.6, 49.6, -0.99194854]
      rotations: [0, 1.57] # DON'T modify this unless you are very familiar with my code.
      matched_threshold : 0.4
      unmatched_threshold : 0.2
      class_name: "motorcycle"
    }
    region_similarity_calculator: {
      nearest_iou_similarity: {
      }
    }

  }
  anchor_generators: {
    anchor_generator_range: {
      sizes: [0.66344886, 0.7256437, 1.75748069] # wlh
      anchor_ranges: [-49.6, -49.6, -0.73911038, 49.6, 49.6, -0.73911038]
      rotations: [0, 1.57] # DON'T modify this unless you are very familiar with my code.
      matched_threshold : 0.4
      unmatched_threshold : 0.2
      class_name: "pedestrian"
    }
    region_similarity_calculator: {
      nearest_iou_similarity: {
      }
    }

  }
  anchor_generators: {
    anchor_generator_range: {
      sizes: [0.39694519, 0.40359262, 1.06232151] # wlh
      anchor_ranges: [-49.6, -49.6, -1.27868911, 49.6, 49.6, -1.27868911]
      rotations: [0, 1.57] # DON'T modify this unless you are very familiar with my code.
      matched_threshold : 0.3
      unmatched_threshold : 0.15
      class_name: "traffic_cone"
    }
    region_similarity_calculator: {
      nearest_iou_similarity: {
      }
    }
  }
  anchor_generators: {
    anchor_generator_range: {
      sizes: [2.87427237, 12.01320693, 3.81509561] # wlh
      anchor_ranges: [-49.6, -49.6, 0.22228277, 49.6, 49.6, 0.22228277]
      rotations: [0, 1.57] # DON'T modify this unless you are very familiar with my code.
      matched_threshold : 0.6
      unmatched_threshold : 0.45
      class_name: "trailer"
    }
    region_similarity_calculator: {
      nearest_iou_similarity: {
      }
    }
  }
  anchor_generators: {
    anchor_generator_range: {
      sizes: [2.4560939, 6.73778078, 2.73004906] # wlh
      anchor_ranges: [-49.6, -49.6, -0.37937912, 49.6, 49.6, -0.37937912]
      rotations: [0, 1.57] # DON'T modify this unless you are very familiar with my code.
      matched_threshold : 0.6
      unmatched_threshold : 0.45
      class_name: "truck"
    }
    region_similarity_calculator: {
      nearest_iou_similarity: {
      }
    }
  }
  sample_positive_fraction : -1
  sample_size : 512
}

} }

train_input_reader: { dataset: { dataset_class_name: "NuScenesDatasetD8" kitti_info_path: "/media/ssd_2tb/dataset/nuscene/infos_train.pkl" kitti_root_path: "/media/ssd_2tb/dataset/nuscene" }

batch_size: 3 preprocess: { max_number_of_voxels: 30000 shuffle_points: false num_workers: 4 groundtruth_localization_noise_std: [0, 0, 0] groundtruth_rotation_uniform_noise: [0, 0]

groundtruth_localization_noise_std: [0.25, 0.25, 0.25]

# groundtruth_rotation_uniform_noise: [-0.15707963267, 0.15707963267]
# global_rotation_uniform_noise: [-0.78539816, 0.78539816]
global_rotation_uniform_noise: [0, 0]
global_scaling_uniform_noise: [0.95, 1.05]
global_random_rotation_range_per_object: [0, 0]
global_translate_noise_std: [0.2, 0.2, 0.2]
anchor_area_threshold: -1
remove_points_after_sample: false
groundtruth_points_drop_percentage: 0.0
groundtruth_drop_max_keep_points: 15
remove_unknown_examples: false
remove_environment: false
database_sampler {
}

} }

train_config: { optimizer: { adam_optimizer: { learning_rate: { one_cycle: { lr_max: 3e-3 moms: [0.95, 0.85] div_factor: 10.0 pct_start: 0.4 } } weight_decay: 0.01 } fixed_weight_decay: true use_moving_average: false } steps: 58650 # 1173 50 (3517 // 3 + 1) steps_per_eval: 5865 # 1173 5 save_checkpoints_secs : 1800 # half hour save_summary_steps : 10 enable_mixed_precision: false loss_scale_factor: -1 clear_metrics_every_epoch: true }

eval_input_reader: { dataset: { dataset_class_name: "NuScenesDataset" kitti_info_path: "/media/ssd_2tb/dataset/nuscene/infos_val.pkl" kitti_root_path: "/media/ssd_2tb/dataset/nuscene" } batch_size: 1

preprocess: { max_number_of_voxels: 40000 shuffle_points: false num_workers: 3 anchor_area_threshold: -1 remove_environment: false } }

WORKER 0 seed: 1556235365 WORKER 1 seed: 1556235366 WORKER 2 seed: 1556235367 WORKER 3 seed: 1556235368 WORKER 4 seed: 1556235369 WORKER 5 seed: 1556235370 WORKER 6 seed: 1556235371 WORKER 7 seed: 1556235372 /home/robin2002/anaconda3/lib/python3.6/site-packages/torch/nn/_reduction.py:49: UserWarning: size_average and reduce args will be deprecated, please use reduction='none' instead. warnings.warn(warning.format(ret)) binary_op(): expected both inputs to be on same device, but input a is on cuda:0 and input b is on cuda:1 Traceback (most recent call last): File "./pytorch/train.py", line 541, in fire.Fire() File "/home/robin2002/anaconda3/lib/python3.6/site-packages/fire/core.py", line 127, in Fire component_trace = _Fire(component, args, context, name) File "/home/robin2002/anaconda3/lib/python3.6/site-packages/fire/core.py", line 366, in _Fire component, remaining_args) File "/home/robin2002/anaconda3/lib/python3.6/site-packages/fire/core.py", line 542, in _CallCallable result = fn(*varargs, kwargs) File "./pytorch/train.py", line 419, in train raise e File "./pytorch/train.py", line 306, in train ret_dict = net_parallel(example_torch) File "/home/robin2002/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in call result = self.forward(*input, *kwargs) File "/home/robin2002/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 143, in forward outputs = self.parallel_apply(replicas, inputs, kwargs) File "/home/robin2002/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 153, in parallel_apply return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)]) File "/home/robin2002/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 83, in parallel_apply raise output File "/home/robin2002/anaconda3/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 59, in _worker output = module(input, kwargs) File 
"/home/robin2002/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in call result = self.forward(input, kwargs) File "/media/ssd_2tb/myGit/second.pytorch/second/pytorch/models/voxelnet.py", line 296, in forward box_code_size=self._box_coder.code_size, File "/media/ssd_2tb/myGit/second.pytorch/second/pytorch/models/voxelnet.py", line 643, in create_loss box_preds, reg_targets, weights=reg_weights) # [N, M] File "/media/ssd_2tb/myGit/second.pytorch/second/pytorch/core/losses.py", line 81, in call return self._compute_loss(prediction_tensor, target_tensor, params) File "/media/ssd_2tb/myGit/second.pytorch/second/pytorch/core/losses.py", line 169, in _compute_loss diff = code_weights.view(1, 1, -1) diff RuntimeError: binary_op(): expected both inputs to be on same device, but input a is on cuda:0 and input b is on cuda:1

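The last error ("input a is on cuda:0 and input b is on cuda:1") is raised at the line that multiplies `code_weights` by `diff`. One common cause under `nn.DataParallel`, offered here only as a guess about this code base, is a weight tensor that stays on the first GPU while each replica's inputs live on its own device. A minimal sketch of two usual remedies, using a hypothetical `WeightedDiff` module rather than the actual class in `losses.py`:

```python
import torch
import torch.nn as nn


class WeightedDiff(nn.Module):
    """Hypothetical stand-in for the loss step that scales diff by code_weights."""

    def __init__(self, code_weights):
        super().__init__()
        # Remedy 1: a registered buffer is replicated to each GPU by
        # nn.DataParallel, unlike a plain tensor attribute.
        self.register_buffer(
            "code_weights", torch.as_tensor(code_weights, dtype=torch.float32)
        )

    def forward(self, prediction, target):
        diff = prediction - target
        # Remedy 2: move the weights to the device of the data explicitly,
        # which also covers weights that arrive from outside the module.
        weights = self.code_weights.to(diff.device)
        return weights.view(1, 1, -1) * diff


loss_step = WeightedDiff([1.0] * 7)
out = loss_step(torch.ones(2, 3, 7), torch.zeros(2, 3, 7))
print(out.shape)
```

Either remedy alone is normally enough; the sketch shows both so they can be compared. It runs on CPU, so it demonstrates the pattern rather than reproducing the two-GPU failure.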
traveller59 / second.pytorch

Multi-gpu training is not working!! #154

