evaluation cuda problems

zwyzwy commented 5 years ago

While I was training the model, the intermediate evaluation step failed with the error below:

Traceback (most recent call last):
  File "vox_gluon/train_gluon.py", line 759, in <module>
    fire.Fire()
  File "/home/users/wenyong.zheng/anaconda3/lib/python3.6/site-packages/fire/core.py", line 127, in Fire
    component_trace = _Fire(component, args, context, name)
  File "/home/users/wenyong.zheng/anaconda3/lib/python3.6/site-packages/fire/core.py", line 366, in _Fire
    component, remaining_args)
  File "/home/users/wenyong.zheng/anaconda3/lib/python3.6/site-packages/fire/core.py", line 542, in _CallCallable
    result = fn(*varargs, **kwargs)
  File "vox_gluon/train_gluon.py", line 504, in train
    raise e
  File "vox_gluon/train_gluon.py", line 486, in train
    result = get_official_eval_result(gt_annos[:len(gt_annos)-1], dt_annos, class_names)
  File "/mnt/data-3/data/wenyong.zheng/vxlnet/second.pytorch/second/utils/eval.py", line 824, in get_official_eval_result
    mAPbbox, mAPbev, mAP3d, mAPaos = do_eval_v2(gt_annos, dt_annos, current_classes, min_overlaps, compute_aos, difficultys)
  File "/mnt/data-3/data/wenyong.zheng/vxlnet/second.pytorch/second/utils/eval.py", line 701, in do_eval_v2
    ret = eval_class_v3(gt_annos, dt_annos, current_classes, difficultys, 1, min_overlaps)
  File "/mnt/data-3/data/wenyong.zheng/vxlnet/second.pytorch/second/utils/eval.py", line 574, in eval_class_v3
    rets = calculate_iou_partly(dt_annos, gt_annos, metric, num_parts)
  File "/mnt/data-3/data/wenyong.zheng/vxlnet/second.pytorch/second/utils/eval.py", line 384, in calculate_iou_partly
    overlap_part = bev_box_overlap(gt_boxes, dt_boxes).astype(np.float64)
  File "/mnt/data-3/data/wenyong.zheng/vxlnet/second.pytorch/second/utils/eval.py", line 126, in bev_box_overlap
    riou = rotate_iou_gpu_eval(boxes, qboxes, criterion)
  File "/mnt/data-3/data/wenyong.zheng/vxlnet/second.pytorch/second/core/non_max_suppression/nms_gpu.py", line 652, in rotate_iou_gpu_eval
    N, K, boxes_dev, query_boxes_dev, iou_dev, criterion)
  File "/home/users/wenyong.zheng/anaconda3/lib/python3.6/site-packages/numba/cuda/compiler.py", line 484, in __call__
    sharedmem=self.sharedmem)
  File "/home/users/wenyong.zheng/anaconda3/lib/python3.6/site-packages/numba/cuda/compiler.py", line 558, in _kernel_call
    cu_func(*kernelargs)
  File "/home/users/wenyong.zheng/anaconda3/lib/python3.6/site-packages/numba/cuda/cudadrv/driver.py", line 1301, in __call__
    self.sharedmem, streamhandle, args)
  File "/home/users/wenyong.zheng/anaconda3/lib/python3.6/site-packages/numba/cuda/cudadrv/driver.py", line 1345, in launch_kernel
    None)
  File "/home/users/wenyong.zheng/anaconda3/lib/python3.6/site-packages/numba/cuda/cudadrv/driver.py", line 288, in safe_cuda_api_call
    self._check_error(fname, retcode)
  File "/home/users/wenyong.zheng/anaconda3/lib/python3.6/site-packages/numba/cuda/cudadrv/driver.py", line 323, in _check_error
    raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [400] Call to cuLaunchKernel results in CUDA_ERROR_INVALID_HANDLE

By the way, have you tried training on multiple GPUs?

traveller59 commented 5 years ago

I have no idea what causes this problem. I will try to create a Docker image for this project so that errors can be reproduced in a known environment. Multi-GPU training is currently not supported, mainly because I only have one GPU. If you want to train on multiple GPUs, you need to pad the input (or simply not slice the arrays in point_to_voxel) and then slice the points inside the module.
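As a rough illustration of the "pad the input" step, here is a minimal NumPy sketch. It is not code from this repository: the helper name `pad_voxels`, the `max_voxels` limit, and the array shapes are assumptions made for the example. The idea is that every sample is zero-padded to the same first dimension, and the true voxel count is kept alongside it so the valid rows can be recovered later.

```python
import numpy as np


def pad_voxels(voxels, coords, num_points, max_voxels):
    """Zero-pad one sample's voxel arrays to max_voxels rows.

    Hypothetical helper: returns the padded arrays plus voxel_num,
    the number of valid (non-padding) rows.
    """
    voxel_num = voxels.shape[0]
    if voxel_num > max_voxels:
        raise ValueError("sample has more voxels than max_voxels")
    pad = max_voxels - voxel_num

    def pad_rows(arr):
        # Pad only along axis 0; all trailing axes keep their size.
        widths = [(0, pad)] + [(0, 0)] * (arr.ndim - 1)
        return np.pad(arr, widths, mode="constant")

    return pad_rows(voxels), pad_rows(coords), pad_rows(num_points), voxel_num


# Two samples with different voxel counts (shapes are illustrative only).
samples = []
for n in (5, 8):
    voxels = np.ones((n, 35, 4), dtype=np.float32)  # (voxels, points per voxel, features)
    coords = np.ones((n, 3), dtype=np.int32)        # voxel grid coordinates
    num_points = np.ones((n,), dtype=np.int32)      # points stored in each voxel
    samples.append(pad_voxels(voxels, coords, num_points, max_voxels=10))

# Every sample now has the same shape, so the batch can be stacked.
batch_voxels = np.stack([s[0] for s in samples])
batch_voxel_num = np.array([s[3] for s in samples])
```

With a fixed-size batch like this, each sample occupies one row of the leading dimension, which is the dimension that nn.DataParallel splits across GPUs.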

zwyzwy commented 5 years ago

What do you mean by "slice array in point_to_voxel" and "slice points inside module"? The code puts the points of all samples together in one single batch, so how can I tell how many points belong to each sample?

traveller59 commented 5 years ago

The number of voxels converted from the points is not fixed; you can see a slice operation in point_to_voxel. For multi-GPU training, you need to return voxel_num from point_to_voxel, use fixed-size input before nn.DataParallel, pass voxel_num as a Tensor, and gather all valid voxels inside the nn.Module wrapped by nn.DataParallel.
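The "gather valid voxels inside the module" step might look roughly like the PyTorch sketch below. It is an illustration under assumptions, not this project's implementation: `VoxelEncoder`, its layer sizes, and the tensor shapes are hypothetical, and the voxel features are flattened to a single vector per voxel for brevity. The padded input and `voxel_num` both have the batch as their first dimension, so nn.DataParallel can split them together.

```python
import torch
import torch.nn as nn


class VoxelEncoder(nn.Module):
    """Hypothetical stand-in for a per-voxel feature extractor."""

    def __init__(self, in_features=4, out_features=16):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)

    def forward(self, voxels, voxel_num):
        # voxels:    (B, max_voxels, in_features), zero-padded per sample
        # voxel_num: (B,) number of valid voxels in each sample
        max_voxels = voxels.shape[1]
        idx = torch.arange(max_voxels, device=voxels.device)
        valid = idx.unsqueeze(0) < voxel_num.unsqueeze(1)  # (B, max_voxels) mask
        feats = self.linear(voxels[valid])                 # only valid voxels are processed
        # Write results back into a fixed-size tensor so the per-GPU
        # outputs can be concatenated along the batch dimension.
        out = feats.new_zeros(voxels.shape[0], max_voxels, feats.shape[-1])
        out[valid] = feats
        return out


model = VoxelEncoder()
# On a multi-GPU machine one would wrap the model instead, for example:
#   model = nn.DataParallel(model.cuda())

voxels = torch.zeros(2, 10, 4)
voxels[0, :5] = 1.0  # sample 0 has 5 valid voxels
voxels[1, :8] = 1.0  # sample 1 has 8 valid voxels
voxel_num = torch.tensor([5, 8])

out = model(voxels, voxel_num)
```

Passing `voxel_num` as a Tensor rather than a Python list matters here, because nn.DataParallel scatters tensor arguments along dimension 0 and each replica then receives the counts that match its own slice of the batch.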

jiangzhengkai commented 5 years ago

@zwyzwy have you found a solution?

traveller59 / second.pytorch

evaluation cuda problems #44