tusen-ai / SST

Code for a series of work in LiDAR perception, including SST (CVPR 22), FSD (NeurIPS 22), FSD++ (TPAMI 23), FSDv2, and CTRL (ICCV 23, oral).
Apache License 2.0

Training on kitti dataset #20

Closed QYChan closed 2 years ago

QYChan commented 2 years ago

Hi, I tried to train SST on the KITTI dataset. Following some previous issues in this repo, I modified the config file, but it did not work. I noticed that the error happens when the dynamic voxelization layer receives a point cloud of shape [0, 4]. I printed the file name of the "empty" point cloud, but the file itself is not actually empty. I am confused by this error. Could you give me some suggestions on how to fix it? Thank you in advance!


2022-03-17 21:00:37,919 - mmdet - INFO - Checkpoints will be saved to /data1/cqy/code/SST/work_dirs/sst_kittiD1_1x_3class_8heads/3161933 by HardDiskBackend.
2022-03-17 21:00:38.164014: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
drop_info is set to {0: {'max_tokens': 30, 'drop_range': (0, 30)}, 1: {'max_tokens': 60, 'drop_range': (30, 60)}, 2: {'max_tokens': 100, 'drop_range': (60, 100000)}}, in input_layer
/opt/anaconda3/envs/SST/lib/python3.8/site-packages/torch/_tensor.py:575: UserWarning: floor_divide is deprecated, and will be removed in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values.
To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor'). (Triggered internally at  /pytorch/aten/src/ATen/native/BinaryOps.cpp:467.)
  return torch.floor_divide(self, other)
2022-03-17 21:00:54,870 - mmcv - INFO - Reducer buckets have been rebuilt in this iteration.
No voxel belongs to drop_level:1 in shift 0
No voxel belongs to drop_level:2 in shift 1
No voxel belongs to drop_level:1 in shift 1
No voxel belongs to drop_level:2 in shift 0
2022-03-17 21:01:12,848 - mmdet - INFO - Epoch [1][50/7424]     lr: 1.000e-03, eta: 1 day, 17:31:34, time: 0.671, data_time: 0.293, memory: 3440, loss_cls: 25.1039, loss_bbox: 0.8919, loss_dir: 0.2588, loss: 26.2546, grad_norm: 138.2142
2022-03-17 21:01:31,472 - mmdet - INFO - Epoch [1][100/7424]    lr: 1.000e-03, eta: 1 day, 8:16:30, time: 0.372, data_time: 0.003, memory: 3440, loss_cls: 0.6193, loss_bbox: 0.5477, loss_dir: 0.1365, loss: 1.3034, grad_norm: 8.0083
2022-03-17 21:01:49,711 - mmdet - INFO - Epoch [1][150/7424]    lr: 1.000e-03, eta: 1 day, 5:01:46, time: 0.365, data_time: 0.003, memory: 3440, loss_cls: 0.5878, loss_bbox: 0.5475, loss_dir: 0.1352, loss: 1.2706, grad_norm: 7.4713
2022-03-17 21:02:08,372 - mmdet - INFO - Epoch [1][200/7424]    lr: 1.000e-03, eta: 1 day, 3:32:03, time: 0.373, data_time: 0.003, memory: 3440, loss_cls: 0.5782, loss_bbox: 0.5325, loss_dir: 0.1384, loss: 1.2492, grad_norm: 6.6802
data/kitti/training/velodyne_reduced/006279.bin
Traceback (most recent call last):
  File "tools/train.py", line 230, in <module>
    main()
  File "tools/train.py", line 220, in main
    train_model(
  File "/data1/cqy/code/SST/mmdet3d/apis/train.py", line 27, in train_model
    train_detector(
  File "/opt/anaconda3/envs/SST/lib/python3.8/site-packages/mmdet/apis/train.py", line 170, in train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/opt/anaconda3/envs/SST/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/opt/anaconda3/envs/SST/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 50, in train
    self.run_iter(data_batch, train_mode=True, **kwargs)
  File "/opt/anaconda3/envs/SST/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 29, in run_iter
    outputs = self.model.train_step(data_batch, self.optimizer,
  File "/opt/anaconda3/envs/SST/lib/python3.8/site-packages/mmcv/parallel/distributed.py", line 52, in train_step
    output = self.module.train_step(*inputs[0], **kwargs[0])
  File "/opt/anaconda3/envs/SST/lib/python3.8/site-packages/mmdet/models/detectors/base.py", line 237, in train_step
    losses = self(**data)
  File "/opt/anaconda3/envs/SST/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/anaconda3/envs/SST/lib/python3.8/site-packages/mmcv/runner/fp16_utils.py", line 98, in new_func
    return old_func(*args, **kwargs)
  File "/data1/cqy/code/SST/mmdet3d/models/detectors/base.py", line 58, in forward
    return self.forward_train(**kwargs)
  File "/data1/cqy/code/SST/mmdet3d/models/detectors/voxelnet.py", line 90, in forward_train
    x = self.extract_feat(points, img_metas)
  File "/data1/cqy/code/SST/mmdet3d/models/detectors/dynamic_voxelnet.py", line 39, in extract_feat
    voxels, coors = self.voxelize(points)
  File "/opt/anaconda3/envs/SST/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/opt/anaconda3/envs/SST/lib/python3.8/site-packages/mmcv/runner/fp16_utils.py", line 186, in new_func
    return old_func(*args, **kwargs)
  File "/data1/cqy/code/SST/mmdet3d/models/detectors/dynamic_voxelnet.py", line 62, in voxelize
    res_coors = self.voxel_layer(res)
  File "/opt/anaconda3/envs/SST/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data1/cqy/code/SST/mmdet3d/ops/voxel/voxelize.py", line 113, in forward
    return voxelization(input, self.voxel_size, self.point_cloud_range,
  File "/data1/cqy/code/SST/mmdet3d/ops/voxel/voxelize.py", line 44, in forward
    dynamic_voxelize(points, coors, voxel_size, coors_range, 3)
RuntimeError: CUDA error: invalid configuration argument
Abyssaledge commented 2 years ago

Thanks for using SST. I guess the data preprocessing pipeline discards some points that fall outside the point_cloud_range, so no points reach the network. I suggest checking whether the point cloud is empty at the following stages:

  1. right after reading the point cloud;
  2. at the very beginning of the network;
  3. at the beginning of the voxelization layer.

Let me know if you have further issues.
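The checks above can be sketched as a small helper that fails early with the stage name instead of crashing later inside a CUDA kernel (a minimal sketch with a hypothetical helper name, assuming the points are a NumPy array of shape (N, 4)):

```python
import numpy as np

def assert_nonempty(points, stage):
    """Raise with the stage name if the point cloud is empty at this stage."""
    if points.shape[0] == 0:
        raise ValueError(f"empty point cloud after stage: {stage}")
    return points

pts = np.zeros((19076, 4), dtype=np.float32)
assert_nonempty(pts, "loading")            # passes through unchanged
# assert_nonempty(pts[:0], "voxelization") # would raise ValueError
```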
QYChan commented 2 years ago

Thanks for your reply! I printed the shape of the points at the beginning and end of every stage in the data preparation pipeline. The number of points becomes zero after the PointsRangeFilter stage.


at loading file end: torch.Size([19076, 4])
ObjectSample begin :torch.Size([19076, 4])
ObjectSample end: torch.Size([22446, 4])
ObjectNoise begin: torch.Size([22446, 4])
ObjectNoise end: torch.Size([22446, 4])
PointsRangeFilter begin:torch.Size([22446, 4])
pcd range:[  0.   -39.68  -3.    69.12  39.68   1.  ]
PointsRangeFilter end:torch.Size([0, 4])
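For reference, a range filter of this kind essentially applies a per-axis mask like the following (a simplified NumPy sketch, not the actual mmdet3d implementation). If a 22k-point cloud collapses to zero points here, every coordinate must have fallen outside the configured range, which usually points to corrupted or misinterpreted coordinates rather than a truly empty scan:

```python
import numpy as np

# point cloud range from the log: [x_min, y_min, z_min, x_max, y_max, z_max]
pc_range = np.array([0.0, -39.68, -3.0, 69.12, 39.68, 1.0])

def points_range_filter(points, pc_range):
    """Keep only points whose x/y/z coordinates fall inside pc_range."""
    mask = ((points[:, 0] >= pc_range[0]) & (points[:, 0] < pc_range[3])
            & (points[:, 1] >= pc_range[1]) & (points[:, 1] < pc_range[4])
            & (points[:, 2] >= pc_range[2]) & (points[:, 2] < pc_range[5]))
    return points[mask]

pts = np.array([[10.0, 0.0, -1.0, 0.3],   # inside the range: kept
                [-5.0, 0.0, -1.0, 0.1]])  # behind the sensor (x < 0): dropped
print(points_range_filter(pts, pc_range).shape)  # (1, 4)
```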
Abyssaledge commented 2 years ago

It's weird, because the point_cloud_range looks right. If I were you, I would visualize the point cloud to see what it actually looks like.

QYChan commented 2 years ago

I used mayavi to visualize this bin file. It looks normal. (screenshot: pointcloud)

Abyssaledge commented 2 years ago

How about debugging inside the point filter function step by step to see what actually happens?
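Before stepping through the filter line by line, a quick sanity check is to compare the cloud's per-axis extents against the configured range (a hypothetical helper, not part of the SST codebase):

```python
import numpy as np

def diagnose_range(points, pc_range):
    """Print per-axis extents of the cloud and whether they overlap pc_range."""
    lo = points[:, :3].min(axis=0)
    hi = points[:, :3].max(axis=0)
    rng_lo, rng_hi = np.asarray(pc_range[:3]), np.asarray(pc_range[3:])
    overlaps = bool(np.all((lo < rng_hi) & (hi > rng_lo)))
    print("cloud min:", lo, "max:", hi, "overlaps configured range:", overlaps)
    return overlaps

pc_range = [0.0, -39.68, -3.0, 69.12, 39.68, 1.0]
pts = np.array([[10.0, 0.0, -1.0, 0.3], [30.0, 5.0, 0.5, 0.2]])
diagnose_range(pts, pc_range)  # True: extents overlap the range
```

If this reports no overlap for a cloud that visualizes normally, the coordinates were likely corrupted somewhere between loading and filtering (for example by a wrong dtype or an unintended transform).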


Abyssaledge commented 2 years ago

Has your problem been solved? @QYChan

QYChan commented 2 years ago

It is quite confusing. This time the file 006279.bin was processed normally, but the same error appeared for the file 001156.bin. I did not modify any code; I just rebooted the machine. It looks like random behavior.
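For what it's worth, "CUDA error: invalid configuration argument" is consistent with launching a kernel over zero elements (the grid size becomes 0). Whatever the root cause of the intermittently empty clouds, a defensive guard before the voxelization layer turns the hard crash into a recoverable condition. This is only a sketch with NumPy stand-ins and hypothetical names, not the real mmdet3d layer:

```python
import numpy as np

def safe_voxelize(points, voxelize_fn):
    """Skip the voxelization kernel entirely for empty inputs."""
    if points.shape[0] == 0:
        # return an empty coordinate array with the expected (N, 3) layout
        return np.zeros((0, 3), dtype=np.int32)
    return voxelize_fn(points)

# stand-in for a dynamic voxelization layer with 0.2 m voxels
dummy_layer = lambda pts: np.floor(pts[:, :3] / 0.2).astype(np.int32)

print(safe_voxelize(np.zeros((0, 4)), dummy_layer).shape)  # (0, 3)
```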