thangvubk / SoftGroup

[CVPR 2022 Oral] SoftGroup for Instance Segmentation on 3D Point Clouds
MIT License

"AssertionError: empty batch" error when training your own dataset #181

Closed huanghuang113 closed 10 months ago

huanghuang113 commented 11 months ago

Dear Thang Vu, thank you very much for your contribution to 3D point cloud instance segmentation. I am having some problems with your model and would like your advice. Data description: my data is a set of plant point clouds with only two semantic classes, leaf and stalk. I ran into problems during training after processing my dataset following the STPLS3D preprocessing:

2023-08-01 14:09:55,900 - INFO - Config:

```yaml
model:
  channels: 32
  num_blocks: 3
  semantic_classes: 2
  instance_classes: 2
  sem2ins_classes: []
  semantic_only: True
  semantic_weight: [1.0, 1.0]  # set per class; two classes here, both weighted 1.0
  with_coords: False
  ignore_label: -100
  grouping_cfg:
    score_thr: 0.2
    radius: 0.04  # reduced to fit the scale of this dataset
    mean_active: 300
    class_numpoint_mean: [1.0, 13624.0]  # mean number of points per class; two classes here
    npoint_thr: 0.05
    ignore_classes: []
  instance_voxel_cfg:
    scale: 50  # reduced to fit the scale of this dataset
    spatial_shape: 20
  train_cfg:
    max_proposal_num: 200
    pos_iou_thr: 0.5
  test_cfg:
    x4_split: False
    cls_score_thr: 0.001
    mask_score_thr: -0.5
    min_npoint: 100
    eval_tasks: ['semantic']

data:
  train:
    type: 'plant'
    data_root: 'dataset/plant'
    prefix: 'train'
    suffix: '.pth'
    training: True
    repeat: 5  # number of repeats
    voxel_cfg:
      scale: 50  # reduced to fit the scale of this dataset
      spatial_shape: [128, 512]  # adjusted voxel grid size
      max_npoint: 250000  # adjusted maximum number of points per crop
      min_npoint: 5000  # adjusted minimum number of points per crop
  test:
    type: 'plant'
    data_root: 'dataset/plant'
    prefix: 'val_250m'
    suffix: '.pth'
    training: False
    voxel_cfg:
      scale: 50  # reduced to fit the scale of this dataset
      spatial_shape: [128, 512]  # adjusted voxel grid size
      max_npoint: 250000  # adjusted maximum number of points per crop
      min_npoint: 5000  # adjusted minimum number of points per crop

dataloader:
  train:
    batch_size: 2
    num_workers: 4
  test:
    batch_size: 1
    num_workers: 1

optimizer:
  type: 'Adam'
  lr: 0.004

fp16: False
epochs: 5
step_epoch: 0
save_freq: 2
pretrain: ''
work_dir: ''
```

```
2023-08-01 14:09:55,900 - INFO - Distributed: False
2023-08-01 14:09:55,900 - INFO - Mix precision training: False
2023-08-01 14:10:01,103 - INFO - Load train dataset: 1720 scans
2023-08-01 14:10:01,103 - INFO - Load test dataset: 86 scans
2023-08-01 14:10:01,104 - INFO - Training
2023-08-01 14:10:01,695 - INFO - batch is truncated from size 2 to 1
2023-08-01 14:10:03,778 - INFO - batch is truncated from size 2 to 1
2023-08-01 14:10:04,734 - INFO - batch is truncated from size 2 to 1
2023-08-01 14:10:05,422 - INFO - batch is truncated from size 2 to 1
2023-08-01 14:10:05,779 - INFO - Epoch [1/5][10/860] lr: 0.004, eta: 0:33:24, mem: 449, data_time: 0.00, iter_time: 0.05, semantic_loss: 0.5232, offset_loss: 0.1780, loss: 0.7011
2023-08-01 14:10:05,920 - INFO - batch is truncated from size 2 to 1
2023-08-01 14:10:07,859 - INFO - batch is truncated from size 2 to 1
2023-08-01 14:10:09,642 - INFO - batch is truncated from size 2 to 1
2023-08-01 14:10:09,860 - INFO - batch is truncated from size 2 to 1
2023-08-01 14:10:10,891 - INFO - batch is truncated from size 2 to 1
2023-08-01 14:10:11,196 - INFO - Epoch [1/5][20/860] lr: 0.004, eta: 0:35:39, mem: 449, data_time: 1.11, iter_time: 1.30, semantic_loss: 0.3730, offset_loss: 0.2148, loss: 0.5878
2023-08-01 14:10:15,136 - INFO - batch is truncated from size 2 to 1
2023-08-01 14:10:15,641 - INFO - Epoch [1/5][30/860] lr: 0.004, eta: 0:34:20, mem: 449, data_time: 0.00, iter_time: 0.13, semantic_loss: 0.1968, offset_loss: 0.2220, loss: 0.4187
2023-08-01 14:10:15,793 - INFO - batch is truncated from size 2 to 1
2023-08-01 14:10:16,407 - INFO - batch is truncated from size 2 to 1
2023-08-01 14:10:20,568 - INFO - batch is truncated from size 2 to 1
2023-08-01 14:10:20,630 - INFO - batch is truncated from size 2 to 1
2023-08-01 14:10:20,661 - INFO - Epoch [1/5][40/860] lr: 0.004, eta: 0:34:40, mem: 449, data_time: 3.90, iter_time: 3.96, semantic_loss: 0.4106, offset_loss: 0.2034, loss: 0.6140
2023-08-01 14:10:20,688 - INFO - batch is truncated from size 2 to 1
Traceback (most recent call last):
  File "/GitProject/SoftGroup/tools/train.py", line 207, in <module>
    main()
  File "/GitProject/SoftGroup/tools/train.py", line 200, in main
    train(epoch, model, optimizer, scaler, train_loader, cfg, logger, writer)
  File "/GitProject/SoftGroup/tools/train.py", line 44, in train
    for i, batch in enumerate(train_loader, start=1):
  File "/home/hhroot/anaconda3/envs/softg/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 530, in __next__
    data = self._next_data()
  File "/home/hhroot/anaconda3/envs/softg/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1204, in _next_data
    return self._process_data(data)
  File "/home/hhroot/anaconda3/envs/softg/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1250, in _process_data
    data.reraise()
  File "/home/hhroot/anaconda3/envs/softg/lib/python3.7/site-packages/torch/_utils.py", line 457, in reraise
    raise exception
AssertionError: Caught AssertionError in DataLoader worker process 2.
Original Traceback (most recent call last):
  File "/home/hhroot/anaconda3/envs/softg/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/hhroot/anaconda3/envs/softg/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
    return self.collate_fn(data)
  File "/GitProject/SoftGroup/softgroup/data/custom.py", line 222, in collate_fn
    assert batch_id > 0, 'empty batch'
AssertionError: empty batch

ERROR conda.cli.main_run:execute(47): conda run python /GitProject/SoftGroup/tools/train.py ../configs/softgroup/softgroup_my_dataset_backbone.yaml failed. (See above for error)
```

I have tried changing my parameters, but I have not been able to solve the problem, so I hope you can give me some advice!
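For context on where the assertion fires: it comes from `collate_fn` in `softgroup/data/custom.py`, which drops samples whose random crop keeps too few points. Below is a minimal sketch of that pattern, simplified and with hypothetical structure, not the repo's exact code:

```python
# Simplified sketch of the failing pattern (hypothetical; the real logic is
# in softgroup/data/custom.py). Each dataset item is None when its random
# crop keeps fewer than min_npoint points. If only some items of a batch are
# None, the batch is truncated; if all of them are, nothing is left to
# collate and the assertion aborts training.
def collate_fn(batch):
    batch_id = 0
    for sample in batch:
        if sample is None:  # crop failed for this scan
            continue
        batch_id += 1  # count the surviving samples
        # ... the real code stacks coords/feats/labels here ...
    if 0 < batch_id < len(batch):
        print(f'batch is truncated from size {len(batch)} to {batch_id}')
    assert batch_id > 0, 'empty batch'

collate_fn([None, None])  # reproduces: AssertionError: empty batch
```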

thangvubk commented 10 months ago

This indicates that none of the training samples in the batch are valid after cropping (see here). It usually depends on the point density. Please check it on your own data.
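One quick check is to count the points per preprocessed scan and compare against the train `min_npoint` (5000 in the config above). A hedged sketch, assuming the STPLS3D-style `.pth` layout of `(coords, colors, semantic_labels, instance_labels)`; adjust the glob and the unpacking if your files differ:

```python
import glob

import torch

# Count points per preprocessed scan and flag scans smaller than the
# training min_npoint (5000 in the config above). Assumes each file was
# saved as torch.save((coords, colors, sem_labels, inst_labels), path).
MIN_NPOINT = 5000

for path in sorted(glob.glob('dataset/plant/train/*.pth')):
    coords = torch.load(path)[0]  # xyz array of shape (N, 3)
    n = len(coords)
    status = 'TOO SMALL' if n < MIN_NPOINT else 'ok'
    print(f'{path}: {n} points [{status}]')
```

Scans flagged as too small (or whose crops routinely fall below the threshold) are the ones that get dropped and can leave a batch empty.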

huanghuang113 commented 10 months ago

Thanks for the suggestion, I will try it!

wdczz commented 8 months ago

> This indicates that none of the training samples in the batch are valid after cropping (see here). It usually depends on the point density. Please check it on your own data.

But I am using the default config for S3DIS, and it has the same problem. May I lower the value?

thangvubk commented 8 months ago

Yes. I think you can change `min_npoint` to a lower value or increase the batch size.
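For example, against the config posted above (the values are illustrative only; pick them based on the point counts in your own data):

```yaml
# Illustrative tweak, not a recommended setting: lower min_npoint so more
# crops survive, and/or raise batch_size so an all-empty batch is less likely.
data:
  train:
    voxel_cfg:
      min_npoint: 1000   # was 5000

dataloader:
  train:
    batch_size: 4        # was 2
```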