thangvubk / SoftGroup

[CVPR 2022 Oral] SoftGroup for Instance Segmentation on 3D Point Clouds
MIT License

Error when training on my own data #175

Open qzsrh opened 1 year ago

qzsrh commented 1 year ago

Description of data:

The data is outdoor data, and the labels contain only the "people" category. The original data follows the same format as S3DIS. I put the data into the SoftGroup/dataset/s3dis/ folder and ran:

cd SoftGroup/dataset/s3dis
bash prepare_data.sh
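For anyone checking the same setup: below is a small, hypothetical sanity-check script (not part of the repo) for the preprocessed files. It assumes each *_inst_nostuff.pth written by prepare_data.sh stores a (coords, colors, semantic_labels, instance_labels) tuple, as the stock S3DIS pipeline does; with semantic_classes: 2, the semantic labels should only contain 0, 1, and the ignore value -100.

```python
# check_preprocess.py -- hypothetical helper, not part of SoftGroup.
# Assumes each *_inst_nostuff.pth stores (coords, colors, semantic_labels, instance_labels).
import glob

import numpy as np
import torch

for path in sorted(glob.glob('dataset/s3dis/preprocess/*_inst_nostuff.pth')):
    coords, colors, sem_labels, inst_labels = torch.load(path)
    sem_labels = np.asarray(sem_labels)
    inst_labels = np.asarray(inst_labels)
    print(f'{path}: {len(coords)} points, '
          f'semantic labels {np.unique(sem_labels).tolist()}, '
          f'{len(np.unique(inst_labels))} instance ids')
```

Scenes with very few points or unexpected label values make the dataloader errors reported further down much easier to explain.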

Then I changed softgroup_s3dis_backbone_fold5.yaml as follows:

model:
  channels: 32
  num_blocks: 7
  semantic_classes: 2
  instance_classes: 2
  sem2ins_classes: []
  semantic_only: True
  ignore_label: -100
  grouping_cfg:
    score_thr: 0.2
    radius: 0.04
    mean_active: 300
    class_numpoint_mean: [12210, 39796]
    npoint_thr: 0.05  # absolute if class_numpoint == -1, relative if class_numpoint != -1
    ignore_classes: []
  instance_voxel_cfg:
    scale: 20
    spatial_shape: 20
  train_cfg:
    max_proposal_num: 200
    pos_iou_thr: 0.5
  test_cfg:
    x4_split: True
    cls_score_thr: 0.001
    mask_score_thr: -0.5
    min_npoint: 100
    eval_tasks: ['semantic']
  fixed_modules: []

data:
  train:
    type: 's3dis'
    data_root: 'dataset/s3dis/preprocess'
    prefix: ['Area_1', 'Area_2', 'Area_3', 'Area_4']
    suffix: '_inst_nostuff.pth'
    repeat: 20
    training: True
    voxel_cfg:
      scale: 20
      spatial_shape: [128, 512]
      max_npoint: 250000
      min_npoint: 5000
  test:
    type: 's3dis'
    data_root: 'dataset/s3dis/preprocess'
    prefix: 'Area_5'
    suffix: '_inst_nostuff.pth'
    training: False
    voxel_cfg:
      scale: 20
      spatial_shape: [128, 512]
      max_npoint: 250000
      min_npoint: 5000

dataloader:
  train:
    batch_size: 6
    num_workers: 6
  test:
    batch_size: 1
    num_workers: 1

optimizer:
  type: 'Adam'
  lr: 0.002

fp16: False
epochs: 5
step_epoch: 0
save_freq: 2
pretrain: './hais_ckpt_spconv2.pth'
work_dir: ''
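Before launching, it can also help to confirm that the edited YAML still parses and that the preprocessed files actually match the train prefixes; if only a handful of files start with Area_1 to Area_4, most batches are built from very little data. A minimal sketch, assuming the paths from the config above (the script name is made up):

```python
# check_split.py -- hypothetical sketch based on the config shown above.
import glob
import os

import yaml

with open('configs/softgroup/softgroup_s3dis_backbone_fold5.yaml') as f:
    cfg = yaml.safe_load(f)

train = cfg['data']['train']
files = glob.glob(os.path.join(train['data_root'], '*' + train['suffix']))
matched = [p for p in files if os.path.basename(p).startswith(tuple(train['prefix']))]
print(f"{len(matched)} of {len(files)} preprocessed files match the train prefixes {train['prefix']}")
print('train voxel_cfg:', train['voxel_cfg'])
```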

Then I ran:

./tools/dist_train.sh configs/softgroup/softgroup_s3dis_backbone_fold5.yaml 2 --skip_validate

and got the following output:

2023-05-04 16:35:52,027 - INFO - Training
2023-05-04 16:36:23,085 - INFO - batch is truncated from size 6 to 4
2023-05-04 16:36:24,014 - INFO - batch is truncated from size 6 to 2
2023-05-04 16:36:24,214 - INFO - batch is truncated from size 6 to 2
2023-05-04 16:36:27,134 - INFO - batch is truncated from size 6 to 2
2023-05-04 16:36:28,194 - INFO - batch is truncated from size 6 to 3
2023-05-04 16:36:36,094 - INFO - batch is truncated from size 6 to 3
2023-05-04 16:36:50,126 - INFO - Reducer buckets have been rebuilt in this iteration.
2023-05-04 16:36:50,127 - INFO - Reducer buckets have been rebuilt in this iteration.
2023-05-04 16:36:54,860 - INFO - batch is truncated from size 6 to 4
2023-05-04 16:37:01,948 - INFO - batch is truncated from size 6 to 2
2023-05-04 16:37:03,597 - INFO - batch is truncated from size 6 to 2
2023-05-04 16:37:04,010 - INFO - batch is truncated from size 6 to 4
2023-05-04 16:37:04,346 - INFO - batch is truncated from size 6 to 1
2023-05-04 16:37:13,446 - INFO - batch is truncated from size 6 to 4
2023-05-04 16:37:14,611 - INFO - Epoch [1/5][10/1921] lr: 0.002, eta: 22:00:35, mem: 2664, data_time: 0.00, iter_time: 0.32, semantic_loss: 0.2802, offset_loss: 0.1478, loss: 0.4279
2023-05-04 16:37:28,037 - INFO - batch is truncated from size 6 to 1
2023-05-04 16:37:35,358 - INFO - batch is truncated from size 6 to 3
2023-05-04 16:37:35,713 - INFO - batch is truncated from size 6 to 4
2023-05-04 16:37:38,783 - INFO - batch is truncated from size 6 to 2
2023-05-04 16:37:45,560 - INFO - batch is truncated from size 6 to 3
2023-05-04 16:38:03,211 - INFO - batch is truncated from size 6 to 4
2023-05-04 16:38:11,835 - INFO - batch is truncated from size 6 to 1
2023-05-04 16:38:12,511 - INFO - batch is truncated from size 6 to 2
2023-05-04 16:38:15,425 - INFO - batch is truncated from size 6 to 5
2023-05-04 16:38:16,396 - INFO - batch is truncated from size 6 to 2
2023-05-04 16:38:21,079 - INFO - batch is truncated from size 6 to 5
2023-05-04 16:38:21,660 - INFO - Epoch [1/5][20/1921] lr: 0.002, eta: 19:55:10, mem: 2664, data_time: 0.00, iter_time: 0.27, semantic_loss: 0.0984, offset_loss: 0.1379, loss: 0.2363
2023-05-04 16:38:28,258 - INFO - batch is truncated from size 6 to 5
2023-05-04 16:38:48,085 - INFO - batch is truncated from size 6 to 2
2023-05-04 16:38:50,249 - INFO - batch is truncated from size 6 to 3
2023-05-04 16:38:51,320 - INFO - batch is truncated from size 6 to 4
2023-05-04 16:38:52,730 - INFO - batch is truncated from size 6 to 5
2023-05-04 16:38:53,705 - INFO - batch is truncated from size 6 to 2
2023-05-04 16:38:59,308 - INFO - Epoch [1/5][30/1921] lr: 0.002, eta: 16:36:12, mem: 2726, data_time: 0.00, iter_time: 0.17, semantic_loss: 0.0437, offset_loss: 0.1729, loss: 0.2166
2023-05-04 16:39:03,889 - INFO - batch is truncated from size 6 to 3
2023-05-04 16:39:23,734 - INFO - batch is truncated from size 6 to 3
2023-05-04 16:39:28,241 - INFO - batch is truncated from size 6 to 3
2023-05-04 16:39:32,227 - INFO - batch is truncated from size 6 to 4
2023-05-04 16:39:34,514 - INFO - batch is truncated from size 6 to 1
2023-05-04 16:39:34,959 - INFO - batch is truncated from size 6 to 3
2023-05-04 16:39:36,738 - INFO - batch is truncated from size 6 to 4
2023-05-04 16:40:02,695 - INFO - batch is truncated from size 6 to 2
2023-05-04 16:40:04,603 - INFO - batch is truncated from size 6 to 4
2023-05-04 16:40:08,027 - INFO - batch is truncated from size 6 to 4
2023-05-04 16:40:09,246 - INFO - batch is truncated from size 6 to 4
2023-05-04 16:40:09,419 - INFO - batch is truncated from size 6 to 4
2023-05-04 16:40:10,507 - INFO - Epoch [1/5][40/1921] lr: 0.002, eta: 17:10:08, mem: 2726, data_time: 0.00, iter_time: 0.15, semantic_loss: 0.0257, offset_loss: 0.0000, loss: 0.0257
2023-05-04 16:40:14,869 - INFO - batch is truncated from size 6 to 3
2023-05-04 16:40:35,827 - INFO - batch is truncated from size 6 to 5
2023-05-04 16:40:37,720 - INFO - batch is truncated from size 6 to 3
2023-05-04 16:40:43,784 - INFO - batch is truncated from size 6 to 4
2023-05-04 16:40:44,789 - INFO - batch is truncated from size 6 to 4
2023-05-04 16:40:45,510 - INFO - batch is truncated from size 6 to 3
2023-05-04 16:40:51,487 - INFO - batch is truncated from size 6 to 3
2023-05-04 16:41:07,518 - INFO - batch is truncated from size 6 to 4
2023-05-04 16:41:10,006 - INFO - batch is truncated from size 6 to 1
2023-05-04 16:41:19,536 - INFO - batch is truncated from size 6 to 1
2023-05-04 16:41:24,831 - INFO - batch is truncated from size 6 to 3
2023-05-04 16:41:28,583 - INFO - Epoch [1/5][50/1921] lr: 0.002, eta: 17:51:55, mem: 2726, data_time: 0.00, iter_time: 0.20, semantic_loss: 0.0183, offset_loss: 0.1210, loss: 0.1393
2023-05-04 16:41:29,044 - INFO - batch is truncated from size 6 to 3
2023-05-04 16:41:30,164 - INFO - batch is truncated from size 6 to 3
2023-05-04 16:41:41,683 - INFO - batch is truncated from size 6 to 2
2023-05-04 16:41:50,007 - INFO - batch is truncated from size 6 to 2
2023-05-04 16:41:52,864 - INFO - batch is truncated from size 6 to 3
2023-05-04 16:41:58,937 - INFO - batch is truncated from size 6 to 2
2023-05-04 16:42:05,597 - INFO - batch is truncated from size 6 to 2
2023-05-04 16:42:09,111 - INFO - batch is truncated from size 6 to 3
2023-05-04 16:42:13,219 - INFO - Epoch [1/5][60/1921] lr: 0.002, eta: 16:50:37, mem: 2726, data_time: 0.00, iter_time: 0.13, semantic_loss: 0.0136, offset_loss: 0.0000, loss: 0.0136
2023-05-04 16:42:16,768 - INFO - batch is truncated from size 6 to 2
2023-05-04 16:42:17,744 - INFO - batch is truncated from size 6 to 4
2023-05-04 16:42:29,310 - INFO - batch is truncated from size 6 to 2
2023-05-04 16:42:31,609 - INFO - batch is truncated from size 6 to 3
2023-05-04 16:42:41,886 - INFO - batch is truncated from size 6 to 2
2023-05-04 16:42:44,707 - INFO - batch is truncated from size 6 to 2
2023-05-04 16:42:46,316 - INFO - batch is truncated from size 6 to 4
2023-05-04 16:42:53,790 - INFO - batch is truncated from size 6 to 3
2023-05-04 16:42:59,427 - INFO - batch is truncated from size 6 to 3
2023-05-04 16:43:03,653 - INFO - batch is truncated from size 6 to 1
2023-05-04 16:43:14,362 - INFO - batch is truncated from size 6 to 2
2023-05-04 16:43:14,458 - INFO - batch is truncated from size 6 to 3
2023-05-04 16:43:22,210 - INFO - batch is truncated from size 6 to 2
2023-05-04 16:43:29,514 - INFO - Epoch [1/5][70/1921] lr: 0.002, eta: 17:18:35, mem: 2726, data_time: 0.00, iter_time: 0.15, semantic_loss: 0.0109, offset_loss: 0.0000, loss: 0.0109
2023-05-04 16:43:31,523 - INFO - batch is truncated from size 6 to 2
2023-05-04 16:43:40,498 - INFO - batch is truncated from size 6 to 5
2023-05-04 16:43:41,620 - INFO - batch is truncated from size 6 to 4
2023-05-04 16:43:55,737 - INFO - batch is truncated from size 6 to 3
Traceback (most recent call last):
  File "./tools/train.py", line 206, in <module>
    main()
  File "./tools/train.py", line 199, in main
    train(epoch, model, optimizer, scaler, train_loader, cfg, logger, writer)
  File "./tools/train.py", line 44, in train
    for i, batch in enumerate(train_loader, start=1):
  File "/home/odrobot/anaconda3/envs/softgroup/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 530, in __next__
    data = self._next_data()
  File "/home/odrobot/anaconda3/envs/softgroup/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1224, in _next_data
    return self._process_data(data)
  File "/home/odrobot/anaconda3/envs/softgroup/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1250, in _process_data
    data.reraise()
  File "/home/odrobot/anaconda3/envs/softgroup/lib/python3.7/site-packages/torch/_utils.py", line 457, in reraise
    raise exception
AssertionError: Caught AssertionError in DataLoader worker process 1.
Original Traceback (most recent call last):
  File "/home/odrobot/anaconda3/envs/softgroup/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/odrobot/anaconda3/envs/softgroup/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
    return self.collate_fn(data)
  File "/home/odrobot/workplace/zs_workplace/softgroup_outdoor/SoftGroup/softgroup/data/s3dis.py", line 82, in collate_fn
    return super().collate_fn(batch)
  File "/home/odrobot/workplace/zs_workplace/softgroup_outdoor/SoftGroup/softgroup/data/custom.py", line 222, in collate_fn
    assert batch_id > 0, 'empty batch'
AssertionError: empty batch

WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1180 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1179) of binary: /home/odrobot/anaconda3/envs/softgroup/bin/python
Traceback (most recent call last):
  File "/home/odrobot/anaconda3/envs/softgroup/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.11.0', 'console_scripts', 'torchrun')())
  File "/home/odrobot/anaconda3/envs/softgroup/lib/python3.7/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/odrobot/anaconda3/envs/softgroup/lib/python3.7/site-packages/torch/distributed/run.py", line 724, in main
    run(args)
  File "/home/odrobot/anaconda3/envs/softgroup/lib/python3.7/site-packages/torch/distributed/run.py", line 718, in run
    )(*cmd_args)
  File "/home/odrobot/anaconda3/envs/softgroup/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/odrobot/anaconda3/envs/softgroup/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 247, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

./tools/train.py FAILED

Failures:

------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-05-04_16:44:04
  host      : odrobot
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 1179)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

I don't know where the problem is. I look forward to your help! Thank you very much!
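For later readers: the assertion comes from softgroup/data/custom.py (assert batch_id > 0, 'empty batch'), which fires when every sample in a batch has been rejected during loading, and the many "batch is truncated from size 6 to N" lines show that most samples were already being dropped before the crash. With outdoor data this often means the voxel_cfg values (scale, spatial_shape, max_npoint, min_npoint) do not match the extent and density of the scenes, so crops end up with too few points. A rough diagnostic sketch, under the same assumption about the .pth layout as above; the thresholds mirror the config, but the rejection logic here is only an approximation of the dataset code, not a copy of it:

```python
# crop_stats.py -- hypothetical sketch; approximates, does not reproduce, SoftGroup's crop.
import glob

import numpy as np
import torch

SCALE = 20           # voxel_cfg.scale from the config above
MIN_NPOINT = 5000    # voxel_cfg.min_npoint
MAX_NPOINT = 250000  # voxel_cfg.max_npoint

for path in sorted(glob.glob('dataset/s3dis/preprocess/*_inst_nostuff.pth')):
    coords = np.asarray(torch.load(path)[0])
    n = len(coords)
    extent = (coords.max(0) - coords.min(0)) * SCALE  # rough scene size in voxel units
    if n < MIN_NPOINT:
        note = 'fewer points than min_npoint -> likely rejected'
    elif n > MAX_NPOINT:
        note = 'above max_npoint -> will be cropped'
    else:
        note = 'ok'
    print(f'{path}: {n} points, voxelized extent ~{np.round(extent, 1)} -> {note}')
```

If every scene sampled into a batch falls into the "rejected" bucket, collate_fn is left with zero valid samples and raises exactly this "empty batch" assertion.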
huanghuang113 commented 11 months ago

How did you deal with this problem?

wangjuansan commented 10 months ago

May I ask how you dealt with this problem? I'm having the same issue.

thangvubk commented 10 months ago

Duplicate of https://github.com/thangvubk/SoftGroup/issues/181