thangvubk / SoftGroup

[CVPR 2022 Oral] SoftGroup for Instance Segmentation on 3D Point Clouds

Training batch is empty #191

Closed wdczz closed 8 months ago

wdczz commented 8 months ago

```
2023-10-31 06:41:44,913 - INFO - Epoch [19/20][890/1020] lr: 9.9e-05, eta: 0:51:48, mem: 11241, data_time: 4.89, iter_time: 5.45, semantic_loss: 0.3257, offset_loss: 0.4480, cls_loss: 0.4795, mask_loss: 0.3227, iou_score_loss: 0.0153, num_pos: 18.0000, num_neg: 29.0000, loss: 1.5912
2023-10-31 06:42:29,014 - INFO - Epoch [19/20][900/1020] lr: 9.9e-05, eta: 0:51:42, mem: 11241, data_time: 0.03, iter_time: 0.49, semantic_loss: 0.2589, offset_loss: 0.4198, cls_loss: 0.3547, mask_loss: 0.4894, iou_score_loss: 0.0110, num_pos: 16.0000, num_neg: 14.0000, loss: 1.5339
2023-10-31 06:42:58,843 - INFO - Epoch [19/20][910/1020] lr: 9.9e-05, eta: 0:51:18, mem: 11241, data_time: 0.01, iter_time: 0.65, semantic_loss: 0.3437, offset_loss: 0.5149, cls_loss: 0.4717, mask_loss: 0.2886, iou_score_loss: 0.0090, num_pos: 19.0000, num_neg: 28.5000, loss: 1.6280
2023-10-31 06:43:11,882 - INFO - Epoch [19/20][920/1020] lr: 9.9e-05, eta: 0:50:34, mem: 11241, data_time: 0.51, iter_time: 3.50, semantic_loss: 0.2511, offset_loss: 0.6011, cls_loss: 0.2802, mask_loss: 0.3404, iou_score_loss: 0.0096, num_pos: 15.0000, num_neg: 21.5000, loss: 1.4824
2023-10-31 06:43:57,692 - INFO - Epoch [19/20][930/1020] lr: 9.9e-05, eta: 0:50:29, mem: 11241, data_time: 6.67, iter_time: 7.08, semantic_loss: 0.2133, offset_loss: 0.4775, cls_loss: 0.3220, mask_loss: 0.2717, iou_score_loss: 0.0094, num_pos: 16.5000, num_neg: 16.0000, loss: 1.2939
2023-10-31 06:44:18,405 - INFO - Epoch [19/20][940/1020] lr: 9.9e-05, eta: 0:49:54, mem: 11241, data_time: 0.01, iter_time: 0.54, semantic_loss: 0.3968, offset_loss: 0.5065, cls_loss: 0.5000, mask_loss: 0.1897, iou_score_loss: 0.0119, num_pos: 8.5000, num_neg: 13.5000, loss: 1.6049
2023-10-31 06:45:06,014 - INFO - Epoch [19/20][950/1020] lr: 9.9e-05, eta: 0:49:50, mem: 11241, data_time: 5.65, iter_time: 6.15, semantic_loss: 0.7120, offset_loss: 0.4190, cls_loss: 0.4788, mask_loss: 0.2150, iou_score_loss: 0.0118, num_pos: 12.0000, num_neg: 22.5000, loss: 1.8367
2023-10-31 06:45:28,924 - INFO - Epoch [19/20][960/1020] lr: 9.9e-05, eta: 0:49:18, mem: 11241, data_time: 0.01, iter_time: 0.49, semantic_loss: 0.6841, offset_loss: 0.5910, cls_loss: 0.1762, mask_loss: 0.2319, iou_score_loss: 0.0073, num_pos: 15.0000, num_neg: 27.0000, loss: 1.6905
Traceback (most recent call last):
  File "./tools/train.py", line 212, in <module>
    main()
  File "./tools/train.py", line 203, in main
    train(epoch, model, optimizer, scaler, train_loader, cfg, logger, writer)
  File "./tools/train.py", line 48, in train
    for i, batch in enumerate(train_loader, start=1):
  File "/home/dell/anaconda3/envs/InstanceSegment37/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
    data = self._next_data()
  File "/home/dell/anaconda3/envs/InstanceSegment37/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1333, in _next_data
    return self._process_data(data)
  File "/home/dell/anaconda3/envs/InstanceSegment37/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1359, in _process_data
    data.reraise()
  File "/home/dell/anaconda3/envs/InstanceSegment37/lib/python3.7/site-packages/torch/_utils.py", line 543, in reraise
    raise exception
AssertionError: Caught AssertionError in DataLoader worker process 3.
Original Traceback (most recent call last):
  File "/home/dell/anaconda3/envs/InstanceSegment37/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/dell/anaconda3/envs/InstanceSegment37/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 61, in fetch
    return self.collate_fn(data)
  File "/media/dell/whd/WDC/SoftGroup/softgroup/data/s3dis.py", line 82, in collate_fn
    return super().collate_fn(batch)
  File "/media/dell/whd/WDC/SoftGroup/softgroup/data/custom.py", line 222, in collate_fn
    assert batch_id > 0, 'empty batch'
AssertionError: empty batch

WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 496326 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 496327) of binary: /home/dell/anaconda3/envs/InstanceSegment37/bin/python
Traceback (most recent call last):
  File "/home/dell/anaconda3/envs/InstanceSegment37/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==1.13.1', 'console_scripts', 'torchrun')())
  File "/home/dell/anaconda3/envs/InstanceSegment37/lib/python3.7/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/dell/anaconda3/envs/InstanceSegment37/lib/python3.7/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/home/dell/anaconda3/envs/InstanceSegment37/lib/python3.7/site-packages/torch/distributed/run.py", line 756, in run
    )(*cmd_args)
  File "/home/dell/anaconda3/envs/InstanceSegment37/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/dell/anaconda3/envs/InstanceSegment37/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 248, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
```

I don't know why this problem comes up. Is there something wrong with my S3DIS dataset? I had already applied the fix from #51.
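For reference, the traceback shows the assertion fires in `softgroup/data/custom.py`'s `collate_fn` when `batch_id` is still 0, i.e. no sample in the batch made it into the collated output, presumably because every scene in that batch was skipped. To rule out a broken dataset, a quick check is to confirm that every preprocessed scene actually contains points. This is my own sanity-check sketch, not part of SoftGroup; it assumes each preprocessed scene is a `.pth` file whose first element is the coordinate array and that the data lives under a hypothetical `dataset/s3dis/preprocess` folder, so adjust both to your setup.

```python
# Sanity-check sketch (not part of SoftGroup): report any preprocessed scene with zero points.
# Assumption: each .pth file stores a tuple whose first element is the coords array.
import glob
import torch

DATA_ROOT = 'dataset/s3dis/preprocess'  # hypothetical path, change to your preprocess output dir

for path in sorted(glob.glob(f'{DATA_ROOT}/*.pth')):
    data = torch.load(path)
    coords = data[0]
    n_points = len(coords)
    if n_points == 0:
        print(f'EMPTY SCENE: {path}')
    else:
        print(f'{path}: {n_points} points')
```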

thangvubk commented 8 months ago

I am not sure what the problem is. Could you retrain it with another seed?
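In case it helps, a generic way to change the seed for a rerun is shown below. This is plain Python/NumPy/PyTorch seeding, not a SoftGroup-specific option; whether and where to call it from `tools/train.py` is an assumption on my part.

```python
# Generic seeding helper (stdlib/NumPy/PyTorch only, not SoftGroup-specific).
import random

import numpy as np
import torch


def set_seed(seed: int) -> None:
    """Seed all common RNGs so a retrain uses a different (or reproducible) random stream."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)


# e.g. call set_seed(123) once near the start of main() before building the dataloaders
```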

wdczz commented 8 months ago

> I am not sure what the problem is. Could you retrain it with another seed?

I will try to deal with it as in #181. If this training succeeds, I will close the issue with this comment.
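For anyone hitting the same crash, one possible stopgap (my own sketch, not necessarily what #181 does) is to catch the `empty batch` AssertionError around the dataloader iteration and skip that batch instead of letting the whole run die; the loop mirrors the `for i, batch in enumerate(train_loader, start=1)` line from the traceback.

```python
# Illustrative workaround sketch, not the fix from #181: skip batches whose
# collate failed with the 'empty batch' assertion instead of crashing.
def iterate_skipping_empty(train_loader):
    it = iter(train_loader)
    while True:
        try:
            yield next(it)
        except StopIteration:
            return
        except AssertionError as e:
            # The reraised worker exception message contains the original
            # "AssertionError: empty batch" text.
            if 'empty batch' in str(e):
                print('Skipping an empty batch')
                continue
            raise


# usage sketch inside train():
# for i, batch in enumerate(iterate_skipping_empty(train_loader), start=1):
#     ...
```

Note that in multi-GPU training, skipping a batch on one rank while the others proceed can desynchronize collective calls, so this is only a debugging stopgap, not a real fix.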