guochengqian / PointNeXt

[NeurIPS'22] PointNeXt: Revisiting PointNet++ with Improved Training and Scaling Strategies
https://guochengqian.github.io/PointNeXt/
MIT License

RuntimeError: The size of tensor a (719348) must match the size of tensor b (1438695) at non-singleton dimension 0 #25

Closed whuhxb closed 1 year ago

whuhxb commented 2 years ago

Hi @guochengqian

Have you ever met this bug using two GPU cards? Thanks.

100%|██████
[07/26 11:32:16 S3DIS]: Epoch 100 LR 0.000012 train_miou 95.15, val_miou 68.49, best val miou 69.55
100%|██████████| 34/34 [00:41<00:00, 1.22s/it]
[07/26 11:32:18 S3DIS]: Best ckpt @E68, val_oa 89.78, val_macc 76.22, val_miou 69.55, iou per cls is: [93.14 97.92 83.66 0. 43.12 54.75 75.45 81.45 90.8 74.57 75.66 73.42 60.24]
[07/26 11:32:18 S3DIS]: Successful Loading the ckpt from log/s3dis/s3dis-train-pointnext-xl-ngpus2-seed7272-20220725-213543-eYW6GZURAs6oyghwFxnAPs/checkpoint/s3dis-train-pointnext-xl-ngpus2-seed7272-20220725-213543-eYW6GZURAs6oyghwFxnAPs_ckpt_best.pth
[07/26 11:32:18 S3DIS]: ckpts @ 68 epoch( {'best_val': 69.55094146728516} )
0%| | 0/68 [00:00<?, ?it/s]
0%| | 0/68 [00:27<?, ?it/s]

Traceback (most recent call last):
  File "examples/segmentation/main.py", line 529, in <module>
    mp.spawn(main, nprocs=cfg.world_size, args=(cfg,))  # original args=(cfg), run with bugs, should be args=(cfg,)
  File "/export/home/myname/anaconda3/envs/openpoints/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/export/home/myname/anaconda3/envs/openpoints/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/export/home/myname/anaconda3/envs/openpoints/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/export/home/myname/anaconda3/envs/openpoints/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/export/home/myname/Documents/PointNeXt_code/PointNeXt/examples/segmentation/main.py", line 211, in main
    test_miou, test_macc, test_oa, test_ious, test_accs, _ = test_entire_room(model, cfg.dataset.common.test_area, cfg)
  File "/export/home/myname/anaconda3/envs/openpoints/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/export/home/myname/Documents/PointNeXt_code/PointNeXt/examples/segmentation/main.py", line 452, in test_entire_room
    cm.update(all_logits.argmax(dim=1), label)
  File "/export/home/myname/anaconda3/envs/openpoints/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/export/home/myname/Documents/PointNeXt_code/PointNeXt/examples/segmentation/../../openpoints/utils/metrics.py", line 69, in update
    unique_mapping = true.flatten() * self.virtual_num_classes + pred.flatten()
RuntimeError: The size of tensor a (719348) must match the size of tensor b (1438695) at non-singleton dimension 0
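For context, the line that fails builds a confusion matrix by mapping each (label, prediction) pair to a unique bin `true * C + pred` and counting the bins. The elementwise `*`/`+` requires `true` and `pred` to have the same number of elements, which breaks when the predictions are accumulated across two GPUs (twice as many points as labels, hence 1438695 vs 719348). A minimal NumPy sketch of the same trick, not the repository's actual `metrics.py` code (NumPy raises `ValueError` for the mismatch where PyTorch raises the `RuntimeError` in the title):

```python
import numpy as np

def update_confusion(true, pred, num_classes):
    """Confusion matrix via the flattening trick: bin index = true * C + pred."""
    unique_mapping = true.flatten() * num_classes + pred.flatten()
    counts = np.bincount(unique_mapping, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)

num_classes = 13  # S3DIS has 13 semantic classes

# Matching sizes: works as intended.
true = np.array([0, 1, 2])
pred = np.array([0, 1, 1])
cm = update_confusion(true, pred, num_classes)

# Mismatched sizes (e.g. logits gathered from 2 GPUs but labels from 1):
# the elementwise add fails, just like in the reported traceback.
mismatch_detected = False
try:
    update_confusion(np.zeros(3, dtype=np.int64), np.zeros(6, dtype=np.int64), num_classes)
except ValueError:
    mismatch_detected = True
```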

haibo-qiu commented 2 years ago

As @xindeng98 mentioned in https://github.com/guochengqian/PointNeXt/issues/18#issuecomment-1182670679, this error is caused by the fact that test_entire_room does not support multi-GPU testing.
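Besides launching the whole run on a single GPU, a common workaround for evaluation code that is not distributed-aware is to run it only on rank 0 and let the other ranks skip it. A minimal sketch of that guard; the names `run_eval_on_rank0` and `evaluate` are hypothetical, not functions from examples/segmentation/main.py:

```python
def run_eval_on_rank0(rank, evaluate):
    """Run a single-process evaluation callable only on rank 0.

    Other ranks return None and would normally wait at the next
    collective op (e.g. a barrier) in a real distributed job.
    """
    if rank == 0:
        return evaluate()
    return None

# Usage: with 2 processes, only the first one evaluates.
results = [run_eval_on_rank0(r, lambda: 69.55) for r in range(2)]
```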

whuhxb commented 2 years ago

Hi @haibo-qiu

Yes, it's caused by multi-GPU testing. Running with a single GPU works fine now.

guochengqian commented 2 years ago

Thanks @haibo-qiu. I will write clearer documentation.