whuhxb closed this issue 1 year ago.
As @xindeng98 mentioned in https://github.com/guochengqian/PointNeXt/issues/18#issuecomment-1182670679, this error is caused by the fact that test_entire_room does not support multi-GPU testing.
Hi @haibo-qiu
Yes, it's caused by multi-GPU testing. For now, running the test with just 1 GPU works.
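For anyone who wants to keep a multi-GPU launch, one possible workaround is to gate the whole-room test on the process rank so it always runs in a single process. This is only a minimal sketch of that idea; `run_worker` and the `evaluate` callable are illustrative names, not PointNeXt's actual API:

```python
def run_worker(rank, world_size, evaluate):
    """Entry point for each process spawned by mp.spawn (sketch).

    Training would run on every rank; the whole-room test, which does
    not support multi-GPU execution, is gated to rank 0 only.
    """
    if rank == 0:
        return evaluate()   # a single process sees the full room
    return None             # other ranks skip testing

# Simulating a 2-GPU launch: only rank 0 produces a metric.
print([run_worker(r, 2, lambda: 69.55) for r in range(2)])  # → [69.55, None]
```

In a real DDP setup you would also synchronize (e.g. with a barrier) before rank 0 starts evaluating, so all ranks have finished writing checkpoints.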
Thanks @haibo-qiu. I will write clearer documentation.
Hi @guochengqian
Have you ever encountered this bug when using two GPU cards? Thanks.
```
[07/26 11:32:16 S3DIS]: Epoch 100 LR 0.000012 train_miou 95.15, val_miou 68.49, best val miou 69.55
100%|██████████| 34/34 [00:41<00:00, 1.22s/it]
[07/26 11:32:18 S3DIS]: Best ckpt @E68, val_oa 89.78, val_macc 76.22, val_miou 69.55, iou per cls is: [93.14 97.92 83.66  0.   43.12 54.75 75.45 81.45 90.8  74.57 75.66 73.42 60.24]
[07/26 11:32:18 S3DIS]: Successful Loading the ckpt from log/s3dis/s3dis-train-pointnext-xl-ngpus2-seed7272-20220725-213543-eYW6GZURAs6oyghwFxnAPs/checkpoint/s3dis-train-pointnext-xl-ngpus2-seed7272-20220725-213543-eYW6GZURAs6oyghwFxnAPs_ckpt_best.pth
[07/26 11:32:18 S3DIS]: ckpts @ 68 epoch( {'best_val': 69.55094146728516} )
  0%|          | 0/68 [00:00<?, ?it/s]
  0%|          | 0/68 [00:27<?, ?it/s]
```
```
Traceback (most recent call last):
  File "examples/segmentation/main.py", line 529, in <module>
    mp.spawn(main, nprocs=cfg.world_size, args=(cfg,))  # original args=(cfg) runs with bugs, should be args=(cfg,)
  File "/export/home/myname/anaconda3/envs/openpoints/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/export/home/myname/anaconda3/envs/openpoints/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/export/home/myname/anaconda3/envs/openpoints/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/export/home/myname/anaconda3/envs/openpoints/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/export/home/myname/Documents/PointNeXt_code/PointNeXt/examples/segmentation/main.py", line 211, in main
    test_miou, test_macc, test_oa, test_ious, test_accs, _ = test_entire_room(model, cfg.dataset.common.test_area, cfg)
  File "/export/home/myname/anaconda3/envs/openpoints/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/export/home/myname/Documents/PointNeXt_code/PointNeXt/examples/segmentation/main.py", line 452, in test_entire_room
    cm.update(all_logits.argmax(dim=1), label)
  File "/export/home/myname/anaconda3/envs/openpoints/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/export/home/myname/Documents/PointNeXt_code/PointNeXt/examples/segmentation/../../openpoints/utils/metrics.py", line 69, in update
    unique_mapping = true.flatten() * self.virtual_num_classes + pred.flatten()
RuntimeError: The size of tensor a (719348) must match the size of tensor b (1438695) at non-singleton dimension 0
```
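The failing line in metrics.py requires the prediction and label tensors to have the same number of elements; here `pred` is roughly twice the size of `true` (1438695 vs 719348), consistent with predictions being accumulated across two processes while labels stay per-process. A small pure-Python sketch of the same check (function and variable names are illustrative, not the openpoints API):

```python
def cm_update(true, pred, num_classes):
    # Mirrors the failing line in openpoints/utils/metrics.py:
    #   unique_mapping = true.flatten() * num_classes + pred.flatten()
    # Element-wise ops require both inputs to have the same length.
    if len(true) != len(pred):
        raise RuntimeError(
            f"The size of tensor a ({len(true)}) must match "
            f"the size of tensor b ({len(pred)}) at non-singleton dimension 0")
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(true, pred):
        cm[t][p] += 1
    return cm

labels = [0, 1, 1, 2]            # per-process ground truth for one room
preds_1gpu = [0, 1, 2, 2]        # single-process predictions: sizes match
print(cm_update(labels, preds_1gpu, 3))  # → [[1, 0, 0], [0, 1, 1], [0, 0, 1]]

preds_2gpu = preds_1gpu * 2      # predictions doubled across 2 processes
try:
    cm_update(labels, preds_2gpu, 3)
except RuntimeError as e:
    print(e)                     # the same size mismatch as in the traceback
```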