This is a CUDA OOM error. You can try a smaller input size or set batch_size=1. If that works, you can then try fp16 to train your model.
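For reference, in an MMDetection 2.x config these suggestions would typically look something like the sketch below; the _base_ path is only a placeholder for your own config, and the exact values are examples, not a verified fix:

# Minimal sketch of the memory-saving suggestions above.
_base_ = './my_retinanet_bifpn.py'   # placeholder for the user's own config file

# Batch size of 1 per GPU (the "bs=1" suggestion).
data = dict(samples_per_gpu=1)

# Mixed-precision training via mmcv's Fp16OptimizerHook.
fp16 = dict(loss_scale=512.)

# A smaller input size would be set through img_scale in the Resize step of
# train_pipeline, e.g. img_scale=(320, 320).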
I've tried resizing to (320, 320) and batch_size=1 but now it gives me this error.
Traceback (most recent call last):
File "D:\Leon\SpikeDetection\Wheat\tools\train.py", line 237, in <module>
main()
File "D:\Leon\SpikeDetection\Wheat\tools\train.py", line 226, in main
train_detector(
File "C:\Users\leonh\anaconda3\lib\site-packages\mmdet\apis\train.py", line 244, in train_detector
runner.run(data_loaders, cfg.workflow)
File "C:\Users\leonh\anaconda3\lib\site-packages\mmcv\runner\epoch_based_runner.py", line 127, in run
epoch_runner(data_loaders[i], **kwargs)
File "C:\Users\leonh\anaconda3\lib\site-packages\mmcv\runner\epoch_based_runner.py", line 50, in train
self.run_iter(data_batch, train_mode=True, **kwargs)
File "C:\Users\leonh\anaconda3\lib\site-packages\mmcv\runner\epoch_based_runner.py", line 29, in run_iter
outputs = self.model.train_step(data_batch, self.optimizer,
File "C:\Users\leonh\anaconda3\lib\site-packages\mmcv\parallel\data_parallel.py", line 75, in train_step
return self.module.train_step(*inputs[0], **kwargs[0])
File "C:\Users\leonh\anaconda3\lib\site-packages\mmdet\models\detectors\base.py", line 248, in train_step
losses = self(**data)
File "C:\Users\leonh\anaconda3\lib\site-packages\torch\nn\modules\module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Users\leonh\anaconda3\lib\site-packages\mmcv\runner\fp16_utils.py", line 110, in new_func
return old_func(*args, **kwargs)
File "C:\Users\leonh\anaconda3\lib\site-packages\mmdet\models\detectors\base.py", line 172, in forward
return self.forward_train(img, img_metas, **kwargs)
File "C:\Users\leonh\anaconda3\lib\site-packages\mmdet\models\detectors\single_stage.py", line 82, in forward_train
x = self.extract_feat(img)
File "C:\Users\leonh\anaconda3\lib\site-packages\mmdet\models\detectors\single_stage.py", line 45, in extract_feat
x = self.neck(x)
File "C:\Users\leonh\anaconda3\lib\site-packages\torch\nn\modules\module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Users\leonh\anaconda3\lib\site-packages\mmcv\runner\fp16_utils.py", line 110, in new_func
return old_func(*args, **kwargs)
File "C:\Users\leonh\anaconda3\lib\site-packages\mmdet\models\necks\bifpn.py", line 383, in forward
feats = stack_bifpn(feats)
File "C:\Users\leonh\anaconda3\lib\site-packages\torch\nn\modules\module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Users\leonh\anaconda3\lib\site-packages\mmdet\models\necks\bifpn.py", line 294, in forward
feats.append(new_op_node(input_node))
File "C:\Users\leonh\anaconda3\lib\site-packages\torch\nn\modules\module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Users\leonh\anaconda3\lib\site-packages\mmdet\models\necks\bifpn.py", line 156, in forward
x += w[i] * inputs[i]
RuntimeError: The size of tensor a (30) must match the size of tensor b (32) at non-singleton dimension 2
The tensor sizes did not match in BiFPN; you need to check your code.
I'm trying to figure it out but can't seem to get my head around it. When I pass an input of size (640, 640) or (320, 320) to the BiFPN module on its own, I get results, but when I use it through the config file as above, I get a CUDA OOM for input size (640, 640) and a tensor size mismatch for (320, 320). I understand that the CUDA OOM might be caused by the ResamplingConv class, but all it does is rescale the last input layer with MaxPool2d, which shouldn't be that memory-intensive; I could be wrong, though. As for input size (320, 320), I don't understand why there would be a size mismatch at all. I've tried to work it out for hours, but I haven't had any luck.
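To make the mismatch concrete, here is a minimal, self-contained sketch (the shapes are illustrative, not taken from the actual config) that reproduces the failure at the fusion line x += w[i] * inputs[i] and shows one defensive way the inputs could be aligned before the weighted sum; this is only an illustration, not the fix used by mmdet's BiFPN:

import torch
import torch.nn.functional as F

# Two BiFPN inputs whose spatial sizes disagree, as in the traceback
# (30 vs. 32 along dimension 2). Shapes are illustrative only.
a = torch.randn(1, 64, 30, 30)
b = torch.randn(1, 64, 32, 32)
w = torch.softmax(torch.randn(2), dim=0)

# This line reproduces the RuntimeError from bifpn.py:
#   x = w[0] * a + w[1] * b   # -> sizes 30 and 32 cannot broadcast

# One option is to resize every input to a common size before fusing.
target = b.shape[-2:]
inputs = [F.interpolate(t, size=target, mode='nearest') if t.shape[-2:] != target else t
          for t in (a, b)]
x = sum(wi * t for wi, t in zip(w, inputs))
print(x.shape)  # torch.Size([1, 64, 32, 32])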
Checklist
Describe the bug
I am trying to train a RetinaNet model with a BiFPN neck. I've created the necessary config files, but the code stops in the training phase with the error shown in the traceback.
Reproduction
Environment
I'm also adding the code of the BiFPN neck that I tried to use:
Error traceback
Bug fix
I think the reason might be the torch._C._nn.upsample_nearest2d function, but I don't know for sure. Any help to solve this would be much appreciated.
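If upsampling with a fixed scale factor really is involved, one plausible way to end up with exactly 30 vs. 32 is sketched below. The MaxPool2d parameters (kernel_size=3, stride=2, no padding) are an assumption about what ResamplingConv might be doing, not taken from the actual code:

import torch
import torch.nn.functional as F

# Hypothetical scenario: a 32x32 pyramid level is downsampled by a MaxPool2d
# whose output size is floored, then upsampled again with scale_factor=2.
lateral = torch.randn(1, 64, 32, 32)

# floor((32 - 3) / 2) + 1 = 15, so the pooled map is 15x15.
top = F.max_pool2d(lateral, kernel_size=3, stride=2)

# Nearest upsampling by a fixed factor of 2 gives 30x30, not 32x32.
up = F.interpolate(top, scale_factor=2, mode='nearest')
print(lateral.shape[-1], up.shape[-1])  # 32 30 -- the sizes from the error

# Upsampling to an explicit size instead of a scale factor restores alignment.
up_fixed = F.interpolate(top, size=lateral.shape[-2:], mode='nearest')
print(up_fixed.shape[-1])  # 32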