johschmidt42 / PyTorch-Object-Detection-Faster-RCNN-Tutorial


Error Using FPN Backbone #1

Closed gsid888 closed 3 years ago

gsid888 commented 3 years ago

The error assert len(grid_sizes) == len(strides) == len(cell_anchors) occurs when I use an FPN backbone with the model.

johschmidt42 commented 3 years ago

Could you provide some more information? What's your anchor_size and aspect_ratios?

johschmidt42 commented 3 years ago

Closing due to inactivity.

mzettwitz commented 3 years ago

Hi @johschmidt42, first of all: great tutorial! It really helped me overcome the struggles of learning a new framework such as PyTorch, and it pointed me to some nice extras such as Lightning and loggers (MLflow). So thank you very much for the detailed tutorial and the effort you have put in.

I have a problem with the FPN backbone, too, and I am not quite sure which parts I have to alter. So far, I've set FPN to True, which results in an error telling me that the anchor size must be Tuple[Tuple[int]]. Changing the anchor size from 'ANCHOR_SIZE': ((32, 64, 128, 256, 512),) to ((32,), (64,), (128,), (256,), (512,),), as seen in other tutorials for R-CNN with FPN, did not help. Can you please point me in the right direction (or, if it's not too much effort, give a short example of which params I have to alter and in which form)?

Greetings from Germany :) Martin.

johschmidt42 commented 3 years ago

Hi, thank you! Simply remove one of the anchor_size values: from ((32,), (64,), (128,), (256,), (512,),) to ((32,), (64,), (128,), (256,),).

Reason: this is actually well explained by the error message itself: "Anchors should be Tuple[Tuple[int]] because each feature map could potentially have different sizes and aspect ratios. There needs to be a match between the number of feature maps passed and the number of sizes / aspect ratios specified."
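For reference, here is a minimal sketch (not the tutorial's exact code; the import path and aspect ratios are assumptions) of an anchor generator whose number of size tuples matches a backbone that emits four feature maps:

```python
# Minimal sketch: the number of size tuples must equal the number of
# feature maps the backbone returns (four here, after dropping avg-pool).
from torchvision.models.detection.anchor_utils import AnchorGenerator
# (older torchvision versions import it from torchvision.models.detection.rpn)

anchor_sizes = ((32,), (64,), (128,), (256,))            # one tuple per feature map
aspect_ratios = ((0.5, 1.0, 2.0),) * len(anchor_sizes)   # same ratios for each map

anchor_generator = AnchorGenerator(sizes=anchor_sizes,
                                   aspect_ratios=aspect_ratios)
```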

Better explanation: the get_resnet_backbone function I wrote returns a ResNet backbone pretrained on ImageNet, but it also removes the average-pooling layer and the linear layer at the end. This is a bit different from other tutorials, which (if I remember correctly) don't remove the average-pooling layer. By removing this layer, you have 4 feature map layers that you can use to create reference boxes with the anchor generator (instead of 5). If I also remember correctly, the last layer was dropped anyway, so there was no reason to specify 5 different anchor sizes ((32,), (64,), (128,), (256,), (512,),), because 512 wouldn't have been used anyway. I discovered this when I looked at the implementation of the anchor generator, which I recommend doing!
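If it helps, here is a rough illustration (my own sketch using torchvision's standard ResNet layer names, not the tutorial's get_resnet_backbone) of why dropping the classification head leaves four feature map stages for the FPN:

```python
# Rough illustration (assumed torchvision ResNet layer names):
# tapping the four residual stages that feed the FPN.
import torch
from torchvision.models import resnet18
from torchvision.models._utils import IntermediateLayerGetter

resnet = resnet18(pretrained=True)  # pretrained on ImageNet
# layer1..layer4 are the four residual stages; avg-pool and fc are unused.
return_layers = {"layer1": "0", "layer2": "1", "layer3": "2", "layer4": "3"}
body = IntermediateLayerGetter(resnet, return_layers=return_layers)

features = body(torch.rand(1, 3, 512, 512))
for name, feature_map in features.items():
    print(name, feature_map.shape)
# -> four maps at strides 4/8/16/32, hence four anchor size tuples
```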

mzettwitz commented 3 years ago

Thank you for the very fast response. It works now, and many thanks as well for the detailed explanation. I'll also take a deeper look into the anchor generator and the backbone_resnet files :) Background: I need to train a detector for handwriting/signatures in documents, where image-based backbones might not be perfect, though the first results are not bad at all.

Edit: I just saw that the text detector engines in MMDetection/OCR use pre-trained image backbones, too, though they focus on text detection in real-world images (signs, etc.). Best, Martin

johschmidt42 commented 3 years ago

That's a very interesting topic; I'd like to see the results you get! Any chance to follow your progress?

mzettwitz commented 3 years ago

Unfortunately, our repo is hosted on a private GitLab, but I'll keep you updated and share some results as soon as I have something interesting. Thanks to MLflow (and again thanks to you for highlighting loggers in your tutorial), I can easily keep track of the setups :) However, the training takes a long time, so a grid search (or random search) for the best setup will take a while. Additionally, I'll try to run a setup with MMDetection in the future, so that I can try a wider variety of implementations.

Until then, stay safe! Martin