dreamay closed this issue 5 years ago
Hi dreamay,
the section of the sentence you cited deals with a theoretical investigation of the relationship between object size and the IoU we can reach, considering that late feature layers of CNNs have a small spatial size. Basic SSD places prior boxes at the center of each feature cell. E.g., if we take an input image of size 100x100 and the feature layer we use for prediction is of size 10x10, each feature cell covers a 10x10 region of the image, i.e. each pixel of the feature map maps onto a 10x10 image region. So the coordinates of the first prior box center would be (5,5) and of the second one (15,5).
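For illustration, the cell-to-center mapping above can be written as a minimal Python sketch (my own illustration, not code from the repository):

```python
# Prior-box centers for basic SSD: one prior at the center of each
# feature cell, assuming a 100x100 input and a 10x10 feature map.
img_size, fmap_size = 100, 10
stride = img_size // fmap_size  # each cell covers a 10x10 image region

centers = [(stride * (j + 0.5), stride * (i + 0.5))
           for i in range(fmap_size) for j in range(fmap_size)]
print(centers[0], centers[1])  # (5.0, 5.0) (15.0, 5.0)
```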
My idea was to not only use prior boxes at the center of each cell, but an arbitrary number of priors per cell. This ensures a sufficient overlap between at least one prior box and small ground truths. Formula 4 tells us that if we want an overlap of 0.5 for ALL ground truths, the stride of the prior boxes can be at most 34% of the smallest ground truth width. That means, if we have a ground truth with a width of 5 pixels, the stride of the prior boxes should be <= 0.34*5 pixels = 1.7 ~= 2 pixels. With this knowledge, I then calculate how many prior boxes I have to place per feature cell.
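To see where a bound like this comes from, here is a rough geometric sketch of the worst-case IoU for square boxes (my own approximation, not the paper's exact Formula 4): the worst grid placement puts the ground-truth center half a stride away from the nearest prior center in both x and y.

```python
def iou(w, dx, dy):
    """IoU of two axis-aligned w x w boxes whose centers differ by (dx, dy)."""
    ix = max(0.0, w - abs(dx))
    iy = max(0.0, w - abs(dy))
    inter = ix * iy
    union = 2.0 * w * w - inter
    return inter / union

def worst_case_iou(w, stride):
    # worst case: the ground-truth center lies half a stride from the
    # nearest prior center in both dimensions
    return iou(w, stride / 2.0, stride / 2.0)

for s in (1, 1.7, 2, 4):
    print(s, round(worst_case_iou(5, s), 3))
```

Under this simple model, a stride of 1.7 px still keeps the worst-case IoU of a 5-pixel object above 0.5, while a stride of 4 px drops it well below even the 0.3 matching threshold.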
If we use the original SSD, the detection of very small objects is highly unlikely because the stride is too high. And you are right, in the original SSD only layers conv1-conv3 satisfy the condition, but with the proposed stride adaptation this condition can also be satisfied in later layers.
Best, Julian
@julimueller Thanks for your reply. But I don't understand why only layers conv1 - conv3 can satisfy this condition (Table I). Can you explain it? Thanks.
Best, Dreamay
Okay, I'll try to explain it again.
SSD only trains on ground truths which have an IoU greater than 0.3 or 0.5 (you can choose) with at least one prior box. Our input image is of size 2048x512. If we now take e.g. conv3 for prediction, which gives us a feature map of size 253x1021, the SSD prior box layer would place prior boxes with a stride of approximately 2. The feature map of conv4 is even smaller, of size 124x508. If we took this layer for prediction, the prior boxes would have a stride of 4.
With a stride of 4 we cannot guarantee that each ground truth with a width of 5 pixels reaches an overlap of 0.3 or greater with at least one prior box. In consequence, these ground truths would NOT be used for training. If we want to guarantee that all ground truths with a width of 5 pixels are used for training in the ORIGINAL SSD, only conv1-conv3 can be used, because in these layers the stride is "small enough".
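The layer selection then amounts to a simple check against the stride bound. A sketch (the per-layer strides here are assumptions consistent with the numbers in this thread, not values read from the repository):

```python
# Assumed prior-box strides per layer, consistent with the thread:
# conv3 -> stride ~2, conv4 -> stride 4; conv1/conv2 assumed finer.
layer_strides = {"conv1": 1, "conv2": 1, "conv3": 2, "conv4": 4, "conv5": 8}

min_gt_width = 5
# stride <= 0.34 * width, rounded to whole pixels as above (1.7 ~= 2)
max_stride = round(0.34 * min_gt_width)

usable = [name for name, s in layer_strides.items() if s <= max_stride]
print(usable)  # ['conv1', 'conv2', 'conv3']
```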
@julimueller Thank you very much. I fully understand. Thank you again.
Best, Dreamay
Hello @julimueller, it's me again,
I am trying to apply the same stride adaptation idea to Faster R-CNN, since the strides of the input features of the FPN are all bigger than 4, which makes it miss many GTs (especially when the images are resized). However, as in #15, I think I need to change something other than only the offsets, and I could not understand the num_output here: https://github.com/julimueller/tl_ssd/issues/15#issuecomment-615789247 In addition, if I make a structural change like this, then I cannot use pretrained weights, right?
Best Regards, Yusuf
The formula to calculate num_output is:
num_output = 4 * count_offset_w * count_offset_h * num_min_sizes
So in case we have:
offset_w = 0.5
offset_h = 0.5
it is:
num_output = 4 * 1 * 1 * 7 = 28
In case we change it to:
offset_w: 0.2
offset_w: 0.4
offset_w: 0.6
offset_w: 0.8
offset_h: 0.5
it is num_output = 4 * 4 * 1 * 7 = 112
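In code, the formula amounts to the following (a minimal sketch; the function name is mine, not from the repository):

```python
def num_output(offsets_w, offsets_h, num_min_sizes):
    # 4 box coordinates per prior, and one prior for every
    # (offset_w, offset_h, min_size) combination
    return 4 * len(offsets_w) * len(offsets_h) * num_min_sizes

print(num_output([0.5], [0.5], 7))                 # 28
print(num_output([0.2, 0.4, 0.6, 0.8], [0.5], 7))  # 112
```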
Regarding pretrained weights: do you mean using only a pretrained backbone (without the prior box / location prediction layers) or a pretrained SSD model (with the prior box / location prediction layers)? The latter cannot be used if you change the number of prior boxes (because the dimensions differ); the former can still be used, because the location prediction layers are trained from scratch anyway. So if you change the prior boxes, you also need to retrain your network.
In the paper, I don’t understand this sentence-"In consequence, a maximum stride of 0.34·5 pixels =1.7 pixels is needed to guarantee a detection of objects with a width of 5 pixels. As seen in Table I, only layer conv 1 - conv 3 can satisfy this condition." Can you explain it? I really want to know this answer. Thanks a lot.