julimueller / tl_ssd


About max stride #3

Closed dreamay closed 5 years ago

dreamay commented 5 years ago

In the paper, I don't understand this sentence: "In consequence, a maximum stride of 0.34·5 pixels = 1.7 pixels is needed to guarantee a detection of objects with a width of 5 pixels. As seen in Table I, only layer conv 1 - conv 3 can satisfy this condition." Can you explain it? I really want to know the answer. Thanks a lot.

julimueller commented 5 years ago

Hi dreamay,

the part of the sentence you cited deals with a theoretical investigation of the relationship between object size and the IoU we can reach, considering that late feature layers of CNNs have small spatial size. Basic SSD places prior boxes at the center of each feature cell. E.g. if we take an input image of size 100x100 and the feature layer we use for prediction is of size 10x10, then each pixel of the feature map maps onto a 10x10 region of the image. So the coordinates of the first prior box would be (5,5) and of the second one (15,5).
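A minimal sketch of this center placement (plain Python; the function name and structure are my own illustration, not code from tl_ssd):

```python
# Sketch: default SSD prior-box centers, one prior per feature cell,
# placed at the cell center (offset 0.5), as in the 100x100 / 10x10
# example above.

def prior_box_centers(image_size, feature_size):
    cell_size = image_size / feature_size      # 100 / 10 = 10 px per cell
    return [(cell_size * (i + 0.5), cell_size * (j + 0.5))
            for j in range(feature_size)
            for i in range(feature_size)]

centers = prior_box_centers(100, 10)
print(centers[0], centers[1])   # (5.0, 5.0) (15.0, 5.0)
```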

My idea was not to use only a prior box at the center of each cell, but an arbitrary number of priors per cell. This way I get a sufficient overlap between prior boxes and small ground truths. Formula 4 tells us that if we want an overlap of 0.5 for ALL ground truths, the stride of the prior boxes can be at most 34% of the smallest ground-truth width. That means that for a ground truth with a width of 5 pixels, the stride of the prior boxes should be <= 0.34 * 5 pixels = 1.7 pixels (roughly 2 pixels). With this knowledge I can then calculate how many prior boxes to design per feature cell.
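As a rough illustration of that calculation (my own sketch; the 0.34 factor is Formula 4 from the paper, the function names are hypothetical):

```python
import math

# Sketch: how many evenly spaced priors a feature cell needs so that
# the effective prior stride stays below the Formula-4 bound of
# 0.34 * object width.

def max_prior_stride(min_object_width, iou_factor=0.34):
    return iou_factor * min_object_width       # 0.34 * 5 = 1.7 px

def priors_per_cell(cell_stride, min_object_width):
    # n evenly spaced priors per cell reduce the effective stride to
    # cell_stride / n; pick the smallest n that satisfies the bound.
    return math.ceil(cell_stride / max_prior_stride(min_object_width))

print(max_prior_stride(5))      # 1.7
print(priors_per_cell(4, 5))    # a conv4-like stride of 4 -> 3 priors
```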

If we use the original SSD, a detection of very small objects is highly unlikely because the stride is too high. And you are right, in original SSD only layers conv1-conv3 satisfy the condition, but with the proposed stride adaptation it can also be satisfied in later layers.

Best, Julian

dreamay commented 5 years ago

@julimueller Thanks for your reply. But I don't understand why only layers conv 1 - conv 3 in Table I can satisfy this condition. Can you explain it? Thanks.

Best, Dreamay

julimueller commented 5 years ago

Okay, I'll try to explain it again.

SSD only trains on ground truths that have an IoU greater than 0.3 or 0.5 (you can choose) with at least one prior box. Our input image is of size 2048x512. If we now take e.g. conv3 for prediction, which gives us a feature map of size 253x1021, the SSD prior box layer would place prior boxes with a stride of approximately 2. The feature map of conv4 is even smaller, of size 124x508; if we took this layer for prediction, the prior boxes would have a stride of 4.

With a stride of 4 we cannot guarantee that every ground truth with a width of 5 pixels reaches an overlap of 0.3 or greater with at least one prior box. In consequence, these ground truths would NOT be used for training. If we want to guarantee that all ground truths with a width of 5 pixels are used for training in ORIGINAL SSD, only conv1-conv3 can be used, because in these layers the stride is "small enough".
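The same check in code (the conv3/conv4 feature-map widths are the ones quoted above; the conv1/conv2 widths and the rounding of 1.7 to roughly 2 pixels are my assumptions for illustration):

```python
# Sketch: per-layer prior-box stride vs. the 1.7 px bound for 5 px objects.
# conv3/conv4 widths come from the comment above; conv1/conv2 are assumed
# to be at full input resolution here.

IMAGE_WIDTH = 2048
MAX_STRIDE = 0.34 * 5                       # 1.7 px (Formula 4)

feature_widths = {"conv1": 2048, "conv2": 2048, "conv3": 1021, "conv4": 508}

for layer, width in feature_widths.items():
    stride = round(IMAGE_WIDTH / width)     # ~1, ~1, ~2, ~4
    usable = stride <= round(MAX_STRIDE)    # 1.7 rounded to ~2, as above
    print(f"{layer}: stride ~{stride} -> {'usable' if usable else 'too coarse'}")
```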

dreamay commented 5 years ago

@julimueller Thank you very much. I fully understand. Thank you again.

Best, Dreamay

yusiyoh commented 2 years ago

Hello @julimueller, it's me again,

I am trying to apply the same stride-adaptation idea to Faster R-CNN, since the strides of the input features of the FPN are all bigger than 4, which makes it miss many GTs (especially when the images are resized). However, as in #15, I think I need to change more than just the offsets, and I could not understand the num_output here: https://github.com/julimueller/tl_ssd/issues/15#issuecomment-615789247 In addition, if I make a structural change like this, then I cannot use pretrained weights, right?

Best Regards, Yusuf

julimueller commented 2 years ago

The formula to calculate num_output is:

num_output = 4 * count_offset_w * count_offset_h * num_min_sizes

So in case we have:

offset_w = 0.5
offset_h = 0.5 

it is:

num_output = 4 * 1 * 1 * 7 = 28

In case we change it to:

offset_w: 0.2
offset_w: 0.4
offset_w: 0.6
offset_w: 0.8
offset_h: 0.5

it is num_output = 4 * 4 * 1 * 7 = 112
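Put as a small helper (my naming; the formula and both worked examples are the ones above):

```python
# Sketch: num_output of the location prediction layer -- 4 box
# coordinates per prior, one prior per (offset_w, offset_h)
# combination and per min_size.

def num_output(count_offset_w, count_offset_h, num_min_sizes):
    return 4 * count_offset_w * count_offset_h * num_min_sizes

print(num_output(1, 1, 7))   # 28  (single offset pair 0.5 / 0.5)
print(num_output(4, 1, 7))   # 112 (four offset_w values, one offset_h)
```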

Regarding pretrained weights: do you refer to using only a pretrained backbone (without prior box / location prediction layers) or a pretrained SSD (with prior box / location prediction layers)? The latter cannot be used if you change the number of prior boxes (because of differing dimensions); the former can be used, because the location prediction layers are trained from scratch anyway. If you change the prior boxes, you also need to retrain your network.