A convolutional layer without padding can only guarantee that the center of the original image and the center of the feature map stay aligned. Therefore, a pixel position on the feature map is mapped back to the original search image as (w//2, h//2) + stride * (dist_from_center). Note that this is a mapping between pixel centers, not between the corresponding search-image areas. In other words, the 136 is the farthest extent of the center points.
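To illustrate, here is a minimal sketch of that center-to-center mapping, assuming the AlexNet setting from this thread (303×303 search image, total stride 8, 17×17 score map); the variable names are mine, not from the repo:

```python
import numpy as np

search_size = 303
total_stride = 8
score_size = 17

center = search_size // 2                                   # w//2 = h//2 = 151
dist_from_center = np.arange(score_size) - score_size // 2  # -8 .. 8

# image-plane coordinate (x or y) of every score-map cell center:
# (w//2) + stride * dist_from_center
cell_centers = center + total_stride * dist_from_center
print(cell_centers.min(), cell_centers.max())
# -> 87 215: with one 8 px bin per cell, the centers cover 17 * 8 = 136 px
```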
I see. Thanks for replying.
Take AlexNet for example. As far as I know, the network takes a 303×303 search image as input, performs several downsampling steps, and finally outputs a 17×17 feature map of classification scores and centerness scores (fcos_cls_score_final in the code). So the 17×17 map corresponds to the 303×303 input. Even accounting for some border information lost to the lack of padding, the 17×17 map should still correspond to an area of roughly 200×200, in my opinion. However, when it comes to target generation in make_densebox_target.py, we just crop a (17×8)×(17×8) = 136×136 area from the center of the 303×303 image and use it to compute the target. This area seems too small and does not match the network output. In other words, the 17×17 map corresponds to a much larger area of the input search image in the network, but to a very small area during target computation, which does not seem reasonable to me.
Maybe I misunderstand some of the code; feel free to point out where.
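To put rough numbers on the mismatch, here is a standard receptive-field calculation. The (kernel, stride) list below is the classic SiamFC-style AlexNet backbone and is my assumption, not copied from this repo's config:

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) pairs, input to output.
    Returns (receptive_field, total_stride) of one output cell."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump  # each layer widens the field by (k-1) input steps
        jump *= s             # and multiplies the effective stride
    return rf, jump

# conv1 11/2, pool1 3/2, conv2 5/1, pool2 3/2, conv3 3/1, conv4 3/1, conv5 3/1
alexnet_like = [(11, 2), (3, 2), (5, 1), (3, 2), (3, 1), (3, 1), (3, 1)]
rf, stride = receptive_field(alexnet_like)
print(rf, stride)  # -> 87 8
```

With an ~87×87 receptive field per cell at total stride 8, the 17×17 map actually reads from roughly a 215×215 region of the 303×303 input (cell centers at 87..215, plus ~43 px on each side), which is why the central 136×136 crop used in make_densebox_target.py looked too small to me.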