megvii-research / video_analyst

A series of basic algorithms that are useful for video understanding, including Single Object Tracking (SOT), Video Object Segmentation (VOS) and so on.
MIT License
832 stars · 176 forks

Feature map of network output and target seem not match in tracking #103

Closed LNoving closed 4 years ago

LNoving commented 4 years ago

Take AlexNet as an example. As far as I know, the network takes a 303×303 search image as input, applies a few downsampling stages, and finally outputs a 17×17 feature map of classification and center scores (fcos_cls_score_final in the code). So the 17×17 map corresponds to the 303×303 image. Even accounting for some border information loss caused by the lack of padding, the 17×17 map should still correspond to an area of maybe 200×200, in my opinion. However, during target generation in make_densebox_target.py, we just crop a (17×8)×(17×8) = 136×136 area from the center of the 303×303 image and use it to compute the target. This area seems too small and doesn't match the network output. In other words, the 17×17 map seems to correspond to a much larger area of the search image inside the network, but to a very small area during target computation, which doesn't seem reasonable to me.

Maybe I misunderstand some code, welcome to point out.
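For context, a quick receptive-field calculation supports the premise that each output cell "sees" a large patch of the input. The layer list below is an assumption based on the standard SiamFC-style AlexNet backbone (it is not copied from video_analyst's source; adjust it if the actual backbone differs):

```python
# Hypothetical (kernel_size, stride) list for a SiamFC-style AlexNet
# backbone. NOT taken from video_analyst's code -- an illustrative
# assumption only.
LAYERS = [
    (11, 2),  # conv1
    (3, 2),   # maxpool1
    (5, 1),   # conv2
    (3, 2),   # maxpool2
    (3, 1),   # conv3
    (3, 1),   # conv4
    (3, 1),   # conv5
]

def receptive_field(layers):
    """Compute the receptive field and total stride of a conv/pool stack."""
    rf, total_stride = 1, 1
    for k, s in layers:
        rf += (k - 1) * total_stride  # each layer widens the receptive field
        total_stride *= s             # and compounds the cumulative stride
    return rf, total_stride

rf, stride = receptive_field(LAYERS)
print(rf, stride)  # -> 87 8
```

Under this assumption every cell of the 17×17 map has an 87×87 receptive field on the search image, so collectively the output covers far more than a 136×136 region.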

lzx1413 commented 4 years ago

A convolutional stack without padding only guarantees that the centers of the original image and the feature map stay aligned. Therefore, a pixel position on the feature map is mapped onto the original search image by (w//2, h//2) + stride*(dist_from_center). Note that this is a mapping between the centers of pixels, not between the corresponding receptive-field areas of the search image. In other words, 136 is the farthest extent spanned by those center points.
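Concretely, the center-to-center mapping described above can be sketched as follows. The numbers (x_size=303, score_size=17, total_stride=8) match the AlexNet configuration discussed in this thread; the function name is illustrative, not from the repository:

```python
def feat_to_image_coords(x_size=303, score_size=17, total_stride=8):
    """Map each feature-map cell index to the search-image pixel it is
    centered on. A no-padding conv stack keeps centers aligned, so cell i
    maps to offset + total_stride * i, with the offset chosen so the grid
    of cell centers is centered on the image."""
    offset = (x_size - 1 - (score_size - 1) * total_stride) / 2
    return [offset + total_stride * i for i in range(score_size)]

coords = feat_to_image_coords()
print(coords[0], coords[len(coords) // 2], coords[-1])  # -> 87.0 151.0 215.0
```

So the 17 cell centers sit at pixels 87, 95, ..., 215 of the 303-pixel axis: the centers span 128 px and the cells cover roughly 17 × 8 = 136 px, which is the region cropped in make_densebox_target.py, even though each cell's receptive field extends well beyond that.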

LNoving commented 4 years ago

I see. Thanks for replying.