tusen-ai / simpledet

A Simple and Versatile Framework for Object Detection and Instance Recognition
Apache License 2.0

Anchor boxes configuration for RPN #215

Closed apatsekin closed 5 years ago

apatsekin commented 5 years ago

In the TridentNet configuration file (here), the Region Proposal Network parameters are defined as follows:

class anchor_generate:
    scale = (2, 4, 8, 16, 32)
    ratio = (0.5, 1.0, 2.0)
    stride = 16
    image_anchor = 256

What do image_anchor and scale actually mean? My initial assumption was that it takes a base anchor size of 256*256 and produces 5 scales and 3 ratios according to the scale and ratio multipliers. However, scale 32 for 256^2 (a 32*256 x 32*256 box) doesn't make any sense. Could someone explain in detail what exactly those parameters mean? The source code just passes them to an mxnet function, which looks like an interface without an implementation, and the documentation just says "Used to generate anchor windows by enumerating scales".

RogerChern commented 5 years ago

Sorry for the confusion. image_anchor is the number of anchors per image selected for training. We will add a fully annotated config soon.

RogerChern commented 5 years ago

Solved by #217

apatsekin commented 5 years ago

Thank you @RogerChern for the quick response!

Please correct me if I am wrong:

class anchor_generate:
    scale = (2, 4, 8, 16, 32) # corresponds to 32x32, 64x64, 128x128, 256x256, 512x512 of original image
    ratio = (0.5, 1.0, 2.0) # 16x32, 32x32, 32x64, etc...
    stride = 16 # one "pixel" in feature map corresponds to 16x16 of original image
    image_anchor = 256 # number of top confidence regions passed to classification head out of RPN
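
For reference, the scale arithmetic above can be checked with a small sketch. It assumes the conventional Faster R-CNN anchor enumeration: the base anchor side equals stride, each scale multiplies that side, and ratio means height / width with the area held constant. The helper name anchor_shapes is hypothetical and is not part of simpledet or mxnet; the actual operator may round or define ratio differently.

```python
import math

def anchor_shapes(stride, scales, ratios):
    """Return (width, height) in input-image pixels for each ratio/scale pair."""
    shapes = []
    for ratio in ratios:
        for scale in scales:
            side = stride * scale             # side of the square anchor at ratio 1.0
            area = side * side
            width = math.sqrt(area / ratio)   # assumption: ratio = height / width
            height = width * ratio            # area is preserved across ratios
            shapes.append((round(width), round(height)))
    return shapes

shapes = anchor_shapes(stride=16, scales=(2, 4, 8, 16, 32), ratios=(0.5, 1.0, 2.0))
square_sides = [w for (w, h) in shapes if w == h]
print(square_sides)
```

Under these assumptions the five scales give square anchors with sides 32, 64, 128, 256 and 512, and each scale is repeated for the three ratios, so every feature-map location gets 15 anchors. Note that an area-preserving ratio of 0.5 at scale 2 gives roughly 45x23 rather than 16x32.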

I also read your doc with the config comments. Taking the same TridentNet config as an example:

class subsample_proposal:
    proposal_wo_gt = True # not sure what "proposals without ground truth" means in this context
    image_roi = 128 # number of anchor boxes randomly sampled for training
    fg_fraction = 0.5 # fraction of foreground boxes among those 128. Why does the doc say "the **maximum** fraction", given that FG boxes are scarce compared to BG and usually undersampled?
    fg_thr = 0.5 # boxes with GT IoU in range [0.5, 1.0] are assigned to FG for the loss function
    bg_thr_hi = 0.5 # boxes with IoU in range [0.0, 0.5] are assigned to background for the loss function
    bg_thr_lo = 0.0 # background boxes with IoU below this value are dropped from the loss?

The question is: the original Faster R-CNN paper uses [0, 0.3] for background, drops [0.3, 0.7] from the loss, and uses [0.7, 1.0] for foreground. In the TridentNet config, it looks like you use a strict [0, 0.5] for BG and [0.5, 1.0] for FG. Is that correct? Thanks again for being responsive!

RogerChern commented 5 years ago

@apatsekin https://github.com/TuSimple/simpledet/blob/78467b7233d33d09b692ee4ca5fb3cbe46b85ee5/doc/fully_annotated_config.py#L122-L136

Proposal subsampling is aimed at generating the targets for the RCNN bbox head, not the RPN head.
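
As a rough illustration of how such subsampling is conventionally done for the RCNN head, here is a minimal sketch. It is not simpledet's implementation: the function subsample_proposals is hypothetical, its parameter names are borrowed from the config above, and the boundary handling (fg when IoU >= fg_thr, bg when bg_thr_lo <= IoU < bg_thr_hi) is an assumption.

```python
import random

def subsample_proposals(max_ious, image_roi=128, fg_fraction=0.5,
                        fg_thr=0.5, bg_thr_hi=0.5, bg_thr_lo=0.0, seed=0):
    """Pick foreground/background proposal indices from each proposal's best GT IoU."""
    rng = random.Random(seed)
    # Assumed boundaries: fg is IoU >= fg_thr, bg is bg_thr_lo <= IoU < bg_thr_hi.
    fg = [i for i, iou in enumerate(max_ious) if iou >= fg_thr]
    bg = [i for i, iou in enumerate(max_ious) if bg_thr_lo <= iou < bg_thr_hi]
    # fg_fraction acts as an upper bound: with few foreground boxes, fewer are taken.
    max_fg = int(round(image_roi * fg_fraction))
    num_fg = min(max_fg, len(fg))
    # Background fills whatever remains of the image_roi budget.
    num_bg = min(image_roi - num_fg, len(bg))
    return rng.sample(fg, num_fg), rng.sample(bg, num_bg)

# Three proposals overlap a ground-truth box well; 300 barely overlap anything.
ious = [0.9, 0.7, 0.55] + [0.1] * 300
fg_idx, bg_idx = subsample_proposals(ious)
print(len(fg_idx), len(bg_idx))
```

In this sketch only 3 of the 128 sampled boxes end up as foreground, which is one way to read "the maximum fraction": fg_fraction caps the foreground share rather than guaranteeing it.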