rbgirshick / py-faster-rcnn

Faster R-CNN (Python implementation) -- see https://github.com/ShaoqingRen/faster_rcnn for the official MATLAB version

Problems with anchors? #112

Open JohnnyY8 opened 8 years ago

JohnnyY8 commented 8 years ago

In the "Faster R-CNN" paper there is a figure, shown below. I don't understand the anchors very well. How do the 256-D features propagate forward from the "intermediate layer" to the "cls layer" and the "reg layer"? According to the network description I saw in Caffe, the step from the sliding window to the intermediate layer and on to the (2+4) x k outputs is implemented by two kinds of convolution layers. So what do the anchors actually do? According to the text, 9 anchors are obtained from each sliding window. Our guess is that the parameters between the intermediate layer and the cls layer, or between the intermediate layer and the reg layer, are fixed so that a specific portion of the extracted window is used for classification and regression, but this is only a guess. [screenshot: the RPN figure from the paper]

We don't really understand Table 1 either. How do the scales and aspect ratios of the anchors given in the first row correspond to the generated proposal sizes given in the second row? [screenshot: Table 1 from the paper]

happyharrycn commented 8 years ago

Anchors are sampled local windows with different scales / aspect ratios. The Region Proposal Network (RPN) classifies each local anchor (window) as either foreground or background, and regresses the foreground windows to fit the ground-truth object bounding boxes. The details of how the anchors are defined can be found in ./lib/rpn/generate_anchors.py.
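
For intuition, here is a minimal sketch of how 9 anchors (3 scales x 3 aspect ratios) could be laid out around a single sliding-window position. This is my own simplification, not the repo's code; the actual generate_anchors.py uses a slightly different rounding scheme, so the numbers below only approximate its output.

    import numpy as np

    def toy_anchors(cx, cy, base=16, scales=(8, 16, 32), ratios=(0.5, 1.0, 2.0)):
        """Toy anchor enumeration: 'ratios' is height/width, 'scales' multiplies the 16x16 base."""
        boxes = []
        for s in scales:
            area = float(base * s) ** 2          # 128*128, 256*256, 512*512
            for r in ratios:
                w = np.sqrt(area / r)            # wider boxes for small height/width ratios
                h = w * r
                boxes.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
        return np.round(np.array(boxes))

    print(toy_anchors(cx=7.5, cy=7.5))           # 9 boxes [x1, y1, x2, y2] centred on one position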

JohnnyY8 commented 8 years ago

Hello @happyharrycn: I have read the source code in ./lib/rpn/generate_anchors.py. It shows how the anchors are generated, and I have some questions about it. Does the 16 x 16 base anchor correspond to the feature map of the last convolutional layer? In addition, I notice that anchor_target_layer.py is used during training and proposal_target_layer.py is used during testing. I do not understand these two layers. My guess is that training makes the weights fit the different scales and aspect ratios, so anchor_target_layer.py is not used during testing?

happyharrycn commented 8 years ago

The 16*16 case comes from the down-sampling factor at conv5; this number also controls the scale. anchor_target_layer matches ground-truth object boxes to the proposals and generates training labels for them, which are then needed by the loss function. It is therefore not required during testing.
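
To make the matching step concrete, here is a rough sketch of the labelling rule from the paper (my own simplification, not the actual anchor_target_layer code): anchors with IoU >= 0.7 against some ground-truth box, or the highest-overlap anchor for each ground-truth box, become foreground; anchors with IoU < 0.3 against every ground-truth box become background; the rest are ignored by the loss.

    import numpy as np

    def box_iou(anchors, gt_boxes):
        # anchors: (A, 4), gt_boxes: (G, 4); boxes are [x1, y1, x2, y2]
        ax1, ay1, ax2, ay2 = np.split(anchors.astype(float), 4, axis=1)   # each (A, 1)
        gx1, gy1, gx2, gy2 = gt_boxes.T.astype(float)                     # each (G,)
        iw = np.maximum(0.0, np.minimum(ax2, gx2) - np.maximum(ax1, gx1) + 1)
        ih = np.maximum(0.0, np.minimum(ay2, gy2) - np.maximum(ay1, gy1) + 1)
        inter = iw * ih                                                   # (A, G)
        area_a = (ax2 - ax1 + 1) * (ay2 - ay1 + 1)                        # (A, 1)
        area_g = (gx2 - gx1 + 1) * (gy2 - gy1 + 1)                        # (G,)
        return inter / (area_a + area_g - inter)

    def label_anchors(anchors, gt_boxes, pos_thresh=0.7, neg_thresh=0.3):
        # 1 = foreground, 0 = background, -1 = ignored by the loss
        ious = box_iou(anchors, gt_boxes)                # (A, G)
        best_per_anchor = ious.max(axis=1)
        labels = np.full(len(anchors), -1, dtype=np.int64)
        labels[best_per_anchor < neg_thresh] = 0         # low overlap with every gt box
        labels[best_per_anchor >= pos_thresh] = 1        # high overlap with some gt box
        labels[ious.argmax(axis=0)] = 1                  # best anchor for each gt box
        return labels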

JohnnyY8 commented 8 years ago

Yes, as you said, anchor_target_layer is not required during testing. But I still do not understand what the anchors do. In Table 1, does 128^2 mean 128 * 128 pixels in an anchor? I think the step from sliding windows to proposals is implemented by two convolutional layers, so how do the anchors affect the proposals? Could you please give me some details about this part of the process? Thank you! @happyharrycn

wangfeng1981 commented 7 years ago

I am confused about why an anchor of 128x128 produces a proposal of 188x111.

zhenni commented 7 years ago

@wangfeng1981

In ./lib/rpn/generate_anchors.py you can see that the 128x128 anchor with ratio 2:1 is [-83 -39 100 56], whose size is 184x96. The numbers JohnnyY8 quoted are average proposal sizes, so they will differ somewhat from the actual anchor sizes. Also, you can change the anchor sizes if your objects have unusual sizes, which may give you more accurate results.
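
To make the 184x96 explicit: the anchors are stored as [x1, y1, x2, y2] with inclusive corners, so the width and height work out as follows (roughly 128^2 in area, about 2:1 in width to height):

    x1, y1, x2, y2 = -83, -39, 100, 56      # the 128, ratio 2:1 anchor from generate_anchors.py
    w = x2 - x1 + 1                          # 184
    h = y2 - y1 + 1                          # 96
    print(w, h, w * h, 128 * 128)            # 184 96 17664 16384 -> only approximately 128^2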

@JohnnyY8

Although you do not need anchor_target_layer at test time, proposal_layer still does its part of the job there: it applies the classification scores and regression targets from the prediction layers to the anchors to produce the proposals.
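
For completeness, the regression targets mentioned above are applied to the anchors with the usual R-CNN box parameterization. A minimal sketch of that step, assuming the standard (tx, ty, tw, th) encoding and leaving out details such as clipping to the image, looks like this:

    import numpy as np

    def apply_deltas(anchors, deltas):
        # anchors: (N, 4) as [x1, y1, x2, y2]; deltas: (N, 4) as (tx, ty, tw, th)
        w = anchors[:, 2] - anchors[:, 0] + 1.0
        h = anchors[:, 3] - anchors[:, 1] + 1.0
        cx = anchors[:, 0] + 0.5 * w
        cy = anchors[:, 1] + 0.5 * h

        pred_cx = deltas[:, 0] * w + cx          # shift the centre
        pred_cy = deltas[:, 1] * h + cy
        pred_w = np.exp(deltas[:, 2]) * w        # rescale the size
        pred_h = np.exp(deltas[:, 3]) * h

        return np.stack([pred_cx - 0.5 * pred_w, pred_cy - 0.5 * pred_h,
                         pred_cx + 0.5 * pred_w, pred_cy + 0.5 * pred_h], axis=1)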

In ./lib/rpn/generate_anchors.py, they show the anchors as follows.

#    anchors =
#
#       -83   -39   100    56
#      -175   -87   192   104
#      -359  -183   376   200
#       -55   -55    72    72
#      -119  -119   136   136
#      -247  -247   264   264
#       -35   -79    52    96
#       -79  -167    96   184
#      -167  -343   184   360
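
If I read generate_anchors.py correctly, that table is what the default call produces: a 16x16 base anchor, ratios [0.5, 1, 2] and scales 2**np.arange(3, 6) = [8, 16, 32], which give the 128^2, 256^2 and 512^2 anchors from the paper.

    import numpy as np
    from generate_anchors import generate_anchors   # assumes lib/rpn is on PYTHONPATH

    anchors = generate_anchors(base_size=16,
                               ratios=[0.5, 1, 2],
                               scales=2**np.arange(3, 6))
    print(anchors)                                   # should reproduce the 9 rows above
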
shamanez commented 7 years ago

@zhenni

What are the height_stride and width_stride parameters in the anchor generator?

Here

zhenni commented 7 years ago

@shamanez The anchors behave like sliding windows. The strides are the distances between two adjacent anchors/sliding windows, vertically and horizontally.

For example, see how it works in lib/rpn/proposal_layer.py (where _feat_stride does not distinguish the horizontal and vertical strides.)

        anchor_scales = layer_params.get('scales', cfg.ANCHOR_SCALES)
        self._anchors = generate_anchors(scales=np.array(anchor_scales))
        # 1. Generate proposals from bbox deltas and shifted anchors
        height, width = scores.shape[-2:]

        # Enumerate all shifts
        shift_x = np.arange(0, width) * self._feat_stride
        shift_y = np.arange(0, height) * self._feat_stride
        shift_x, shift_y = np.meshgrid(shift_x, shift_y)
        shifts = np.vstack((shift_x.ravel(), shift_y.ravel(),
                            shift_x.ravel(), shift_y.ravel())).transpose()

        # Enumerate all shifted anchors:
        #
        # add A anchors (1, A, 4) to
        # cell K shifts (K, 1, 4) to get
        # shift anchors (K, A, 4)
        # reshape to (K*A, 4) shifted anchors
        A = self._num_anchors
        K = shifts.shape[0]
        anchors = self._anchors.reshape((1, A, 4)) + \
                  shifts.reshape((1, K, 4)).transpose((1, 0, 2))
        anchors = anchors.reshape((K * A, 4))
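
As a rough sanity check on the sizes involved (assuming a ~600x1000 input and the VGG-16 conv5 stride of 16, i.e. a feature map of roughly 40x60):

    height, width, A = 40, 60, 9      # approximate conv5 feature-map size and anchors per cell
    K = height * width                # 2400 shift positions
    print(K * A)                      # 21600 anchors before clipping/NMS -- the "about 20000" quoted in the paper
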
shamanez commented 7 years ago

@zhenni

So is it like this? The RPN uses the final conv layer to propose bounding boxes. More specifically, we slide a small window (3*3) over the last conv layer, and for the center of each window position we take K (9) anchors.

Now we have to map the center pixel position on the conv feature map back into the real image; that is what the first_stage_features_stride parameter specifies.

And it means that if we move one position on the conv feature map, it corresponds to 16 pixels (the stride) in the real image. Am I correct?

Another question: what should be the input to the network? What is meant by this parameter?

zhenni commented 7 years ago

@shamanez Yeah, I think you are right.

I guess it means that you resize the input image keeping its original aspect ratio, and make sure it has a minimum dimension (height/width) of 600 pixels and a maximum dimension (width/height) of 1024 pixels. (I have not checked the code of the TensorFlow version, so I am not completely sure.) For example, if you have an image of size 400x512, it will be resized to 600x768; if the original image size is 100x512, the code will resize it to 200x1024. Something like that...
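
A small sketch of that resizing rule, assuming the 600 / 1024 values above (this is my own illustration, not the TensorFlow code), reproduces both examples:

    def resize_dims(h, w, min_dim=600, max_dim=1024):
        # scale so the shorter side reaches min_dim, unless that pushes the longer side past max_dim
        scale = float(min_dim) / min(h, w)
        if max(h, w) * scale > max_dim:
            scale = float(max_dim) / max(h, w)
        return int(round(h * scale)), int(round(w * scale))

    print(resize_dims(400, 512))   # (600, 768)
    print(resize_dims(100, 512))   # (200, 1024)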

shamanez commented 7 years ago

@zhenni
What are these scale parameters in faster_rcnn_resnet101_pets.config?

In the paper they say they use squared scales of 128, 256 and 512. So at first I thought the scales were expressed as fractions of 256. But here there are four of them, so could you please elaborate on this?

zhenni commented 7 years ago

@shamanez Please check the code in grid_anchor_generator.py (link).

You can use anchor sizes that differ from what the paper describes. With smaller anchors you can detect smaller objects in the pictures; see the sketch after the docstring below.

    Args:
      scales: a list of (float) scales, default=(0.5, 1.0, 2.0)
      aspect_ratios: a list of (float) aspect ratios, default=(0.5, 1.0, 2.0)
      base_anchor_size: base anchor size as height, width
                        (length-2 float32 list, default=[256, 256])
      anchor_stride: difference in centers between base anchors for adjacent
                     grid positions (length-2 float32 list, default=[16, 16])
      anchor_offset: center of the anchor with scale and aspect ratio 1 for the
                     upper left element of the grid, this should be zero for
                     feature networks with only VALID padding and even receptive
                     field size, but may need additional calculation if other
                     padding is used (length-2 float32 tensor, default=[0, 0])
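
If I read grid_anchor_generator.py correctly, each (scale, aspect_ratio) pair is turned into an anchor height and width roughly as in the numpy sketch below, where aspect_ratio is width/height. The concrete scale values are an assumption on my part (the sample configs I have seen use 0.25, 0.5, 1.0, 2.0 with a 256x256 base, which yields the 64^2 to 512^2 anchors discussed above).

    import numpy as np

    scales = np.array([0.25, 0.5, 1.0, 2.0])      # assumed values from the config
    aspect_ratios = np.array([0.5, 1.0, 2.0])
    base_h, base_w = 256.0, 256.0                 # base_anchor_size

    s, r = np.meshgrid(scales, aspect_ratios)
    heights = (s / np.sqrt(r) * base_h).ravel()
    widths = (s * np.sqrt(r) * base_w).ravel()
    print(sorted(zip(heights, widths)))
    # e.g. scale 0.5 with ratio 1.0 gives a 128x128 anchor, scale 2.0 gives 512x512
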
shamanez commented 7 years ago

@zhenni I actually went through the whole repo. From what I understood, it takes a variable-size (in spatial dimensions) input image, bounded by the given aspect ratio. Then it performs the convolution part and gets the feature maps. From the feature maps, in order to get scores on the RPN or fee…

I also went through the image resize function, and you are correct. It keeps the aspect ratio of any image while keeping it bounded to that range, and it uses bilinear interpolation in order to reduce distortion.

shamanez commented 7 years ago

@zhenni This is what the TF repo says about the resize function:

  1. If the image can be rescaled so its minimum dimension is equal to the provided value without the other dimension exceeding max_dimension, then do so.
  2. Otherwise, resize so the largest dimension is equal to max_dimension.
gentlebreeze1 commented 6 years ago

Hi, my input image is 1920x1080. I changed C.TEST.SCALES = (1080,) and C.TEST.MAX_SIZE = 1920. How do I change the anchor size in grid_anchor_generator.py? @zhenni

zhenni commented 6 years ago

@gentlebreeze1 You can modify the function generate_anchors in lib/rpn/generate_anchors.py
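
For example (just an illustration using the function's actual default signature, with base_size=16), you could add a larger scale so the biggest anchor covers more of a 1920x1080 frame:

    import numpy as np
    from generate_anchors import generate_anchors   # lib/rpn/generate_anchors.py

    # defaults are scales (8, 16, 32) on a 16x16 base -> 128^2, 256^2, 512^2 anchors
    anchors = generate_anchors(base_size=16,
                               ratios=[0.5, 1, 2],
                               scales=np.array([8, 16, 32, 64]))   # 64 adds 1024^2 anchors
    print(anchors.shape)                             # (12, 4): 4 scales x 3 ratios

If I remember correctly, anchor_target_layer.py and proposal_layer.py pass their own scales when they call generate_anchors, so the change has to be made consistently there as well.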

ghost commented 6 years ago

@zhenni Hi zhenni, I was looking at this function, specifically base_size. Originally it was 16 in py-faster-rcnn, but in the TF version it is 256 by default. I was wondering: if we don't filter out small objects (by setting RPN_MIN_SIZE: 0), should we change base_size to 0? But then I see base_size is used to create an array: base_anchor = np.array([1, 1, base_size, base_size]) - 1

zhenni commented 6 years ago

@loackerc Hi loackerc, I am not sure I understand the question. I think the anchor size may need to be changed according to your data. As for the array created from base_size: base_anchor holds the top-left and bottom-right corners of the box, i.e. [left-top-x, left-top-y, right-bottom-x, right-bottom-y].
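
Concretely, with the py-faster-rcnn default base_size of 16:

    import numpy as np

    base_size = 16
    base_anchor = np.array([1, 1, base_size, base_size]) - 1
    print(base_anchor)   # [ 0  0 15 15]: a 16x16 window with corners (0, 0) and (15, 15)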

myagmur01 commented 6 years ago

@shamanez Anchor scales and aspect ratios are explained in the Faster R-CNN paper. Check it out first, especially the experiments on the MS COCO dataset. Here is the key quote:

For the anchors, we use 3 aspect ratios and 4 scales (adding 64^2), mainly motivated by handling small objects on this dataset.

So they use 4 scales on the MS COCO dataset because it contains many small objects.

BussLightYear commented 5 years ago

I'm confused about something, if someone could help me... I'm currently working through https://github.com/EdjeElectronics/TensorFlow-Object-Detection-API-Tutorial-Train-Multiple-Objects-Windows-10, and the tutorial says the bounding boxes should be at least 33x33 pixels. I don't know why, and I would like to, because my bounding boxes are smaller than that. I checked some code related to anchor generation and found the same 16x16 anchor stride mentioned here. I think it is somehow related to the 33x33 minimum bounding-box size, but I don't know how. Thanks.