matterport / Mask_RCNN

Mask R-CNN for object detection and instance segmentation on Keras and TensorFlow

roi level calculation inside PyramidRoiAlign #471

Open JonathanCMitchell opened 6 years ago

JonathanCMitchell commented 6 years ago

Inside PyramidROIAlign, we determine the levels of the feature pyramid network to assign to the ROI in question.

The equation is from section 4.2 equation (1) of the FPN paper.

image_area = tf.cast(
    self.image_shape[0] * self.image_shape[1], tf.float32)
roi_level = log2_graph(tf.sqrt(h * w) / (224.0 / tf.sqrt(image_area)))
roi_level = tf.minimum(5, tf.maximum(
    2, 4 + tf.cast(tf.round(roi_level), tf.int32)))
roi_level = tf.squeeze(roi_level, 2)

The code comments say that a 224x224 ROI maps to level P4. However, when we feed those parameters into this equation for a 1024x1024 image:

K = 4 + log(2, sqrt(224×224)/(224/sqrt(1024×1024))) = 4 + 10 = 14. 
roi_level = minimum(5, 14) # so we set it to P5

That assigns the ROI to P5, because the computed value exceeds the maximum of 5. Therefore, if our ROI is larger than 224, it is automatically assigned to P5. The issue is that P5 has a really small spatial resolution (1/32 of the original image size along each side), and we are giving it the bulk of the ROIs. Or so it seems; maybe I am wrong.

Question (1): What are typical ROI sizes for a (1024, 1024, 3) image? Would these regions scale linearly if I reduce the input image dimension?

Question (2): If we are training at a lower resolution (say (256, 256, 3)), then scaling by 256 won't really work because it is wrapped in a log function, so wouldn't that be a nonlinear scale?

gustavz commented 6 years ago

also interested in this +1

derfe commented 6 years ago

+1

liuwenran commented 6 years ago

Note that h and w are normalized coordinates, so your equation should be:

K = 4 + log(2, sqrt(224/1024 × 224/1024)/(224/sqrt(1024×1024))) = 4 + 0 = 4.
roi_level = minimum(5, 4) # so we set it to P4
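To make the difference concrete, here is a minimal plain-Python sketch of the same arithmetic. It is not the repository's code: `roi_level` and its arguments are hypothetical names, and it assumes a single ROI on a square 1024x1024 image. It compares a normalized ROI size against a pixel size fed in by mistake.

```python
import math

def roi_level(h, w, image_height, image_width):
    """Pyramid level for one ROI whose h and w are in normalized (0..1) units.

    Hypothetical helper that mirrors the arithmetic of the snippet quoted
    in the question, using plain Python instead of TensorFlow ops.
    """
    image_area = float(image_height * image_width)
    level = math.log2(math.sqrt(h * w) / (224.0 / math.sqrt(image_area)))
    # Clamp to the range P2..P5, as in the quoted code.
    return min(5, max(2, 4 + int(round(level))))

# Normalized size: a 224 px ROI on a 1024 px image side.
print(roi_level(224 / 1024, 224 / 1024, 1024, 1024))  # 4

# Pixel size passed where a normalized size is expected: clamps to P5.
print(roi_level(224, 224, 1024, 1024))  # 5
```

With normalized inputs the ratio inside the log is exactly 1, so the 224x224 ROI lands on P4, matching the code comment.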

aashokvardhan commented 4 years ago

The issue below explains the reason for that: https://github.com/matterport/Mask_RCNN/issues/217

rcx986635 commented 4 years ago

If we assume h = w = ori_side/IMAGE_MAX_DIM and image_shape[0] = image_shape[1] = IMAGE_MAX_DIM, then 4 + log2(sqrt(h * w) / (224.0 / sqrt(image_area))) = 4 + log2((ori_side/IMAGE_MAX_DIM) / (224/IMAGE_MAX_DIM)) = 4 + log2(ori_side/224). This is the same as the equation in the FPN paper.
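A quick numerical check of that cancellation, as a sketch under the same assumptions (square image, square ROI). The function names `level_normalized` and `level_fpn` are hypothetical, and rounding and clamping are left out so that the raw values can be compared. It also suggests an answer to Question (2): the level depends only on the ROI side in pixels, not on the image resolution.

```python
import math

def level_normalized(side_px, image_max_dim):
    """Un-rounded level from the code's form, with h = w = side_px / image_max_dim."""
    h = w = side_px / image_max_dim
    image_area = float(image_max_dim * image_max_dim)
    return 4 + math.log2(math.sqrt(h * w) / (224.0 / math.sqrt(image_area)))

def level_fpn(side_px):
    """Un-rounded level from the pixel form, 4 + log2(side_px / 224)."""
    return 4 + math.log2(side_px / 224.0)

# The image dimension cancels out, so both forms should agree for any resolution.
for image_max_dim in (256, 512, 1024):
    for side_px in (56, 112, 224):
        a = level_normalized(side_px, image_max_dim)
        b = level_fpn(side_px)
        assert math.isclose(a, b, abs_tol=1e-9)
        print(image_max_dim, side_px, round(a, 6))
```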