Open JonathanCMitchell opened 6 years ago
also interested in this +1
+1
Note that h and w are normalized coordinates, so your equation should be k = 4 + log2(sqrt((224/1024) × (224/1024)) / (224 / sqrt(1024 × 1024))) = 4 + log2(1) = 4 + 0 = 4. Then roi_level = minimum(5, 4) = 4, so the ROI is assigned to P4.
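To make the difference concrete, here is a small sketch in plain Python rather than the TensorFlow graph code. The helper name `raw_roi_level` is hypothetical, and the second call (pixel sizes fed into the same expression) is only my guess at how a value above 5 could come out.

```python
import math

def raw_roi_level(h, w, image_height, image_width):
    """Hypothetical helper mirroring the expression quoted in this thread."""
    image_area = float(image_height * image_width)
    return 4 + math.log2(math.sqrt(h * w) / (224.0 / math.sqrt(image_area)))

# Normalized coordinates: a 224x224 ROI in a 1024x1024 image.
side = 224 / 1024
normalized_level = min(5, raw_roi_level(side, side, 1024, 1024))

# Pixel sizes fed into the same expression (assumed source of the confusion).
pixel_level = min(5, raw_roi_level(224, 224, 1024, 1024))

print(normalized_level)  # level with normalized h and w
print(pixel_level)       # level with pixel h and w, capped at 5
```

With normalized inputs the ratio inside the log is 1, so the level stays at 4. With pixel inputs the raw value is far above 5 and gets capped.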
The issue below explains the reason for that: https://github.com/matterport/Mask_RCNN/issues/217
If we assume h = w = ori_side / IMAGE_MAX_DIM and image_shape[0] = image_shape[1] = IMAGE_MAX_DIM, then sqrt(image_area) = IMAGE_MAX_DIM, and 4 + log2(sqrt(h * w) / (224.0 / sqrt(image_area))) = 4 + log2((ori_side / IMAGE_MAX_DIM) / (224 / IMAGE_MAX_DIM)) = 4 + log2(ori_side / 224). This is the same as the equation in the FPN paper.
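The cancellation can also be checked numerically. This is a minimal sketch assuming square images and square ROIs; the function names `level_normalized` and `level_paper` are hypothetical and not part of the repository.

```python
import math

def level_normalized(ori_side, image_dim):
    """Level from normalized h and w, as in the expression above."""
    h = w = ori_side / image_dim
    image_area = float(image_dim * image_dim)
    return 4 + math.log2(math.sqrt(h * w) / (224.0 / math.sqrt(image_area)))

def level_paper(ori_side):
    """Level from the pixel side length, as in the simplified form."""
    return 4 + math.log2(ori_side / 224.0)

# Largest disagreement between the two forms over a few sizes.
max_difference = 0.0
for ori_side in (56, 112, 224, 448, 896):
    for image_dim in (256, 512, 1024):
        diff = abs(level_normalized(ori_side, image_dim) - level_paper(ori_side))
        max_difference = max(max_difference, diff)

print(max_difference)
```

The image dimension drops out, so only the ROI's size in pixels decides the level.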
Inside PyramidROIAlign, we determine the levels of the feature pyramid network to assign to the ROI in question.
The equation is from section 4.2 equation (1) of the FPN paper.
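For reference, this is equation (1) as I read it in the paper, where w and h are the ROI width and height in pixels of the network input and k_0 = 4 is the level a 224×224 ROI maps to:

```latex
k = \left\lfloor k_0 + \log_2\!\left(\frac{\sqrt{w h}}{224}\right) \right\rfloor, \qquad k_0 = 4
```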
In the code comments it says that a 224x224 ROI will map to level P4. However, when we feed those params into this equation:
Then we assign roi_level to P5 because the result exceeds the maximum value of 5. So if our ROI is larger than 224, it is automatically assigned to P5. The problem is that P5 has a very low spatial resolution (1/32 of the original image shape), and we are giving it the bulk of the ROIs. Or so it seems; maybe I am wrong.
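One way to see which ROI sizes actually hit the cap is to sweep a few side lengths. This sketch assumes normalized h and w, uses floor as in the paper's equation, and clamps to the range P2..P5; the lower clamp and the helper name `assigned_level` are my assumptions, and the repository may round differently.

```python
import math

IMAGE_DIM = 1024  # assumed square input size

def assigned_level(roi_side_px, image_dim=IMAGE_DIM):
    """Hypothetical helper: clamped pyramid level for a square ROI."""
    h = w = roi_side_px / image_dim  # normalized coordinates
    image_area = float(image_dim * image_dim)
    raw = 4 + math.log2(math.sqrt(h * w) / (224.0 / math.sqrt(image_area)))
    # floor follows the paper; the clamp to [2, 5] is an assumption.
    return min(5, max(2, math.floor(raw)))

levels = {side: assigned_level(side) for side in (56, 112, 224, 300, 448, 896)}
for side, level in levels.items():
    print(f"{side:4d} px -> P{level}")
```

Under these assumptions a ROI has to reach roughly twice the 224-pixel reference side before it lands on P5.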
Question (1): What are typical ROI sizes for a (1024, 1024, 3) image? Would these regions scale linearly if I reduce the input image dimension?
Question (2): If we are training at a lower resolution (say (256, 256, 3)), then scaling by 256 won't really work because it is wrapped in a log function, so wouldn't that be a nonlinear scale?
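A rough way to probe Question (2) is to keep the ROI at a fixed fraction of the image and change only the input resolution. This is a sketch under the normalized-coordinate reading of the expression, with a hypothetical helper name and no rounding or clamping applied.

```python
import math

def raw_level_for_fraction(fraction, image_dim):
    """Hypothetical helper: unclamped level for a square ROI covering
    `fraction` of the side of a square image."""
    h = w = fraction
    image_area = float(image_dim * image_dim)
    return 4 + math.log2(math.sqrt(h * w) / (224.0 / math.sqrt(image_area)))

fraction = 224 / 1024  # the same relative ROI size at every resolution
raw_levels = {dim: raw_level_for_fraction(fraction, dim) for dim in (256, 512, 1024)}
for dim, level in raw_levels.items():
    print(dim, level)
```

Each halving of the resolution lowers the raw level by one, so the log turns a multiplicative change in image size into an additive shift in level.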