google-research / receptive_field

Compute receptive fields of your favorite convnets
Apache License 2.0

receptive field larger than the input size. How should I interpret this? #6

Closed SorourMo closed 4 years ago

SorourMo commented 4 years ago

Hi @andrefaraujo, thank you for your informative blog post and for sharing your code. I have a semantic segmentation model implemented in Keras (let's call it model_A). It is a fully convolutional network with many convolution, pooling, and transposed-convolution layers. I used your code to compute its receptive field. Since I got an error regarding Conv2DTranspose layers, I disabled the upsampling branch and considered only the first half of the network for now. I also disabled the batchnorm and dropout layers to avoid errors.

The results of the code are as follows: `print(rf_x, rf_y, eff_stride_x, eff_stride_y, eff_pad_x, eff_pad_y)` gives `318 318 32 32 143 143`.
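For readers wondering where numbers like these come from: the receptive field, effective stride, and effective padding can be accumulated layer by layer with a simple recurrence. Here is a minimal pure-Python sketch; the layer stack below is illustrative, not model_A's actual architecture:

```python
def accumulate_rf(layers):
    """Accumulate receptive field (rf), effective stride (j), and
    effective padding (pad) over a stack of layers.

    Each layer is a (kernel_size, stride, padding) tuple.
    """
    rf, j, pad = 1, 1, 0
    for k, s, p in layers:
        rf += (k - 1) * j   # each layer widens the RF by (k-1) * current stride
        pad += p * j        # padding accumulates, scaled by the current stride
        j *= s              # strides multiply down the stack
    return rf, j, pad

# Illustrative VGG-style stack: 3x3 convs (stride 1, pad 1) and 2x2 pools.
layers = [(3, 1, 1), (2, 2, 0)] * 2
print(accumulate_rf(layers))  # -> (10, 4, 3)
```

A deeper stack with more pooling stages produces values like the 318 / 32 / 143 reported above.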

Imagine I train/test model_A with an input size of in_s = 200 (or any number less than 318). What exactly does this mean in that case? Does it mean that model_A takes the entire input image into account while extracting features?
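One way to check this concretely: from the computed receptive field, effective stride, and effective padding, the input pixels that influence output feature `u` span `[u * stride - pad, u * stride - pad + rf - 1]`. A quick sketch using the numbers above shows that central output features indeed see every pixel of a 200-pixel input, while border features see a clipped region:

```python
def input_region(u, rf=318, stride=32, pad=143, input_size=200):
    """Input-pixel interval [start, end] (inclusive) seen by output
    feature u, clipped to the image bounds."""
    start = u * stride - pad
    end = start + rf - 1
    return max(start, 0), min(end, input_size - 1)

# A central output feature covers the whole 200-pixel image:
print(input_region(4))  # -> (0, 199)
# A feature near the border sees a truncated region:
print(input_region(0))  # -> (0, 174)
```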

Another question: how would including the upsampling branch of the network affect the receptive field? Would it be any different, given that the input and output sizes of model_A are equal (as in any other segmentation network)? Thank you.

P.S. In case you're curious about the Conv2DTranspose error, the following line shows what is raised: `ValueError: Unknown layer for operation 'conv2d_transpose/conv2d_transpose': Conv2DBackpropInput`

andrefaraujo commented 4 years ago

Regarding the first question: the receptive field size does not depend on the input image size; it depends only on the layers used in the network. So, for whatever input resolution you use, you will get the same numbers. For small images, the receptive field may indeed cover the whole image; however, this does not usually mean that each pixel in that region contributes equally. To quote our Distill paper:

> In the most recent networks, the receptive field usually covers the entire input image: this means that the context used by each feature in the final output feature map includes all of the input pixels.
>
> Note that a given feature is not equally impacted by all input pixels within its receptive field region: the input pixels near the center of the receptive field have more “paths” to influence the feature, and consequently carry more weight. The relative importance of each input pixel defines the effective receptive field of the feature. Recent work [17] provides a mathematical formulation and a procedure to measure effective receptive fields, experimentally observing a Gaussian shape, with the peak at the receptive field center.
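The "more paths near the center" intuition can be demonstrated numerically: repeatedly convolving a uniform (box) kernel with itself counts the paths from each input pixel to a single output feature, and the resulting profile peaks at the center and tends toward a Gaussian. A small pure-Python illustration, not tied to any particular network:

```python
def convolve(a, b):
    """Full 1-D discrete convolution of two lists."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out

# Path counts for a stack of 3-tap uniform conv layers: start from one
# layer's kernel and convolve it with itself once per additional layer.
kernel = [1.0, 1.0, 1.0]
weights = kernel
for _ in range(3):  # 4 layers total
    weights = convolve(weights, kernel)

# The central input pixel has the most paths to the output feature;
# pixels at the receptive-field border have the fewest.
center = len(weights) // 2
print(weights)                       # -> [1, 4, 10, 16, 19, 16, 10, 4, 1]
print(weights[center] > weights[0])  # -> True
```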

For the second question: it depends on what the upsampling layer is doing. If it performs nearest-neighbor upsampling, i.e., simply copies each input feature multiple times, the RF does not change (this can be thought of as having a kernel size of 1). If it interpolates between pixels, then it behaves like a convolutional layer and the receptive field naturally increases (this can be thought of as having a kernel size equal to the number of input features used in the interpolation).
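To make this concrete, the usual layer-by-layer receptive-field recurrence can be extended with fractional strides: an upsampling layer with factor f contributes stride 1/f, and its effective kernel size is 1 for nearest-neighbor copying versus 2 for linear interpolation. A hedged sketch (the exact kernel size depends on the interpolation scheme; the layer stack is illustrative):

```python
def accumulate_rf(layers):
    """Accumulate receptive field and effective stride over layers given
    as (kernel_size, stride) tuples; upsampling by factor f is modeled
    with the fractional stride 1/f."""
    rf, j = 1.0, 1.0
    for k, s in layers:
        rf += (k - 1) * j  # RF widens by (k-1) * current effective stride
        j *= s             # strides multiply (fractional for upsampling)
    return rf, j

downsampling = [(3, 1), (2, 2), (3, 1), (2, 2)]  # effective stride -> 4

# Nearest-neighbor x4 upsample: kernel size 1, so the RF is unchanged.
rf_nn, j_nn = accumulate_rf(downsampling + [(1, 0.25)])
# Linear-interpolation x4 upsample: kernel size 2, so the RF grows.
rf_lin, j_lin = accumulate_rf(downsampling + [(2, 0.25)])

print(rf_nn, j_nn)    # -> 10.0 1.0
print(rf_lin, j_lin)  # -> 14.0 1.0
```

Either way, the output grid returns to the input resolution (effective stride 1), but only the interpolating upsampler enlarges the receptive field.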