vlfeat / matconvnet

MatConvNet: CNNs for MATLAB

Why there is a '-1' in the output size formulation of vl_nnconvt #294


barisgecer commented 8 years ago

Hi, I was wondering why there is a '-1' in the output size formulation of vl_nnconvt:

YH = UPH (XH - 1) + FH - CROPTOP - CROPBOTTOM
YW = UPW (XW - 1) + FW - CROPLEFT - CROPRIGHT

After a bit of digging, I suspect there is a bug in matconvnet/matlab/src/bits/nnconv.cu, line 189:

for (int image = 0 ; image < data.getSize() ; ++image)

Shouldn't it be either of the two ways below, to cover all the pixels:

for (int image = 0 ; image <= data.getSize() ; ++image)
for (int image = 0 ; image < data.getSize() ; image++)

Or maybe I am confused. If that is the case, can you please explain why there is a '-1' before up-sampling?

P.S.: This might seem unimportant, but I am trying to match the size of my labels with the output of an FCN.

vedaldi commented 8 years ago

Hi, it is indeed a bit messy. It is all explained here:

http://www.vlfeat.org/matconvnet/matconvnet-manual.pdf

See in particular chapter 5 and section 5.3.

I also suggest using the DagNN.print() functionality, e.g.

DagNN.print({'input', [224 224 3 1]}, 'Dependencies', true)

should give you all the information required to do this alignment (you may need to replace 'input' with the name of your input variable). In particular, look at the sizes of the layers as well as the strides and offsets.

Due to various quantisations in the architecture, it is a bit difficult to get right. In the Caffe FCN implementation they use an "adaptable" crop layer that helps a bit by matching the sizes of different layers (this is implemented as the dagnn.Crop layer if you want to use it). However, I recommend trying to figure out the alignment "properly" first, to understand what is going on.

I was wondering why there is a '-1' in the output size formulation of vl_nnconvt:

YH = UPH (XH - 1) + FH - CROPTOP - CROPBOTTOM
YW = UPW (XW - 1) + FW - CROPLEFT - CROPRIGHT

These formulas are (very likely) correct. E.g., if your input image is 1 pixel wide (XH = 1) and your interpolating filter is FH pixels wide, then the output will have FH pixels (minus cropping).
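As a concrete check, here is a minimal sketch of these formulas in action (my own example, assuming MatConvNet is compiled and on the MATLAB path):

% Case 1: a 1x1 input, as in the example above. The output is FH x FW (here
% 4x4) regardless of the upsampling factor, because of the (XH - 1) term.
x = ones(1, 1, 1, 1, 'single') ;
f = ones(4, 4, 1, 1, 'single') ;               % FH = FW = 4, single channel
y = vl_nnconvt(x, f, [], 'Upsample', 2) ;
size(y)                                        % 2*(1-1) + 4 = 4 per dimension

% Case 2: a 3x3 input with crop [CROPTOP CROPBOTTOM CROPLEFT CROPRIGHT].
x = ones(3, 3, 1, 1, 'single') ;
y = vl_nnconvt(x, f, [], 'Upsample', 2, 'Crop', [1 1 1 1]) ;
size(y)                                        % 2*(3-1) + 4 - 1 - 1 = 6 per dimension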

Good luck!

barisgecer commented 8 years ago

Hi Vedaldi,

Thank you for the clear explanation. I understand now why we have the '-1' there. But I still have doubts about how I should align the prediction and the label.

Say we have an 'm' by 'n' input. When I use the 'print' function with this input size, I can see what the size of the prediction would be, which is also the expected label size (say 'Pm' by 'Pn'). Should I simply resize (downscale) the associated label (which is 'm' by 'n') to 'Pm' by 'Pn'?

Second, I thought cropping should be as follows:

CROPTOP = CROPBOTTOM = FH / 2
CROPLEFT = CROPRIGHT = FW / 2

so as to keep only the necessary information in the prediction and align properly. Yet in the sample FCN model, the crop is set to [16 16 16 16] where the filter size is [64 64]. Why is it done like this? Don't you think the crop should be set to [32 32 32 32]?

Regards.

vedaldi commented 8 years ago

Hi,

Say we have an 'm' by 'n' input. When I use the 'print' function with this input size, I can see what the size of the prediction would be, which is also the expected label size (say 'Pm' by 'Pn'). Should I simply resize (downscale) the associated label (which is 'm' by 'n') to 'Pm' by 'Pn'? Second, I thought cropping should be as follows:

CROPTOP = CROPBOTTOM = FH / 2
CROPLEFT = CROPRIGHT = FW / 2

so as to keep only the necessary information in the prediction and align properly. Yet in the sample FCN model, the crop is set to [16 16 16 16] (https://github.com/vlfeat/matconvnet-fcn/blob/master/fcnInitializeModel.m#L72) where the filter size is [64 64]. Why is it done like this? Don't you think the crop should be set to [32 32 32 32]?

While this is about right, it is slightly more complicated than that. First, you need a definition of what it means to properly align feature maps. In our examples (and in FCN) we assume that this means that the receptive fields (RFs) of the two features, back-projected onto the input image (or some common parent layer), match. This in turn means that they should have corresponding centres, which requires the centre offsets and strides to be the same.

Now, unfortunately, the geometry of the RFs (size, offsets, and strides) depends on all the layers in between. The value of the crop that you need should be selected to compensate for all these effects. So the easiest thing is to use print() to figure out the RF geometry before the crop layer, and then set the parameters of the latter to make the RFs match.
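A rough sketch of that workflow (the model file name is a placeholder; the 'Dependencies' option is the one mentioned earlier, and the crop value is read off the printout by hand):

% Load a trained DagNN model (hypothetical file name).
net = dagnn.DagNN.loadobj(load('net-fcn.mat')) ;

% Print variable sizes together with receptive-field sizes, strides and
% offsets back-projected onto the input image.
net.print({'input', [224 224 3 1]}, 'Dependencies', true) ;

% From the printout, note the RF stride and offset of the variable entering
% the crop/deconvolution layer and of the variable it must align with, then
% choose the crop so that the two RF centres coincide (equal offsets, with
% strides already matched by the upsampling factor).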

To make things even more complex, there is no guarantee that a perfect RF collimation can be achieved. First, strides can be changed only by integer multipliers (or divisors, depending on the layer), but this is usually not an obstacle in practice. Second, depending on the strides and offsets of the layers that you want to compensate for, there may not be a value of the cropping in the crop/deconvolution layer that solves the problem exactly.

If you fiddle enough with the architecture, adjusting the layers in between in addition to the crop/deconvolution layer, this can be achieved, but it took me a while to figure out (see our example fcnInitializeModel16s compared to the FCN network). The FCN implementation achieves only approximate collimation, but this is probably not a problem, as the slight misalignment of the feature fields should be something that learning can compensate for.
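As a side note on the [16 16 16 16] vs [32 32 32 32] question: assuming the deconvolution in that example upsamples by a factor of 32 (not stated in this thread, but the standard FCN-32s setting), plugging the numbers into the output-size formula shows that a crop of 16 per side at least keeps the output an exact 32x multiple of the input, whereas FH / 2 = 32 per side would fall one full stride short:

X = 16 ;                              % example spatial size entering the deconvolution
Y16 = 32*(X - 1) + 64 - 16 - 16       % = 32*X        with crop [16 16 16 16]
Y32 = 32*(X - 1) + 64 - 32 - 32       % = 32*X - 32   with crop [32 32 32 32]

This is only the size bookkeeping; the receptive-field centre argument above is still what actually determines the correct crop.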

barisgecer commented 8 years ago

What in the output of dagnn.print for this example made you think that you should use [16 16 16 16]?

I like to keep the padding such that the input and output of each layer have the same size (unless there is a stride). So currently, I just compute the aggregate stride of all layers except the deconv layer, and downsample my label (which has the same size as the input) accordingly. Up to this point, I know which pixel corresponds to which RF in the original input, which makes life easier.

Then I would like to upsample it and keep the cropping equal to the filter size, so that redundant pixels of the prediction are cropped out, as they would be a weighted aggregation of fewer pixels compared with the central region (i.e. with no cropping, the boundary pixels of the deconv layer's output would be affected only by the boundary of its input). In this way, the label and the prediction have exactly the same size. (One tiny detail: I had to ignore one pixel before upsampling because of that '-1'.)
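To make that concrete, here is a rough sketch of the bookkeeping just described (all names and numbers are hypothetical stand-ins):

label = randi(21, 384, 512) ;          % stand-in m-by-n label map (21 classes)
strides = [2 2 2 2 2] ;                % per-layer strides, read off with print()
downFactor = prod(strides) ;           % aggregate stride before the deconv layer

% Nearest-neighbour subsampling so class labels are not interpolated; each
% kept pixel corresponds to the centre of one receptive field in the input.
labelDown = label(1:downFactor:end, 1:downFactor:end) ;
size(labelDown)                        % ceil([384 512] / 32) = [12 16]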

I understand that it is really hard to align perfectly, but I think cropping is crucial to make the alignment logical, so that learning can be done effectively. Do you think the above alignment would make sense?