pierluigiferrari / ssd_keras

A Keras port of Single Shot MultiBox Detector
Apache License 2.0

Why is the model unable to predict absolute bbox coordinates? #127

Closed: jozefmorvay closed this issue 6 years ago

jozefmorvay commented 6 years ago

This is not an issue, merely a question. First and foremost, the implementation is really useful, and thank you for making it public. Now, as for the question I have - in the main readme.md you state this:

"In order to be able to predict absolute box coordinates, the convolutional layers responsible for localization would need to produce different output values for the same object instance at different locations within the input image. This isn't possible of course: For a given input to the filter of a convolutional layer, the filter will produce the same output regardless of the spatial position within the image because of the shared weights."

I have read this paragraph (and the ones preceding and succeeding it) many times over, but still can't understand where the limitation comes from. How can any filter produce the same output regardless of the location of the object instance? Which filters do you even mean - the ones in the modified VGG, or the ones that are built on top of some of the layers to make predictions? What do you mean by shared weights? Please elaborate, I am really interested in understanding this in depth, and the official paper is rather terse with actual intuitive explanations of their work.

This is not the right place to ask this, I understand. Alas, I have no other channel to reach you. We can of course move this elsewhere if you deem it necessary.

pierluigiferrari commented 6 years ago

How can any filter produce the same output regardless of the location of the object instance?

That's exactly the point of a convolution. A fixed convolutional kernel (or filter) is convolved with the input, which means taking the (Euclidean) inner product between the filter and the local input patch at each spatial location. Hence, if the input is the same at two distinct spatial locations, then the resulting output is also the same.

As an example, assume a red car is in the top left corner of an image and the exact same red car appears a second time in the bottom right corner of said image. Assume that as the kernel of a given convolutional layer slides across the input, it covers the two instances of the red car the same way. Then those elements of the output feature map that represent the two red cars will be identical for both instances of the car. Is that clear?

If that is clear, then it follows immediately that an object detector consisting of only convolutional layers cannot predict the absolute coordinates of the detected objects. The absolute box coordinates for the two red cars need to be different, but any convolutional feature map can only produce the same output for both.
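To make this concrete, here is a minimal NumPy/SciPy sketch (an editorial illustration, not code from this repository; the 3x3 "car" pattern and the single-channel 14x14 image are made up for the example). The same pattern placed in two corners of the image produces exactly the same value at the two corresponding positions of the feature map:

```python
import numpy as np
from scipy.signal import correlate2d

# A small "red car" pattern placed in two corners of a 14x14 single-channel image.
car = np.array([[0., 1., 0.],
                [1., 1., 1.],
                [1., 0., 1.]])

image = np.zeros((14, 14))
image[0:3, 0:3] = car        # car in the top-left corner
image[11:14, 11:14] = car    # the same car in the bottom-right corner

# One 3x3 filter with shared weights, slid across the whole image ('valid' mode, no padding).
kernel = np.random.default_rng(0).normal(size=(3, 3))
feature_map = correlate2d(image, kernel, mode='valid')   # shape (12, 12)

# The output values at the two car locations are identical: the filter's output value
# alone cannot tell you *where* in the image the car is.
print(np.isclose(feature_map[0, 0], feature_map[11, 11]))   # True
```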

Which filters do you even mean - the ones in the modified VGG, or the ones that are built on top of some of the layers to make predictions?

The statement is true for the filters of any convolutional layer.

What do you mean by shared weights?

"Shared weights" means that the input features at each spatial location will be multiplied with the same kernel / filter. The kernel is shared across all spatial locations. This is what a convolution is.

jozefmorvay commented 6 years ago

I now partially understand. Props for the quick answer. Still, consider this picture:

[Image: feature map illustration taken from the article http://cv-tricks.com/object-detection/single-shot-multibox-detector-ssd/]

Suppose one of the said red cars occupies the top-left 5x5 square of the 14x14 input image (ignoring, for the sake of the example, that you need multiple channels for RGB). Another car occupies the bottom-right corner, again 5x5. So when those cars are convolved with a filter that responds to the 'red car' class, you could still tell where those cars occurred based on the receptive field of the resulting pixel/number in the feature map used for classification, such as the one right above feat-map1. You can trace that single top-leftmost pixel back to the 5x5 square where the car is, and likewise for the other car.

I am likely missing something vital, but it appears as though you should be able to predict absolute pixel coordinates without using anchors and offsets.

pierluigiferrari commented 6 years ago

What you are describing is equivalent to the concept of anchor boxes. Either way you need some meta information that tells you how to translate the information "here is a red car" into the absolute location within the image. The point is that the mere output of the convolution itself does not contain this information. As illustrated above, the resulting output feature map values for any instances of that red car are all the same, regardless of the spatial location. Using the fact that you know where a given element of the output feature map is located spatially to compute the absolute location of the object is exactly what the anchor boxes do.
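To illustrate that meta information, here is a hedged sketch (not this repository's actual decoder: the 300x300 input, the 10x10 grid, the single square anchor per cell, and the omission of variance scaling are all simplifications). The cell index supplies the anchor's absolute position, and the network's raw outputs are only interpreted relative to it:

```python
import numpy as np

img_size = 300.0                   # assume a 300x300 input image
fmap_size = 10                     # assume a 10x10 predictor feature map
anchor_w, anchor_h = 60.0, 60.0    # assume one 60x60 anchor per cell

def decode(cell_row, cell_col, offsets):
    """Turn raw predictions (dcx, dcy, dw, dh) at one cell into absolute pixel coordinates."""
    cell = img_size / fmap_size
    # The anchor's absolute center comes from WHERE the cell sits in the grid --
    # this is the meta information that the convolution's output values do not carry.
    acx = (cell_col + 0.5) * cell
    acy = (cell_row + 0.5) * cell
    dcx, dcy, dw, dh = offsets
    cx, cy = acx + dcx * anchor_w, acy + dcy * anchor_h
    w, h = anchor_w * np.exp(dw), anchor_h * np.exp(dh)
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)   # xmin, ymin, xmax, ymax

# The two red cars produce identical raw outputs, but they sit in different cells,
# so they decode to different absolute boxes:
print(decode(0, 0, (0.1, 0.0, 0.0, 0.0)))   # box near the top-left
print(decode(9, 9, (0.1, 0.0, 0.0, 0.0)))   # box near the bottom-right
```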

pierluigiferrari commented 6 years ago

The question in the title is wrong by the way, and represents some of the confusion you have: The model is not unable to predict absolute box coordinates. The model obviously does predict absolute box coordinates. A convolutional layer is unable to predict absolute box coordinates.

jozefmorvay commented 6 years ago

The model can do so, of course; the caption is simply a reworded sentence taken out of context from the rest of the readme:

"This may or may not be obvious to you, but it is important to understand that it is not possible for the model to predict absolute coordinates for the predicted bounding boxes".

Conv. layers are obviously limited to predicting anchors and classes. Is it right to assume that it is not outright impossible, but simply implausible, for a theoretical model (not the modified VGG, but something far deeper and more computationally intensive) to predict absolute coordinates in the form of the anchor boxes themselves? Suppose you don't have to resize the input image to a 300 or 512 pixel edge length, but can leave the picture at thousands of pixels per edge, and can afford to stack a huge number of 3x3 convolutions, each layer reducing the edge length by only 2 pixels (no padding). That way, you could access basically all possible anchor boxes in the image and would have no need for bbox offsets. Is this assumption correct? It has no grounding in the real world, clearly, but still.

pierluigiferrari commented 6 years ago

The model can do so, of course; the caption is simply a reworded sentence taken out of context from the rest of the readme:

Good point, I need to make the wording at that point more precise.

Conv. layers are obviously limited to predicting anchors and classes. Is it right to assume that it is not outright impossible, but simply implausible, for a theoretical model (not the modified VGG, but something far deeper and more computationally intensive) to predict absolute coordinates in the form of the anchor boxes themselves? Suppose you don't have to resize the input image to a 300 or 512 pixel edge length, but can leave the picture at thousands of pixels per edge, and can afford to stack a huge number of 3x3 convolutions, each layer reducing the edge length by only 2 pixels (no padding). That way, you could access basically all possible anchor boxes in the image and would have no need for bbox offsets. Is this assumption correct? It has no grounding in the real world, clearly, but still.

If I understand your described construction correctly, then no, the assumption is not correct. It doesn't matter how many sequential convolutional layers you stack: their output still does not depend on where within the input feature map a pattern occurs, and in order to get absolute coordinates you need meta information (i.e. the spatial location of a feature map element), which, as said before, is what the anchor boxes are for.
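As a small extension of the earlier single-layer sketch (again just an illustration with made-up sizes, not repository code): stacking several unpadded 3x3 convolutions still produces identical values for identical patterns at different positions, so depth alone never adds absolute position information.

```python
import numpy as np
from scipy.signal import correlate2d

rng = np.random.default_rng(1)

def conv_relu(x, k):
    """One 3x3 'valid' convolution (no padding) followed by ReLU, with shared weights."""
    return np.maximum(correlate2d(x, k, mode='valid'), 0.0)

# The same pattern at two different positions of a larger single-channel input.
pattern = rng.normal(size=(3, 3))
x = np.zeros((32, 32))
x[4:7, 4:7] = pattern        # first occurrence
x[22:25, 22:25] = pattern    # second occurrence, shifted by (18, 18)

# Stack three conv layers; each one shrinks the map by 2 pixels per edge.
out = x
for kernel in (rng.normal(size=(3, 3)) for _ in range(3)):
    out = conv_relu(out, kernel)

# After three layers the receptive field of out[i, j] is x[i:i+7, j:j+7], so
# out[4, 4] and out[22, 22] see the pattern at the same relative position.
# Their values are identical: depth adds context, not absolute position.
print(np.isclose(out[4, 4], out[22, 22]))   # True
```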

jozefmorvay commented 6 years ago

What I meant was that with many conv layers, you could have access to so many anchor boxes with so many receptive fields that one of them would be bound to be a good match for the ground truth box. Although they would always be square in shape, whereas a GT box can be of any shape, so for that you would still need some sort of offset. I am going to debug the code step by step to better understand what's going on.

One more crucial thing remains unclear to me - how can the network learn proper offsets for the anchor box? It baffles me quite a lot - it's just convolutions, after all, nothing more. Returning to the red car example, let's suppose we have two training images of red cars. In the first image, the car is in the top left part of the image. There would be an anchor box predicted and selected that best matches the ground truth box. So we have our offsets. Now the second training example comes along with the car in the top left, but a few pixels to the right and with a slightly different GT box, and again there would be a best match selected, but with a different offset. How can convolutions accomplish this? At inference time, a third red car image is passed through the network, but this time with the car in the bottom right corner. How in the world can the network know how to predict offsets, when all it has available is the anchor box of the car, and so far it has only seen red cars in the top left corner of images?

pierluigiferrari commented 6 years ago

At inference time, a third red car image is passed through the network, but this time with the car in the bottom right corner. How in the world can the network know how to predict offsets, when all it has available is the anchor box of the car, and so far it has only seen red cars in the top left corner of images?

There is a fundamental lack of understanding of both convolutions and the concept of anchor boxes here.

  1. Regarding convolutions: A convolutional layer slides some filter across a number of spatial positions of the layer's input. If a certain pattern (e.g. representing a red car) occurs in two distinct spatial positions in the input (but with the same position relative to the receptive fields of the respective elements in the output feature map), then that pattern will produce the same output in both spatial positions of the output feature map. This also means that, during training, both occurrences contribute the same gradient (locally, at least) to the filter weights. Hence, in order to train this filter to recognize certain patterns (red cars), it doesn't matter in which spatial sector of the input the pattern occurs; the filter will be adjusted the same way regardless (see the short sketch after this list). You could train a network with red cars appearing only in the upper left area of the training images and the trained network would still be able to detect similar red cars at every other position in an image, because it is the same filter convolving with the input at every spatial position. If this isn't clear, you should spend some time understanding convolutions, because this very relationship is the reason why convolutions are useful for image data.
  2. Regarding anchor boxes: The trained model doesn't have an anchor box "available". The trained weights of the network's localization layers don't even know that anchor boxes exist. They were just trained to output certain numbers whenever a certain pattern occurs in the input. One of these numbers happens to be a horizontal offset with respect to the center point of an imaginary reference box (i.e. an anchor box), another happens to be the width relative to this imaginary reference box.
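Picking up point 1, here is the promised sketch (an illustration, not repository code): for a single linear convolution, the gradient of an output value with respect to the kernel is just the input patch under that output position, so two identical patches at different positions push the filter weights in exactly the same direction.

```python
import numpy as np

# For a single linear convolution, output[i, j] = sum(kernel * image[i:i+3, j:j+3]),
# so the gradient of output[i, j] with respect to the kernel is simply the input
# patch image[i:i+3, j:j+3].
car = np.array([[0., 1., 0.],
                [1., 1., 1.],
                [1., 0., 1.]])
image = np.zeros((14, 14))
image[0:3, 0:3] = car          # car in the top-left corner
image[11:14, 11:14] = car      # the same car in the bottom-right corner

grad_top_left = image[0:3, 0:3]          # d(output[0, 0]) / d(kernel)
grad_bottom_right = image[11:14, 11:14]  # d(output[11, 11]) / d(kernel)

# Identical patches -> identical gradients -> identical weight updates,
# no matter where in the image the pattern appeared during training.
print(np.array_equal(grad_top_left, grad_bottom_right))   # True
```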

One more crucial thing remains unclear to me - how can the network learn proper offsets for the anchor box? It baffles me quite a lot - it's just convolutions, after all, nothing more.

Taking up the second point above, the convolutional localization layers are just trained to output some numbers based on their input. Whatever numbers you give them as ground truth for a given input pattern, those are the numbers they will be trained to output when that pattern appears. If the center point of the ground truth box for an object deviates by a certain number of pixels horizontally from the center point of an imaginary anchor box that some convolutional filter is supposed to learn, then you can "give" this convolutional filter that deviation as the ground truth, and it will learn that it has to output this number for this input pattern. The convolutional layer simply learns to output whatever you tell it to via the ground truth you provide it with, provided that this is possible (i.e. the ground truth isn't self-contradictory).
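As a rough sketch of how such a deviation is handed to a localization filter as its regression target (a simplified SSD-style encoding with made-up numbers; it leaves out variance scaling and the anchor matching step, so it is not this repository's exact input encoder):

```python
import numpy as np

def encode(gt_box, anchor_box):
    """Both boxes given as (cx, cy, w, h) in pixels; returns the offset targets."""
    gcx, gcy, gw, gh = gt_box
    acx, acy, aw, ah = anchor_box
    return ((gcx - acx) / aw,    # horizontal center offset, in anchor widths
            (gcy - acy) / ah,    # vertical center offset, in anchor heights
            np.log(gw / aw),     # log-scale width ratio
            np.log(gh / ah))     # log-scale height ratio

# Two training images: the same red car, a few pixels apart, matched to the same anchor.
anchor = (45.0, 45.0, 60.0, 60.0)
print(encode((40.0, 42.0, 70.0, 50.0), anchor))   # car slightly up and to the left
print(encode((48.0, 42.0, 70.0, 50.0), anchor))   # the same car a few pixels further right
# The filter is simply trained to output whichever numbers the encoding produces for
# the pattern it sees; it never needs to know the absolute position itself.
```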

jozefmorvay commented 6 years ago

Thanks.