pifalken / YOLOv3-GradCAM

GradCAM algorithm implementation for YOLOv3

Output size of `extractor.forward_pass()` #7

Open semihcanturk opened 4 years ago

semihcanturk commented 4 years ago

It seems to me that a direct forward pass via `model(x)` and the extractor's forward pass through `forward_pass_on_convolutions(x)` give outputs of different sizes.

`forward_pass_on_convolutions(x)` outputs a tensor of size `(1, 477360)`, which is the flattened form of `(1, 3, 36, 52, 85)` -> `(1, 5616, 85)` -> `(1, 477360)`.
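For concreteness, this reshape chain can be reproduced with plain tensor ops (a minimal sketch; the real tensor comes out of the extractor, here I just fake one of the same shape):

```python
import torch

# Stand-in for the extractor output at the finest scale:
# (batch, anchors, grid_h, grid_w, 5 + classes)
raw = torch.zeros(1, 3, 36, 52, 85)

merged = raw.view(1, -1, 85)  # (1, 3 * 36 * 52, 85) = (1, 5616, 85)
flat = merged.view(1, -1)     # (1, 5616 * 85)       = (1, 477360)

print(merged.shape, flat.shape)
```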

However, using `model_output = self.model(x)` gives multiple outputs: `model_output[0]` has shape `(1, 7371, 85)`, as opposed to the `(1, 5616, 85)` we previously obtained. I turned to `model_output[1]`, which is a list of size 3, to understand what's going on:

`model_output[1][0].shape` -> `(1, 3, 9, 13, 85)` -> `(1, 351, 85)`
`model_output[1][1].shape` -> `(1, 3, 18, 26, 85)` -> `(1, 1404, 85)`
`model_output[1][2].shape` -> `(1, 3, 36, 52, 85)` -> `(1, 5616, 85)`: this is what `forward_pass_on_convolutions(x)` returns.

Now, concatenating these along axis 1 gives us `(1, 351 + 1404 + 5616, 85)` -> `(1, 7371, 85)`: exactly the shape of `model_output[0]`.
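To sanity-check that arithmetic, here is a small sketch with dummy tensors of the shapes observed above, showing that concatenating the three per-scale outputs along axis 1 reproduces the shape of `model_output[0]`:

```python
import torch

# Dummy per-scale outputs with the shapes reported above
scales = [
    torch.zeros(1, 3, 9, 13, 85),   # coarsest grid
    torch.zeros(1, 3, 18, 26, 85),
    torch.zeros(1, 3, 36, 52, 85),  # finest grid: what the extractor returns
]

# Collapse (anchors, grid_h, grid_w) into one prediction axis per scale,
# then concatenate the scales along that axis
per_scale = [s.view(1, -1, 85) for s in scales]  # (1, 351, 85), (1, 1404, 85), (1, 5616, 85)
combined = torch.cat(per_scale, dim=1)

print(combined.shape)  # torch.Size([1, 7371, 85]), matching model_output[0]
```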

The YOLOv2/YOLO9000 paper mentions the following:

> **Fine-Grained Features.** This modified YOLO predicts detections on a 13 × 13 feature map. While this is sufficient for large objects, it may benefit from finer grained features for localizing smaller objects. Faster R-CNN and SSD both run their proposal networks at various feature maps in the network to get a range of resolutions. We take a different approach, simply adding a passthrough layer that brings features from an earlier layer at 26 × 26 resolution.

I infer from this that a similar mechanism is at work here: results from 3 different resolutions are produced as outputs and concatenated into the `(1, 7371, 85)` tensor. However, `forward_pass_on_convolutions(x)` only provides the output of the 3rd resolution, hence its equality with `model_output[1][2].shape` -> `(1, 5616, 85)`.

In light of these observations, I have two questions:

1) Why does `forward_pass_on_convolutions(x)` not include the outputs of the other resolutions? It seems that in the current setting we are backpropagating with incomplete target outputs (the target outputs we generate in `generate_cam` also have shape `(1, 5616, 85)`).

2) As a workaround, I tried generating 3 target tensors with sizes corresponding to the 3 resolutions, but only the one of size `(1, 5616, 85)` can be backpropagated; the others expectedly fail on `model_output.backward()` due to size incompatibility. How can I work around this so that the other resolutions can be backpropagated as well? (A sketch of the kind of multi-scale backward I have in mind follows this list.)
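For reference, one way I could imagine combining the scales is to reduce each one to a scalar and sum before a single `backward()` call; a minimal sketch, assuming `per_scale_outputs` is the `model_output[1]` list above and the per-scale targets are hypothetical masks analogous to the one built in `generate_cam`:

```python
import torch

def multi_scale_backward(per_scale_outputs, per_scale_targets):
    # Reduce each scale to a scalar, sum the scalars, and backpropagate once,
    # so a single backward pass populates gradients for all three scales.
    total = sum((out * tgt).sum()
                for out, tgt in zip(per_scale_outputs, per_scale_targets))
    total.backward()

# Toy usage with the shapes observed above
outs = [torch.randn(1, 3, h, w, 85, requires_grad=True)
        for h, w in [(9, 13), (18, 26), (36, 52)]]
tgts = [torch.zeros_like(o) for o in outs]
for t in tgts:
    t[..., 4] = 1.0  # e.g. target the objectness channel at every cell
multi_scale_backward(outs, tgts)
```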

Many thanks for the help in advance.

wllkk commented 4 years ago

Hi, how do you fix this problem? `x = x + layer_outputs[mdef["from"]]` raises `TypeError: list indices must be integers or slices, not list`.

Looking forward to your help.
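For anyone hitting the same error: in several YOLOv3 Darknet-cfg parsers, the `from` field of a `[shortcut]` block is parsed into a Python list of layer indices, and a list cannot be used to index `layer_outputs` (itself a list). A hedged sketch of the usual workaround, assuming that parsing; `mdef` and `layer_outputs` are the names from the error above:

```python
# mdef["from"] may have been parsed as a list such as [-3]; indexing the
# layer_outputs list with another list raises the TypeError quoted above.
frm = mdef["from"]
if isinstance(frm, (list, tuple)):
    frm = int(frm[0])  # a shortcut block references a single earlier layer
x = x + layer_outputs[frm]
```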