microsoft / DynamicHead


DyHead PyTorch Implementation #10

Open · Coldestadam opened this issue 3 years ago

Coldestadam commented 3 years ago

Hi there,

Please look at the notes in my README text file. Hopefully you can use this code as a head start, or at least give the community something to build on to finish the implementation. If anything in the code is incorrect, I would really like to see the mistakes so I can understand what I got wrong.

Thanks, Adam

jerryzhang-ss commented 3 years ago

Hi Adam,

Thanks a lot for your great work; it is really inspiring! However, I have some questions regarding the scale-aware attention:

  1. It seems we need to know the size of dimension S for this layer. But when we are dealing with multi-scale input, where the feature maps coming out of the feature extractor have arbitrary sizes, what should we do?
  2. From equation (3), did they use global average pooling along the (S, C) dimensions? If that were the case, we might not need the size of dimension S. But then what would be the purpose of the 1x1 conv layer?

Please correct me if I did not understand it correctly. Again, thank you so much for your effort!

Jerry

Coldestadam commented 3 years ago

Hi Jerry,

I am not one of the authors of this paper, but I will try to answer your questions the way I understood the paper. Just keep in mind that I might have some things wrong, as I said in my GitHub repo.

  1. I think the answer to your question is in Section 3.1. Basically, the authors reshape all the output feature maps of the Feature Pyramid Network (FPN) into one tensor with dimensions (L, S, C). The way they describe it, you find the output feature map with the median height and width (H x W), then resize all the other output feature maps to that size by downsampling or upsampling. This was tricky on my part, since the built-in RCNN-FPN models in PyTorch have four output feature maps from the FPN, so I decided to just calculate the median of all the heights and widths and resize every level to that median size. Once every output has the same height and width, I concatenate all of them into one tensor with dimensions (L, H, W, C), then flatten the height and width dimensions so the tensor becomes (L, S, C). That tensor is what gets passed into the DyHead or any of the individual blocks.

  2. I want you to refer to Figure 1, where pi_L is multiplied by F; pi_L has dimensions (L, C). The 1x1 convolution layer is there to reduce the dimension S. To do that, the tensor F with dimensions (L, S, C) is transposed to (S, L, C), and the convolutional layer treats (L, C) as (height, width). I admit the equation makes it confusing, but that is how I understood it from Figure 1. The 1x1 convolution, together with the global average pooling, is meant to approximate the function f in that equation. Its output is passed through the ReLU and the sigmoid, and the result is multiplied with F to give the output of the scale-aware attention layer. Does that make sense? (A rough PyTorch sketch of both steps follows right after this list.)
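
To make both steps concrete, here is a rough PyTorch sketch of how I think the pieces fit together. This is not the authors' code; the batch dimension, the median/interpolation choices, and the helper names (fpn_to_LSC, ScaleAwareAttention) are just my own assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def fpn_to_LSC(fpn_outputs):
    """Reshape a list of L FPN maps, each (B, C, H_l, W_l), into one (B, L, S, C) tensor."""
    # Pick a common (median) height and width across the L levels.
    # With four levels I just take the upper-middle value; the exact choice is a guess.
    h_med = sorted(f.shape[-2] for f in fpn_outputs)[len(fpn_outputs) // 2]
    w_med = sorted(f.shape[-1] for f in fpn_outputs)[len(fpn_outputs) // 2]

    # Up- or downsample every level to the median size, then stack along a new L dimension.
    resized = [F.interpolate(f, size=(h_med, w_med), mode="nearest") for f in fpn_outputs]
    x = torch.stack(resized, dim=1)            # (B, L, C, H, W)

    # Flatten H and W into S = H * W and move the channels last.
    return x.flatten(3).permute(0, 1, 3, 2)    # (B, L, S, C)


class ScaleAwareAttention(nn.Module):
    """My reading of pi_L: average over S, a 1x1 conv over the (L, C) plane as f, then ReLU + sigmoid."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(1, 1, kernel_size=1)   # plays the role of f on the (L, C) plane

    def forward(self, x):                        # x corresponds to the tensor F: (B, L, S, C)
        pooled = x.mean(dim=2)                   # average pool over S      -> (B, L, C)
        gate = self.conv(pooled.unsqueeze(1))    # treat (L, C) as (H, W)   -> (B, 1, L, C)
        gate = torch.sigmoid(F.relu(gate))       # ReLU then sigmoid, as described above
        gate = gate.squeeze(1).unsqueeze(2)      # (B, L, 1, C), broadcast over S
        return x * gate                          # same shape as the input F


# Quick shape check with made-up FPN sizes.
feats = [torch.randn(2, 256, s, s) for s in (100, 50, 25, 13)]
x = fpn_to_LSC(feats)                            # (2, 4, 2500, 256) here
out = ScaleAwareAttention()(x)                   # (2, 4, 2500, 256)
```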

Also thanks for your kind words. Thanks, Adam

jerryzhang-ss commented 3 years ago

Hi Adam,

Thanks for your quick response and detailed explanation; it makes your reasoning much clearer.

Sorry that I didn't describe my question well in the first place. By "multi-scale input", I actually meant the raw input shape. Some detection frameworks like detectron2 support keep-ratio resizing over a range of shortest-edge values, like here. This can improve the robustness of the detection model, but it causes the feature shapes coming out of the backbone to be arbitrary. So if we fix s_size, we would probably fail in this scenario.

qdd1234 commented 2 years ago

Hi, thanks for your reproduction. I have a question: if I apply Dynamic Head to ATSS, is there only one final prediction branch? ATSS predicts at three scales, as shown in the picture, but Dynamic Head needs to concatenate the outputs of the FPN, so does that mean the prediction happens at a single scale?