microsoft / FocalNet

[NeurIPS 2022] Official code for "Focal Modulation Networks"
MIT License
682 stars, 61 forks

How is global average pooling done, in detail? #16

Closed. mattroos closed this issue 1 year ago

mattroos commented 1 year ago

In the FocalNet paper, it states:

"To capture global context of the whole input, which could be high-resolution, we apply a global average pooling on the L-th level feature map Z(L+1) = Avg-Pool(Z(L)). Thus, we obtain in total (L+1) feature maps"

I can't find any further details in the paper. Can @jwyang or others elaborate? I don't understand how one starts with L feature maps and ends up with L+1 feature maps. If I understand correctly, the pooling is spatial, over the H x W dimensions. Once pooled, the maps will be smaller; e.g., a pooling size of 2 would give H/2 x W/2. So how are feature maps of smaller size added to maps of larger size to get Zout?

jwyang commented 1 year ago

Hi @mattroos, the global pooling converts the HxWxD feature map into a 1x1xD feature that captures the global context. This singleton feature is then added to the HxWxD feature maps by simply duplicating it across the spatial dimensions. Please refer to this line of code and the following line:

https://github.com/microsoft/FocalNet/blob/ca2f105ffb44fd6e6ab6ca04259f8fea3252913d/classification/focalnet.py#L87
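For readers who want to see the shapes involved, here is a minimal NumPy sketch of the idea described above. It is not the repository's code (which is written in PyTorch): the function name `add_global_context`, the channels-first `(B, C, H, W)` layout, and the all-ones `gate` are assumptions made for illustration, and any activation applied to the pooled feature is omitted.

```python
import numpy as np

def add_global_context(ctx_all, ctx, gate):
    """Illustrative only (hypothetical helper, not the FocalNet API).

    ctx_all: (B, C, H, W) running sum of the gated per-level features
    ctx:     (B, C, H, W) feature map at the last focal level
    gate:    (B, 1, H, W) gate for the global level (assumed shape)
    """
    # Global average pooling: average over the full H x W extent,
    # keeping singleton spatial dims so the result is (B, C, 1, 1).
    ctx_global = ctx.mean(axis=(2, 3), keepdims=True)
    # Broadcasting duplicates the 1x1 feature across all H x W positions.
    return ctx_all + ctx_global * gate, ctx_global

rng = np.random.default_rng(0)
B, C, H, W = 2, 4, 8, 8
ctx = rng.standard_normal((B, C, H, W))
ctx_all = np.zeros((B, C, H, W))
gate = np.ones((B, 1, H, W))  # assumption: gate of all ones, for clarity

out, ctx_global = add_global_context(ctx_all, ctx, gate)
print(ctx_global.shape)  # (2, 4, 1, 1)
print(out.shape)         # (2, 4, 8, 8)
```

Because the pooling window is the whole map rather than a fixed size such as 2, the pooled feature has no spatial extent left, so adding it back only requires broadcasting, not upsampling.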

mattroos commented 1 year ago

Thanks, @jwyang! I see now that I was confusing capital "L" (total number of feature maps) with lower case "l" (the variable that identifies a given feature map).
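For anyone with the same confusion, the indexing can be written out as below. This is a sketch based on the quoted passage and the paper's aggregation step; the symbols $f_a^{\ell}$ (the per-level contextualization function) and $G^{\ell}$ (the per-level gate) are transcribed from memory and should be checked against the paper.

```latex
\begin{aligned}
Z^{\ell} &= f_a^{\ell}\left(Z^{\ell-1}\right), \quad \ell = 1, \dots, L \\
Z^{L+1} &= \text{Avg-Pool}\left(Z^{L}\right) \\
Z^{out} &= \sum_{\ell=1}^{L+1} G^{\ell} \odot Z^{\ell}
\end{aligned}
```

Here lowercase $\ell$ indexes a single level, while capital $L$ is the fixed number of focal levels; the globally pooled map is the one extra, $(L+1)$-th term in the sum.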