How is global average pooling done, in detail?

mattroos commented 1 year ago

"To capture global context of the whole input, which could be high-resolution, we apply a global average pooling on the L-th level feature map Z(L+1) = Avg-Pool(Z(L)). Thus, we obtain in total (L+1) feature maps"

I can't find any other details in the paper. Can @jwyang or others give more details? I don't understand how one starts with L feature maps and ends up with L+1 feature maps. If I understand correctly, the pooling is spatial, over the H x W dimensions. Once pooled, the spatial dimensions will be smaller, e.g., a pooling size of 2 would give H/2 x W/2. So how are feature maps of smaller size added to maps of larger size to get Zout?

jwyang commented 1 year ago

Hi @mattroos, the global pooling converts the HxWxD feature map into a 1x1xD feature to capture the global context. This singleton feature is then added to the HxWxD feature maps by simply duplicating it across the spatial dimensions. Please refer to this line of code and the following line:

https://github.com/microsoft/FocalNet/blob/ca2f105ffb44fd6e6ab6ca04259f8fea3252913d/classification/focalnet.py#L87
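
For illustration, the pooling-and-broadcast step described above can be sketched as follows. This is a minimal NumPy sketch, not the repository's PyTorch code: the function name `add_global_context` and the (H, W, D) array layout are assumptions made for the example, and any activation or gating that the linked lines may apply is left out.

```python
import numpy as np


def add_global_context(z):
    """Add a globally pooled feature back onto a feature map.

    z: array of shape (H, W, D). Hypothetical helper, for illustration only.
    """
    # Average over both spatial axes; keepdims=True leaves shape (1, 1, D).
    z_global = z.mean(axis=(0, 1), keepdims=True)
    # Broadcasting duplicates the 1x1xD feature across all H x W positions,
    # so the sum keeps the original (H, W, D) shape.
    return z + z_global, z_global


H, W, D = 4, 6, 3
z = np.arange(H * W * D, dtype=np.float64).reshape(H, W, D)
z_out, z_global = add_global_context(z)

print(z_global.shape)  # (1, 1, 3)
print(z_out.shape)     # (4, 6, 3)
```

Because the pooled feature has size 1 along both spatial axes, no resizing is needed before the addition: the same D-dimensional vector is added at every spatial position.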

mattroos commented 1 year ago

Thanks, @jwyang! I see now that I was confusing capital "L" (total number of feature maps) with lower case "l" (the variable that identifies a given feature map).
