Closed mattroos closed 1 year ago
Hi @mattroos, the global pooling converts the HxWxD feature map into a 1x1xD feature that captures the global context. This singleton feature is then added to the HxWxD feature maps by simply duplicating it across the spatial dimensions, so no resizing is needed. Please refer to this line of code and the following line:
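The referenced lines are not reproduced in this thread. As an illustration only, here is a minimal NumPy sketch of the idea described above: average over H and W to get a 1x1xD vector, then let broadcasting duplicate it across every spatial position when adding. The function name `add_global_context` and the toy shapes are hypothetical, and this is not the repository's code, which may also apply an activation or gating that is left out here.

```python
import numpy as np

def add_global_context(x):
    """Add a globally pooled feature back onto an (H, W, D) feature map.

    Hypothetical helper for illustration; not taken from the FocalNet repo.
    """
    # Global average pooling over the spatial axes: (H, W, D) -> (1, 1, D)
    g = x.mean(axis=(0, 1), keepdims=True)
    # Broadcasting duplicates g across H and W, so the sum keeps shape (H, W, D)
    return x + g, g

# Toy feature map with H=2, W=3, D=4
x = np.arange(2 * 3 * 4, dtype=np.float64).reshape(2, 3, 4)
out, g = add_global_context(x)
print(g.shape)    # (1, 1, 4)
print(out.shape)  # (2, 3, 4)
```

Because the pooled feature has singleton spatial dimensions, it can be added to a map of any H x W without a size mismatch.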
Thanks, @jwyang! I see now that I was confusing capital "L" (total number of feature maps) with lower case "l" (the variable that identifies a given feature map).
In the FocalNet paper, it states:
I'm not finding any other details in the paper. Can @jwyang or others give more details? I don't understand how one starts with L feature maps and ends up with L+1 feature maps. If I understand correctly, the pooling is spatial, over the H x W dimensions. Once pooled, the spatial dimensions will be smaller, e.g., a pooling size of 2 would give H/2 x W/2. So how are feature maps of smaller size added to maps of larger size to get Zout?