Open jamiechoi1995 opened 6 years ago
because the output feature may not be 7x7
@ruotianluo
@ruotianluo
I think you mean the att features, but what I mean is the fc features.
It seems that you use the average of the conv features over all locations as the fc feature (similar to "Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering", I think).
But what I thought before was that the fc feature is the output of a fully connected layer.
I also see that adaptive_avg_pool2d https://github.com/ruotianluo/ImageCaptioning.pytorch/blob/622b6a5ffe9ee599911306b464dfa1ed2a19fa37/misc/resnet_utils.py#L25 not only lets you specify the att size but also allows the model to accept images of arbitrary size, a nice implementation.
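The arbitrary-input-size point can be seen with a quick sketch (the 2048-channel shapes and att size of 7 below are illustrative assumptions, not values taken from the repo):

```python
import torch
import torch.nn.functional as F

att_size = 7  # assumed attention grid size

# Conv feature maps from two different input resolutions
small = torch.randn(1, 2048, 7, 7)    # e.g. from a 224x224 input
large = torch.randn(1, 2048, 19, 25)  # e.g. from a larger, non-square input

# adaptive_avg_pool2d always produces an att_size x att_size grid,
# so downstream attention code never needs to know the input resolution
for feats in (small, large):
    att = F.adaptive_avg_pool2d(feats, (att_size, att_size))
    print(att.shape)  # torch.Size([1, 2048, 7, 7]) both times
```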
Hi,
I'm curious about the way you extract the fc feature from ResNet.
why did you use https://github.com/ruotianluo/ImageCaptioning.pytorch/blob/622b6a5ffe9ee599911306b464dfa1ed2a19fa37/misc/resnet_utils.py#L24
instead of
x = self.resnet.avgpool(x)
fc = x.view(x.size(0), -1)
as defined in https://github.com/ruotianluo/ImageCaptioning.pytorch/blob/622b6a5ffe9ee599911306b464dfa1ed2a19fa37/misc/resnet.py#L149?
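For contrast, here is a minimal sketch (shapes assumed) of why a fixed-kernel avgpool, like torchvision ResNet's nn.AvgPool2d(7), only acts as a global average on an exactly 7x7 map, while averaging over all spatial locations works for any size:

```python
import torch
import torch.nn as nn

avgpool = nn.AvgPool2d(7)  # fixed 7x7 kernel, as in torchvision's ResNet

x = torch.randn(1, 2048, 7, 7)
fc_fixed = avgpool(x).view(x.size(0), -1)  # works here: shape (1, 2048)
fc_mean = x.mean(dim=(2, 3))               # same values, but size-agnostic

assert torch.allclose(fc_fixed, fc_mean, atol=1e-6)

# On a 10x15 map the fixed pool no longer averages globally
# (it slides a 7x7 window instead), whereas .mean(dim=(2, 3)) still does.
y = torch.randn(1, 2048, 10, 15)
print(avgpool(y).shape)          # torch.Size([1, 2048, 1, 2]) -- not a 2048-d fc
print(y.mean(dim=(2, 3)).shape)  # torch.Size([1, 2048])
```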