I was impressed by your paper, thanks.
I have a question about the feature dimension in the proposed architecture.
I see that both the 'vision-based classifier' and the 'text-based classifier' have a dimension of 512.
But in many cases, the features after the RoI-pooling layer (e.g., in Faster R-CNN) have a dimension of 2048 or 1024.
Did you change some configuration for this, or add an extra layer?
Thanks,