Question about the configuration of the backbone

Thanks for sharing your great work! I noticed that the backbone configuration differs from the original PSPNet. In the original PSPNet (the code you adapted this from), the backbone is configured as:

`self.layer3 = self._make_layer(block, 256, layers[2], stride=1, dilation=2)`
`self.layer4 = self._make_layer(block, 512, layers[3], stride=1, dilation=4, multi_grid=(1,1,1))`

while in CE2P it is:

`self.layer3 = self._make_layer(block, 256, layers[2], stride=2)`
`self.layer4 = self._make_layer(block, 512, layers[3], stride=1, dilation=2, multi_grid=(1,1,1))`

Human parsing is a finer-grained task than generic scene parsing, so I can understand why you changed the dilation rate. But why do you downsample 4 times, so that the feature map before the PSP module is only 1/16 of the input image size, while the original PSPNet downsamples 3 times and keeps it at 1/8? This seems counterintuitive to me, and I wonder whether it is a mistake. Is there an experiment showing the effectiveness of your modification?

Thanks a lot, and I'm really looking forward to your reply.
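To make the 1/8 versus 1/16 arithmetic concrete, here is a minimal sketch of how I am counting the downsampling. It assumes the usual ResNet layout, which is not quoted above: a stem (strided convolution plus max pooling) that reduces resolution by 4, `layer1` with stride 1, and `layer2` with stride 2. The helper `output_stride` and the 512-pixel input are hypothetical and only for illustration; dilation does not change resolution, so only the strides matter here.

```python
def output_stride(layer_strides, stem_stride=4):
    """Multiply the stem stride by each residual stage's stride.

    stem_stride=4 is an assumption (stride-2 conv followed by stride-2 max pool).
    """
    total = stem_stride
    for stride in layer_strides:
        total *= stride
    return total


# Strides of layer1..layer4. layer1 and layer2 values are assumed
# (standard ResNet); layer3 and layer4 come from the snippets above.
pspnet_strides = [1, 2, 1, 1]  # layer3 stride=1 (dilation=2), layer4 stride=1
ce2p_strides = [1, 2, 2, 1]    # layer3 stride=2, layer4 stride=1 (dilation=2)

pspnet_os = output_stride(pspnet_strides)
ce2p_os = output_stride(ce2p_strides)

input_size = 512  # hypothetical input side length, for illustration only
print(f"PSPNet: 1/{pspnet_os} of the input, roughly {input_size // pspnet_os} px per side")
print(f"CE2P:   1/{ce2p_os} of the input, roughly {input_size // ce2p_os} px per side")
```

Under these assumptions the original configuration gives an output stride of 8 and the CE2P configuration gives 16, which is the difference I am asking about. The exact feature-map size also depends on padding and rounding, so the per-side pixel counts are only approximate.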

liutinglt / CE2P

Question about the configuration of the backbone #28