liutinglt / CE2P


Question about the configuration of the backbone #28

Closed lijing1996 closed 5 years ago

lijing1996 commented 5 years ago

Thanks for sharing your great work! I noticed that the backbone configuration differs from the original PSPNet. In the original PSPNet (the code yours is based on), the backbone is configured as:

`self.layer3 = self._make_layer(block, 256, layers[2], stride=1, dilation=2)`
`self.layer4 = self._make_layer(block, 512, layers[3], stride=1, dilation=4, multi_grid=(1,1,1))`

while in CE2P it is:

`self.layer3 = self._make_layer(block, 256, layers[2], stride=2)`
`self.layer4 = self._make_layer(block, 512, layers[3], stride=1, dilation=2, multi_grid=(1,1,1))`

Human parsing is a finer-grained task than generic scene parsing, so I understand why you changed the dilation rate. But why do you downsample 4 times, so that the feature map before the PSP module is only 1/16 of the input size, while the original PSPNet downsamples 3 times and keeps it at 1/8? This does not make sense to me, and I wonder whether it is a mistake. Is there an experiment showing the effectiveness of your modification? Thanks a lot, and I look forward to your reply.
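
The downsampling counts in this question can be checked with a small sketch. It assumes the usual ResNet layout before `layer3` (a stride-2 stem convolution, a stride-2 max-pool, `layer1` at stride 1 and `layer2` at stride 2), which is not quoted in this thread. The names `ASSUMED_EARLY_STRIDES` and `output_stride` are made up for illustration.

```python
from math import prod

# Assumption: strides of the stem conv, max-pool, layer1 and layer2
# in a standard ResNet. These values are not shown in the issue.
ASSUMED_EARLY_STRIDES = (2, 2, 1, 2)


def output_stride(layer3_stride, layer4_stride):
    """Total downsampling factor: the product of all stage strides."""
    return prod(ASSUMED_EARLY_STRIDES + (layer3_stride, layer4_stride))


# PSPNet-style backbone: layer3 and layer4 keep stride 1 and use dilation.
pspnet_stride = output_stride(layer3_stride=1, layer4_stride=1)

# CE2P-style backbone: layer3 uses stride 2, layer4 keeps stride 1.
ce2p_stride = output_stride(layer3_stride=2, layer4_stride=1)

print(f"PSPNet-style feature map: 1/{pspnet_stride} of the input")
print(f"CE2P-style feature map:   1/{ce2p_stride} of the input")
```

Under that assumption, the only stride that differs between the two configurations is the one in `layer3`, which accounts for the extra factor of 2.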

liutinglt commented 5 years ago

@lijing1996 Since we have introduced an edge module and a decoder module, the features are already high-resolution. In our experiments, 1/8 brought little improvement over 1/16, while memory usage and training time increased many times over. You can also try 1/8.
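
As a rough illustration of the cost side of this trade-off, the sketch below counts spatial positions in the feature map at each output stride. The 512x512 input size and the `feature_positions` helper are hypothetical and chosen only so the numbers divide cleanly. Actual memory and training time depend on the implementation and are not measured here.

```python
from math import ceil


def feature_positions(input_size, stride):
    """Number of spatial positions in a square feature map.

    Uses ceil(input_size / stride) as an approximation; the exact size
    depends on the padding used by each convolution.
    """
    side = ceil(input_size / stride)
    return side * side


INPUT_SIZE = 512  # hypothetical input resolution, not taken from the thread

positions_at_8 = feature_positions(INPUT_SIZE, 8)
positions_at_16 = feature_positions(INPUT_SIZE, 16)
ratio = positions_at_8 / positions_at_16

print(f"1/8 feature map:  {positions_at_8} positions")
print(f"1/16 feature map: {positions_at_16} positions")
print(f"ratio: {ratio:.1f}x")
```

Every layer that runs at the higher resolution stores and processes this many more activations per channel, so halving the output stride multiplies the work in those layers by about four.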