sacmehta / ESPNet

ESPNet: Efficient Spatial Pyramid of Dilated Convolutions for Semantic Segmentation
https://sacmehta.github.io/ESPNet/
MIT License

A few questions about the structure design of ESPNet #15

Closed kuoweilai closed 5 years ago

kuoweilai commented 6 years ago

Hi sacmehta,

First of all, this is really great work! ESPNet is elegant and efficient. After looking into your paper and code, I have some questions about your strategy and structural design decisions.

(1) I found that the provided ESPNet model does not use another Conv-1(19, C) for the skip connection at the last concat in the decoder, which differs from Fig. 8 (d) ESPNet. Did you abandon that skip connection simply for better performance?

(2) Why did you design ESPNet to be trained in two stages? Is it because the encoder features can then be learned directly, with the decoder serving only for upsampling? Did you also train ESPNet end to end, and if so, could you share some insight into how that went?

(3) I notice that when you train ESPNet-C, you downsample the ground-truth labels to 1/8 resolution. Would it be better to upsample the ESPNet-C output 8x with bilinear interpolation and compute the loss at full resolution?

(4) You decided to use an additional channel for background (not used for evaluation). Did you run experiments on the performance difference with and without this additional channel? Without the additional background channel, on the Cityscapes dataset, we get better visual results and a slightly smaller model. Is there a reason that drove you to add the background channel (for example, better mIoU)?

Thank you so much for your time.

Best,

Kuo-Wei

sacmehta commented 6 years ago

Thanks for your interest in our work.

1) The architecture in Figure 4 of the paper is generic. The first skip connection between the encoder and decoder makes sense if C differs from the number of classes in the dataset. For Cityscapes, the number of classes is close to 19, so it does not make sense to add it. We found experimentally that this connection is irrelevant for the Cityscapes dataset.

2) Semantic segmentation architectures usually use a pretrained encoder, such as a ResNet trained on ImageNet. We did not use a pretrained encoder, which is why we adhere to a two-stage strategy. We also found that models trained end to end from scratch are less accurate than models trained in two stages.

3) You could upsample the feature map and compute the loss at the original image resolution instead of at 1/8th of it.
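
The two options discussed in (3) can be sketched as follows. This is only an illustration, assuming PyTorch. The tensor sizes, the random stand-in tensors, and the variable names are made up for the example and are not taken from the repository's training code.

```python
import torch
import torch.nn.functional as F

# Hypothetical sizes for illustration: batch of 2, 20 classes, 64x128 images.
N, C, H, W = 2, 20, 64, 128
torch.manual_seed(0)

# Stand-in for the ESPNet-C output, which is at 1/8 of the input resolution.
logits_lr = torch.randn(N, C, H // 8, W // 8)
# Stand-in for full-resolution ground-truth labels (one class index per pixel).
labels_hr = torch.randint(0, C, (N, H, W))

# Option A: downsample the labels to 1/8 resolution.
# Nearest-neighbour interpolation keeps every value a valid class index.
labels_lr = F.interpolate(
    labels_hr[:, None].float(), size=(H // 8, W // 8), mode="nearest"
)[:, 0].long()
loss_lr = F.cross_entropy(logits_lr, labels_lr)

# Option B: bilinearly upsample the logits 8x and compute the loss
# against the labels at the original image resolution.
logits_hr = F.interpolate(
    logits_lr, size=(H, W), mode="bilinear", align_corners=False
)
loss_hr = F.cross_entropy(logits_hr, labels_hr)

print(f"loss at 1/8 resolution:  {loss_lr.item():.4f}")
print(f"loss at full resolution: {loss_hr.item():.4f}")
```

Option B evaluates the loss on 64 times as many pixels per image, so it costs more memory and compute during training. Option A discards label detail that is finer than the 8x stride.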

4) In general, ignoring the background is not a good idea, especially when considering the generalizability. I would like to emphasize that the aim of ESPNet is to build a network that is efficient with reasonable accuracy.
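
To make "an extra background channel that is not used for evaluation" from (4) concrete, here is a minimal pure-Python sketch of a mean IoU that skips pixels labelled as background. The function name `mean_iou`, the label index chosen for background, and the toy data are assumptions for illustration. This is not the official Cityscapes evaluation script or the repository's code.

```python
def mean_iou(pred, target, num_eval_classes, ignore_label):
    """Mean IoU over classes 0..num_eval_classes-1.

    Pixels whose ground-truth label equals ignore_label are dropped
    before any counting, so predictions made there do not affect the score.
    """
    pairs = [(p, t) for p, t in zip(pred, target) if t != ignore_label]
    ious = []
    for c in range(num_eval_classes):
        inter = sum(1 for p, t in pairs if p == c and t == c)
        union = sum(1 for p, t in pairs if p == c or t == c)
        if union > 0:  # skip classes absent from both prediction and target
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: 2 evaluated classes (0 and 1), with label 2 as background.
pred = [0, 0, 1, 1, 2, 1]
target = [0, 1, 1, 1, 2, 2]
miou = mean_iou(pred, target, num_eval_classes=2, ignore_label=2)
print(round(miou, 4))
```

In this setup the network can still output the background class. Predicting background on a pixel that belongs to an evaluated class counts as a miss for that class, while predictions on background-labelled pixels are not scored at all.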