rdroste / unisal

Unified Image and Video Saliency Modeling (ECCV 2020)
https://arxiv.org/abs/2003.05477
Apache License 2.0

On the pooling method of the backbone network #1

Closed: Hao-Liu closed this issue 4 years ago

Hao-Liu commented 4 years ago

Great work! But I'm curious about the choice of the unusual pooling method in the MobileNet backbone. You didn't use a standard pooling method like average/max pooling or pooling via stride in the convolution, but instead directly slice the feature map, keeping only a quarter of its elements. I would have thought this stops the gradient for the other 75% of the elements during backpropagation and makes them useless, which doesn't seem to make sense. Is this an intentional design or an arbitrary choice?
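For reference, a minimal sketch of the slicing pattern being asked about (hypothetical shapes, not the repository's exact code): taking every second element along both spatial dimensions keeps a quarter of the feature-map positions.

```python
import torch

x = torch.randn(1, 32, 56, 56)   # hypothetical feature map (N, C, H, W)
x_down = x[:, :, ::2, ::2]       # "directly slice a quarter of the input"
print(x_down.shape)              # torch.Size([1, 32, 28, 28]) -> 1/4 of the spatial positions
```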

rdroste commented 4 years ago

Hi Hao Liu, thanks for your question and apologies for the long delay in getting back to you. I believe you are referring to this line? https://github.com/rdroste/unisal/blob/17ab7ddb40cae5196423aa31ba7e4eb2c4267581/unisal/models/MobileNetV2.py#L171 This is equivalent to a strided convolution, i.e., discarding every second element along each spatial dimension. We apply the stride manually after the convolution because we copy the features before down-sampling for the skip connection that feeds into the decoder. Does that answer your question?
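For illustration, a minimal sketch (with made-up shapes, not the repository's code) of the equivalence described in the reply: slicing after a stride-1 convolution gives the same values as a stride-2 convolution with the same weights, while the full-resolution features remain available for the skip connection.

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(32, 64, kernel_size=3, padding=1, stride=1, bias=False)
x = torch.randn(1, 32, 56, 56)

full = conv(x)                   # full-resolution features, copied for the skip connection
sliced = full[:, :, ::2, ::2]    # manual "stride": discard every second element afterwards

# The same convolution applied with stride 2 produces the retained outputs directly.
conv_strided = nn.Conv2d(32, 64, kernel_size=3, padding=1, stride=2, bias=False)
with torch.no_grad():
    conv_strided.weight.copy_(conv.weight)
strided = conv_strided(x)

print(torch.allclose(sliced, strided))  # True: both paths compute the same down-sampled features
```

Because the convolution weights are shared across spatial positions, the retained outputs still receive gradients that update the same kernel, and the full-resolution copy used for the skip connection keeps all positions in the gradient path.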

rdroste commented 4 years ago

@Hao-Liu did my reply answer your question? Let me know if I can provide any further clarifications.

rdroste commented 4 years ago

Closing this issue due to inactivity