seoungwugoh / STM

Video Object Segmentation using Space-Time Memory Networks

Confusion about the memory encoder implementation #11

Closed xmlyqing00 closed 4 years ago

xmlyqing00 commented 4 years ago

Hi,

Thanks for your outstanding model and excellent implementation. I have a question about the memory encoder. In the class Encoder_M, you sum up the frame and the mask at the very beginning:

x = self.conv1(f) + self.conv1_m(m) + self.conv1_o(o) 

However, it is confusing because in your paper, you say:

The inputs are concatenated along the channel dimension before being fed into the memory encoder. For the memory encoder, the first convolution layer is modified to be able to take a 4-channel tensor by implanting additional single channel filters.

Could you explain this difference or talk more about the intuition behind your implementation? Thanks in advance.

xmlyqing00 commented 4 years ago

Thanks for the author's reply. They are equivalent.

pixelsmaker commented 4 years ago

Thanks for the author's reply. They are equivalent.

why are they equivalent?

seoungwugoh commented 4 years ago

@pixelsmaker It is because convolution is a linear operation. Thus, applying a single convolution after concatenating all the inputs is exactly equivalent to summing the outputs of separate convolutions applied to each input.
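A minimal numpy sketch of this equivalence (shapes here are illustrative, not the actual STM ones): a 4-channel filter applied to the concatenation of a 3-channel frame and a 1-channel mask gives the same result as splitting the filter and summing the two separate convolutions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: a 3-channel frame and a 1-channel mask, 5x5 spatial.
f = rng.standard_normal((3, 5, 5))   # frame
m = rng.standard_normal((1, 5, 5))   # mask

# One 4-channel 3x3 filter, split into a 3-channel and a 1-channel part.
w = rng.standard_normal((4, 3, 3))   # filter over the concatenated input
w_f, w_m = w[:3], w[3:]              # per-input filters

def conv2d(x, w):
    """Valid cross-correlation: sum over channels and a 3x3 window."""
    c, h, wd = x.shape
    out = np.zeros((h - 2, wd - 2))
    for i in range(h - 2):
        for j in range(wd - 2):
            out[i, j] = np.sum(x[:, i:i+3, j:j+3] * w)
    return out

# Concatenate-then-convolve vs convolve-separately-then-sum.
concat_out = conv2d(np.concatenate([f, m], axis=0), w)
split_out = conv2d(f, w_f) + conv2d(m, w_m)
print(np.allclose(concat_out, split_out))  # True
```

(The equivalence holds up to the bias term: the split implementation should either keep a single bias or make sure the separate biases sum to the original one.)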