Closed: TheSunWillRise closed this issue 5 years ago

Hi, this is really nice work, and the task proposed in your paper is a valuable research problem. But I have a question about MaskNet: according to Section 4.1 and Fig. 3B of your paper, the proposals are generated from the one-hot vectors and the target-encoder outputs, and the scene-encoder outputs are not used to generate proposals. However, in your code, the function `mask_net` generates proposals from the one-hot vectors and the output of the function `matching_filter`, which takes both `targets_encoded` and `images_encoded` as inputs. Is anything wrong here?
Hi, thanks for your comment. The code is a bit misleading here, but the scene information is not used. However, I also had to read through it again to understand what is happening and to be sure this is the case.
What happens is: the matching step only rescales the target representation by a per-position scalar, and the result is then layer-normalized without learned scale or offset:

```python
decoder_input = slim.layer_norm(masked, scale=False, center=False, scope='matching_normalization')
```

Layer normalization subtracts the mean of each feature vector and divides by its standard deviation, so a positive per-position scalar factor cancels out. Therefore the effect is the same as never having scaled the target representation at all.
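To make the cancellation concrete, here is a minimal NumPy sketch (the `layer_norm` helper below mimics `slim.layer_norm(scale=False, center=False)`; all shapes are made up for illustration):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each feature vector to zero mean and unit variance,
    # with no learned scale or offset, mimicking
    # slim.layer_norm(scale=False, center=False).
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

rng = np.random.default_rng(0)
target = rng.normal(size=(1, 4, 4, 32))           # target encoding (made-up shape)
match = rng.uniform(0.5, 2.0, size=(1, 4, 4, 1))  # positive per-position match score

masked = target * match  # scaling the target with the matching
# The per-position scalar cancels in the normalization, so the result is
# numerically the same as if the target had never been scaled.
print(np.allclose(layer_norm(masked), layer_norm(target), atol=1e-4))  # True
```

Note that the cancellation only holds for positive match values; a negative scalar would flip the sign of the normalized features.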
I have to admit, this is more than confusing, and scaling the target with the matching instead of simply tiling its representation is unnecessary. However, since the layer norm effectively reverts the process, it should be correct and work as stated in the paper.
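For reference, a minimal sketch of the simpler tiling alternative, assuming the target has been pooled to a single [B, 1, 1, C] vector; the helper name `tile_target` and the shapes are assumptions for illustration, not the repository's actual API:

```python
import tensorflow as tf

def tile_target(targets_encoded, images_encoded):
    """Broadcast the target representation over the scene's spatial grid
    instead of scaling it with the matching result.

    targets_encoded: [B, 1, 1, C] pooled target encoding (assumed shape)
    images_encoded:  [B, H, W, C] scene feature map (assumed shape)
    """
    h = tf.shape(images_encoded)[1]
    w = tf.shape(images_encoded)[2]
    # Repeat the single target vector at every spatial position.
    return tf.tile(targets_encoded, [1, h, w, 1])
```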
Thanks for pointing us to this needless complication. I guess this is not the only place where the code is unnecessarily complex. I will try to find some time to go over it and remove similar confusing elements.