sitzikbs / netVLAD

netVLAD implementation in TensorFlow

Is it possible to replace the way computing soft weights with simple conv? #2

Closed Kimilovesy closed 5 years ago

Kimilovesy commented 6 years ago

Hi,

First, thanks for sharing the code; I got a lot of inspiration from it. I am wondering whether it is possible to replace the way you compute the weights that feed into the VLAD core in the original paper with a simple tf.nn.conv2d(...). As shown in the pic, the filters have size 1 x 1 x Dim x # of clusters. It would be nice if you could check whether my understanding is correct. Thanks in advance!

[Screenshot: 2018-03-13 at 10:50:32]

sitzikbs commented 6 years ago

I'm not sure I understand the question. What about the centers?

Kimilovesy commented 6 years ago

In my opinion, the centers are trainable and have the same dimensionality as the input descriptors. So we only need to initialize the centers with shape [# of clusters, dim], using the Xavier method or anything else, since, as you also mentioned, initializing with K-means does not yield any performance gain. At each training step we then get updated cluster centers, right?
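
To make the initialization described above concrete, here is a minimal NumPy sketch of a Xavier (Glorot) uniform draw for a centers array of shape [# of clusters, dim]. The sizes `K = 64` and `D = 512` and the helper name `xavier_uniform` are assumptions for illustration only; they are not taken from the repository. In a real training setup the array would be a trainable variable updated by the optimizer.

```python
import numpy as np

def xavier_uniform(shape, rng):
    """Glorot/Xavier uniform init: U(-limit, limit), limit = sqrt(6 / (fan_in + fan_out))."""
    fan_a, fan_b = shape
    limit = np.sqrt(6.0 / (fan_a + fan_b))
    return rng.uniform(-limit, limit, size=shape)

K, D = 64, 512                          # hypothetical: 64 clusters, 512-dim descriptors
rng = np.random.default_rng(0)
centers = xavier_uniform((K, D), rng)   # one center per cluster, same dim as the input
```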

Kimilovesy commented 6 years ago

My question is whether we can replace this part of the code with a simple convolution. As shown in the original paper, they use 1x1xDxK filters.

[Screenshots: 2018-03-13 at 11:24:16 and 2018-03-13 at 11:24:39]
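
The proposal can be checked numerically. The sketch below uses plain NumPy as a stand-in for a 1x1 convolution with a filter of shape [1, 1, D, K] (the layout that tf.nn.conv2d expects), followed by a softmax over the K channels, and compares it with the explicit per-descriptor product w_k^T x_i + b_k. All sizes and variable names are hypothetical, and this covers only the soft-assignment weights, not the full VLAD layer.

```python
import numpy as np

def conv1x1(x, filters, bias):
    """NumPy stand-in for a 1x1 convolution.

    x: (H, W, D), filters: (1, 1, D, K), bias: (K,) -> (H, W, K).
    A 1x1 kernel touches a single pixel, so it is a per-pixel matrix product.
    """
    return np.einsum("hwd,dk->hwk", x, filters[0, 0]) + bias

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

H, W, D, K = 4, 5, 8, 3                       # small hypothetical sizes
rng = np.random.default_rng(0)
x = rng.normal(size=(H, W, D))
filters = rng.normal(size=(1, 1, D, K))
bias = rng.normal(size=K)

# Route 1: 1x1 convolution followed by a softmax over the K channels.
a_conv = softmax(conv1x1(x, filters, bias))

# Route 2: flatten to N = H*W descriptors and compute w_k^T x_i + b_k explicitly.
w = filters[0, 0]                             # (D, K)
a_flat = softmax(x.reshape(-1, D) @ w + bias).reshape(H, W, K)

same = np.allclose(a_conv, a_flat)            # the two routes should agree
```

If `same` is true, the soft-assignment weights can be produced by a 1x1 convolution plus softmax; the aggregation over pixels still has to be done separately.
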
sitzikbs commented 6 years ago

I am a bit rusty on the details (not my paper, I just implemented it), but your idea seems OK. Note, however, that the weights are only shared across k, so I think you can try replacing the code. Let me know if it comes out the same.

pdpdpd2013 commented 6 years ago

Hi,

I am also working on netVLAD for place recognition.

In my opinion, the VLAD layer is not just another conv2d layer, so you cannot replace it with tf.nn.conv2d(). The reason is that the VLAD layer accumulates the residuals of all pixels globally, according to Equation (4) in the original paper: https://arxiv.org/pdf/1511.07247.pdf
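
For readers without the paper at hand, Equation (4) can be written roughly as follows. The notation is reproduced from memory and should be checked against the PDF: x_i is the i-th of N local descriptors, c_k is the k-th cluster center, and w_k, b_k are the soft-assignment parameters.

```latex
V(j, k) = \sum_{i=1}^{N}
  \frac{e^{\mathbf{w}_k^{T}\mathbf{x}_i + b_k}}
       {\sum_{k'} e^{\mathbf{w}_{k'}^{T}\mathbf{x}_i + b_{k'}}}
  \,\bigl(x_i(j) - c_k(j)\bigr)
```

The fraction is a softmax over clusters computed independently for each descriptor, while the outer sum runs over all N spatial positions.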

For the VLAD layer, each pixel in the output (size WxHx64) contains information from all pixels of the input (size WxHx512). If you use a conv2d layer instead, each pixel in the output only contains information from local pixels of the input.
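
The local-versus-global distinction can be illustrated with a small NumPy sketch, under the assumption that the layer follows Equation (4) with normalization omitted. Perturbing one input descriptor changes the soft assignment only at that position, but it changes the aggregated VLAD descriptor, because that descriptor sums over every position. Sizes, names, and the perturbed index are hypothetical.

```python
import numpy as np

def soft_assign(x, w, b):
    """Per-descriptor softmax over clusters; x: (N, D), w: (D, K), b: (K,) -> (N, K)."""
    z = x @ w + b
    z = z - z.max(axis=1, keepdims=True)      # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def vlad(x, w, b, centers):
    """V[k, j] = sum_i a_k(x_i) * (x_i[j] - c_k[j]); returns (K, D), unnormalized."""
    a = soft_assign(x, w, b)                        # (N, K), local to each descriptor
    residuals = x[:, None, :] - centers[None]       # (N, K, D)
    return (a[:, :, None] * residuals).sum(axis=0)  # global sum over all N positions

N, D, K = 20, 8, 3                            # N = H*W flattened positions
rng = np.random.default_rng(0)
x = rng.normal(size=(N, D))
w = rng.normal(size=(D, K))
b = rng.normal(size=K)
centers = rng.normal(size=(K, D))

x2 = x.copy()
x2[7] += 1.0                                  # perturb a single descriptor

# Positions whose soft assignment changed: expected to be only the perturbed one.
a_delta = np.abs(soft_assign(x2, w, b) - soft_assign(x, w, b)).max(axis=1)
changed_rows = np.flatnonzero(a_delta > 1e-12)

# The aggregated descriptor changes even though only one position was touched.
v_delta = np.abs(vlad(x2, w, b, centers) - vlad(x, w, b, centers))
```

Read this way, the two views in this thread are compatible: the soft-assignment step behaves like a 1x1 convolution plus softmax, whereas the residual aggregation is a sum over all positions that a convolution alone does not provide.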

sitzikbs commented 5 years ago

Closing due to lack of activity.