marialeyvallina / generalized_contrastive_loss

MIT License

Siamese Vs. Single network in training #4

Closed wenjie710 closed 2 years ago

wenjie710 commented 2 years ago

Hi, I have read your paper and the code. The work is cool and fantastic. However, I am confused about the siamese network here. The paper says the two networks share the same structure and weights, so what would be different if the tensors were passed through a single network in one go? For example, what if we concatenated the two tensors into one batch and fed that to the network? Could you please provide an ablation study of Siamese vs. single network on MSLS? Thanks a lot.

marialeyvallina commented 2 years ago

Hi @wenjie710 We follow a traditional siamese network architecture, which is essentially a single network to which we feed two images (x_i and x_j), together with a similarity label (which we denote psi). The pseudo-code pipeline is as follows:

We first forward two images. Please note that although the network and its weights are the same, the inputs are different, and the outputs are therefore different too.

out_i = network.forward(x_i)
out_j = network.forward(x_j)

We calculate the loss:

error = loss.forward(out_i, out_j, psi)

And finally, we backpropagate and update the network weights.

error.backward()
network.update()

This is also illustrated in our paper, Fig. 2.

The batches are constructed with pairs of images, so for a batch of size N you'll have batch_xi of shape (N, 3, w, h) and batch_xj of shape (N, 3, w, h), which you forward through the network following the pipeline above.
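To make the pipeline above concrete, here is a minimal sketch of one training step (assuming a PyTorch setup; network, loss_fn, and optimizer are hypothetical placeholders, not the exact classes used in this repository):

def train_step(network, loss_fn, optimizer, batch_xi, batch_xj, psi):
    # batch_xi, batch_xj: (N, 3, w, h) image pairs; psi: (N,) similarity labels
    out_i = network(batch_xi)  # the same network (and weights)...
    out_j = network(batch_xj)  # ...processes both sides of each pair
    error = loss_fn(out_i, out_j, psi)
    optimizer.zero_grad()
    error.backward()
    optimizer.step()  # the "network.update()" step in the pseudo-code above
    return error.item()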

To the best of my knowledge, all deep learning work in visual place recognition or image retrieval, on MSLS or other datasets, follows a siamese or triplet architecture, so I am not aware of any Siamese vs. single network ablation study.

Please do not hesitate to send us an email if you have further questions.

wenjie710 commented 2 years ago

"The batches are constructed with pairs of images, so for a batch of size N you'll have batch_xi of shape (N, 3, w, h) and batch_xj of shape (N,3,w,h), which you will forward to the network following the pipeline above."

OK. Got it. So I guess there is no difference if we input a single batch of shape (2N, 3, w, h) and split the output of shape (2N, 3, ww, hh) into two halves of shape (N, 3, ww, hh).

Thanks for the reply.

mlopezantequera commented 2 years ago

I agree there's no difference
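For reference, a sketch of that single-pass variant (assuming PyTorch, and reusing the placeholder names from the sketch above):

import torch

out = network(torch.cat([batch_xi, batch_xj], dim=0))  # one (2N, 3, w, h) forward pass
out_i, out_j = torch.chunk(out, 2, dim=0)              # split back into two N-sized halves
error = loss_fn(out_i, out_j, psi)

One caveat: with batch-dependent layers such as BatchNorm in training mode, the statistics are computed over the combined 2N batch, so the result can differ slightly from two separate N-sized passes.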
