Closed divyagupta25 closed 1 year ago
Hello,
Traditionally in VPR, during training, images within 10 m of the query are considered suitable to be mined as positives, whereas during inference matches are usually considered correct if within 25 m. Only images within 10 m are used as training positives because they have a higher chance of high visual overlap with the query. However, even if we choose positives within 10 m, we want to avoid using images in the (10, 25) m range as negatives, as this would push them away from the query; even if they might have not-so-high visual overlap, they would still be considered correct matches if ever retrieved. For this reason we keep this distinction during training: soft positives are the images within 25 m. We need this distinction so that when we select positives, we take them from hard_positives, and when we select negatives, we make sure not to select any of them from the soft_positives list. During inference we use only hard_positives, but with the threshold raised to 25 m, as we do not need to do mining.
Note that this discussion only makes sense for training; at inference we do not care about visual overlap and we are simply happy if, for whatever reason, we retrieve images within 25 m. In VPR there is no clear definition of positive images based on coordinates alone. Seminal works like NetVLAD were developed without heading information, and thus relied on mining to select positives: among images within 10 m, select the one with the highest feature similarity to the query. Even though more recent datasets like MSLS and SF-XL do provide heading information (and the CosPlace paper uses it to build classes, doing classification rather than metric learning via a triplet loss), using heading to define positives has never been studied for retrieval. For this reason, following common practice in the literature, we do not try to use heading: we simply use 10 m as the threshold for hard positives during training, and use the network to extract features and pick the closest candidates in feature space, which are thus most likely to have high visual overlap.
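The two thresholds described above can be sketched as follows. This is a simplified illustration with hypothetical names (the actual dataset code computes distances with kNN over UTM coordinates), but it shows the three groups: hard positives (mining candidates), soft positives (never used as negatives), and negatives.

```python
import numpy as np

def split_by_distance(query_xy, db_xy, hard_thresh=10.0, soft_thresh=25.0):
    """Partition database indices for one query by geographic distance (meters).

    hard positives: within 10 m  -> candidates for positive mining
    soft positives: within 25 m  -> excluded from negatives; correct at inference
    negatives:      farther than 25 m
    """
    dists = np.linalg.norm(db_xy - query_xy, axis=1)
    hard_positives = np.where(dists <= hard_thresh)[0]
    soft_positives = np.where(dists <= soft_thresh)[0]
    negatives = np.where(dists > soft_thresh)[0]
    return hard_positives, soft_positives, negatives

# Toy UTM-like coordinates: database images at 5 m, 20 m, and 100 m from the query.
db = np.array([[0.0, 5.0], [0.0, 20.0], [0.0, 100.0]])
hard, soft, neg = split_by_distance(np.array([0.0, 0.0]), db)
```

Note that the image at 20 m lands in soft positives only: it is never mined as a positive, but it is also never pushed away as a negative.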
It can happen (and it does in MSLS) that a query sequence does not have any database sequence within 25 m.
The parameter 'cut_last_frame' is needed because the code that builds sequences only works with odd sequence lengths (3, 5, 7, ...). If you want to use even-length sequences, you can set cut_last_frame to true, which cuts the last frame. In practice a length of 5 is usually used.
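The idea can be sketched like this (a hypothetical simplification, not the repo's exact builder): an even length is obtained by building one frame longer around the center and dropping the last frame.

```python
def build_sequence(center_idx, seq_len, cut_last_frame=False):
    """Return frame indices centered on center_idx.

    The builder itself only handles odd lengths; for an even seq_len,
    build seq_len + 1 frames and pass cut_last_frame=True.
    """
    build_len = seq_len + 1 if cut_last_frame else seq_len
    assert build_len % 2 == 1, "sequence builder only supports odd lengths"
    half = build_len // 2
    frames = list(range(center_idx - half, center_idx + half + 1))
    return frames[:-1] if cut_last_frame else frames

odd_seq = build_sequence(10, 5)                        # length 5 around frame 10
even_seq = build_sequence(10, 4, cut_last_frame=True)  # length 4: build 5, drop last
```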
That comment is slightly imprecise. It means that at training time we need to return triplets of (query, positive, negatives), whereas at inference time, or when we need to initialize NetVLAD, we return a single sequence ('single image' in the comment).
If you look at the code, the function after line 328 (the one you mentioned) computes the mining. In the end, it fills triplets_global_indexes.
https://github.com/vandal-vpr/vg-transformers/blob/3947df2469d54aca7dfe3b6f6b5b22c242c1c41b/tvg/datasets/dataset.py#L351
So in the end it contains the indexes of the query, positive, and negative sequences.
The structure self.pIdx contains, for each query sequence, the indexes of its hard-positive sequences.
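To illustrate the layout described above (the indices and the number of negatives per row are made up for the example; the repo's actual number of mined negatives is configurable):

```python
# Each row of triplets_global_indexes holds, for one training tuple:
# [query_idx, positive_idx, neg_idx_1, ..., neg_idx_n]
triplets_global_indexes = [
    [3, 15, 42, 77],  # query 3, its mined positive 15, two hard negatives
    [8, 21, 64, 90],
]
q, p, *negs = triplets_global_indexes[0]
```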
hope I was clear
Thank you so much for your response. All my doubts are cleared.
I have a few more queries and was hoping you could elaborate on these.
Why do q and self.qIdx[q] have different values?
https://github.com/vandal-vpr/vg-transformers/blob/3947df2469d54aca7dfe3b6f6b5b22c242c1c41b/tvg/datasets/dataset.py#L340
How are those negatives found that most violate the triplet constraint through this function? https://github.com/vandal-vpr/vg-transformers/blob/3947df2469d54aca7dfe3b6f6b5b22c242c1c41b/tvg/datasets/dataset.py#L315
What are descriptors_num and descs_num_per_image in initialize_seqvlad_layer()?
What is this, and why is it changed in line 239? https://github.com/vandal-vpr/vg-transformers/blob/3947df2469d54aca7dfe3b6f6b5b22c242c1c41b/tvg/models/pooling.py#L183
Thanks!
Hello,
Among all the sequences in 'queries_paths', there are some that do not have any positive match in the database. For this reason, qIdx contains the indices of the query sequences that actually have matches. In the line you cited, for the mining we sample queries among those with positives; so once we have 'q', we use qIdx to retrieve 'qidx', which is the index used to access the actual image paths.
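The indirection can be sketched with toy data (all names and values here are hypothetical, just mirroring the shapes described above):

```python
import random

# 5 query sequences, but only 3 of them have positives in the database.
queries_paths = ["q0", "q1", "q2", "q3", "q4"]
qIdx = [0, 2, 4]  # indices of queries that have at least one hard positive
pIdx = {0: [10, 11], 2: [30], 4: [51, 52]}  # hard-positive indices per query

q = random.randrange(len(qIdx))  # sample only among queries WITH positives
qidx = qIdx[q]                   # map back to the index into queries_paths
query_path = queries_paths[qidx]
positives = pIdx[qidx]
```

So q indexes into qIdx, while qIdx[q] indexes into the full list of query paths, which is why the two values differ.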
That function receives the features of the candidate negative samples, which are all samples farther than 25 m from the query. Then, using faiss, we compute a kNN to find the samples whose descriptors are closest to the query in feature space. In this way we select samples that, although far away in geographical space, are close in feature space and thus visually similar. These are called 'hard negatives', and they are the ones that most violate the triplet constraint.
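The mining step can be sketched with a plain numpy nearest-neighbour search standing in for faiss (the function name is hypothetical; the repo uses a faiss index for speed, but the logic is the same):

```python
import numpy as np

def mine_hard_negatives(query_feat, candidate_feats, n_negatives=10):
    """Among geographically far candidates (> 25 m from the query), pick the
    ones whose descriptors are CLOSEST to the query in feature space:
    visually similar but in the wrong place, i.e. the negatives that most
    violate the triplet constraint."""
    dists = np.linalg.norm(candidate_feats - query_feat, axis=1)
    return np.argsort(dists)[:n_negatives]

rng = np.random.default_rng(0)
query = rng.normal(size=8)
candidates = rng.normal(size=(100, 8))  # features of far-away samples
hard_neg_idxs = mine_hard_negatives(query, candidates, n_negatives=5)
```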
SeqVLAD, in the same way as NetVLAD, needs to initialize its cluster centroids. That function is copied from the NetVLAD initialization: we take 50k descriptors, 100 per image (these numbers are arbitrary). So we take 500 images, extract their descriptors, and run clustering on them to initialize the centroids.
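As a minimal sketch of that initialization (the repo delegates the clustering to faiss k-means; here a few plain Lloyd iterations in numpy stand in for it, and the toy sizes are scaled down from the 500-images / 50k-descriptors setup described above):

```python
import numpy as np

def init_centroids(images_descs, descs_per_image=100, n_clusters=64,
                   iters=10, seed=0):
    """Sample descs_per_image descriptors from each image, then run a few
    k-means iterations; the resulting centroids initialize (Seq/Net)VLAD."""
    rng = np.random.default_rng(seed)
    sampled = [d[rng.choice(len(d), size=descs_per_image, replace=False)]
               for d in images_descs]
    X = np.concatenate(sampled)  # e.g. 500 images * 100 = 50k descriptors
    centroids = X[rng.choice(len(X), size=n_clusters, replace=False)].copy()
    for _ in range(iters):  # plain Lloyd iterations
        assign = ((X[:, None, :] - centroids[None]) ** 2).sum(-1).argmin(1)
        for k in range(n_clusters):
            if (assign == k).any():
                centroids[k] = X[assign == k].mean(0)
    return centroids

# Toy scale: 5 images with 200 16-dim descriptors each, 8 clusters.
toy_images = [np.random.default_rng(i).normal(size=(200, 16)) for i in range(5)]
centroids = init_centroids(toy_images, descs_per_image=100, n_clusters=8)
```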
self.features_dim is the size of the final descriptors, which is given by the dimension of a single feature times the number of centroids. In line 239 it is not actually changed; it is used to compute a separate variable. It is divided by 64 because 64 is the number of centroids, which yields the dimension of a single feature. I do recognize that this is poorly written, with that hardcoded 64.
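Concretely, the arithmetic looks like this (384 is an example per-feature dimension, matching the "384" in CCT384; only the 64 centroids are stated above):

```python
n_centroids = 64
single_feature_dim = 384                         # example: CCT384 token dim
features_dim = single_feature_dim * n_centroids  # final descriptor size
per_feature_dim = features_dim // 64             # what line 239 recovers
```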
We take the first frame of each sequence because we kept the same initialization as NetVLAD, which uses single images.
hope I was clear
Thanks a lot for the clarification. I have a few follow up questions on the initialization of cluster centroids.
From the link that you cited, they do not use this method to compute alpha (it is assigned manually) or the centroids (they are initialized randomly).
Could you please explain how you arrived at this?
https://github.com/vandal-vpr/vg-transformers/blob/3947df2469d54aca7dfe3b6f6b5b22c242c1c41b/tvg/models/pooling.py#L198
If I understand correctly, dots here represents the dot product of each cluster centroid with each of the 50k descriptors. Why is it then sorted, and the mean of the difference between the first two rows taken?
Why have you used the normalized centroids for initializing w_k https://github.com/vandal-vpr/vg-transformers/blob/3947df2469d54aca7dfe3b6f6b5b22c242c1c41b/tvg/models/pooling.py#L201 but the unnormalized centroids for initializing c_k? https://github.com/vandal-vpr/vg-transformers/blob/3947df2469d54aca7dfe3b6f6b5b22c242c1c41b/tvg/models/pooling.py#L199
Hello, sorry for the delay.
The answer is the same as for the other questions, I guess.
Thank you for your response! Both the links that you have given are the same. Is there a typo here?
Yes, sorry; this is the right link: https://github.com/Nanne/pytorch-NetVlad/blob/master/netvlad.py
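For context on the dots question above, the idea behind that alpha initialization can be sketched as follows (a simplified numpy version, not the exact repo code): sorting each column of dots descending gives, per descriptor, its best and second-best centroid similarity, and alpha is chosen so that the soft assignment of the best centroid is roughly 100x that of the runner-up (hence the -log(0.01)).

```python
import numpy as np

def netvlad_alpha(centroids, descriptors):
    """Pick alpha so that exp(alpha * best) / exp(alpha * second_best) ~ 100
    on average over the training descriptors."""
    clsts_assign = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    dots = clsts_assign @ descriptors.T  # (n_clusters, n_descriptors)
    dots = -np.sort(-dots, axis=0)       # each column sorted descending
    # dots[0] / dots[1]: best / second-best similarity per descriptor
    return -np.log(0.01) / np.mean(dots[0] - dots[1])

rng = np.random.default_rng(1)
clsts = rng.normal(size=(8, 16))
descs = rng.normal(size=(100, 16))
alpha = netvlad_alpha(clsts, descs)
```

The normalized centroids are what the assignment weights w_k need (alpha * clsts_assign becomes the conv weights), while the residual centroids c_k stay unnormalized.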
Thank you so much for your help! Closing the thread
Hi, I want to understand the training of CCT384 + SeqVLAD model on MSLS. I have some doubts in dataset loading.
self.triplets_global_indexes is empty? https://github.com/vandal-vpr/vg-transformers/blob/3947df2469d54aca7dfe3b6f6b5b22c242c1c41b/tvg/datasets/dataset.py#L328
Could you please help me with these? Thanks in advance!