Closed divyagupta25 closed 1 year ago
Hello,
Traditionally in VPR, during training, images within 10 m of the query are considered suitable to be mined as positives, whereas during inference matches are usually considered correct if within 25 m. Only images within 10 m are used as training positives because they have a higher chance of high visual overlap with the query. However, even if we choose positives within 10 m, we want to avoid using images in the (10, 25) m range as negatives, as this would push them away from the query; even if they might have not-so-high visual overlap, they would still be considered correct matches if ever retrieved. For this reason we keep this distinction during training: soft positives are the images within 25 m. We need this distinction so that when we select positives, we take them from hard_positives, and when we select negatives, we make sure not to select any of them from the soft_positives list. During inference we use only hard_positives, but with the threshold raised to 25 m, as we do not need to do mining.
Note that this discussion only makes sense for training; at inference we do not care about visual overlap and we are simply happy if, for whatever reason, we retrieve images within 25 m. In VPR there is no clear definition of positive images based on coordinates alone. Seminal works like NetVLAD were developed without heading information, and thus relied on mining to select positives: among images within 10 m, select the one with the highest feature similarity to the query. Even though more recent datasets like MSLS and SF-XL do provide heading information (and the CosPlace paper uses it to build classes, doing classification rather than metric learning via a triplet loss), using heading to define positives has never been studied for retrieval. For this reason, following common practice in the literature, we do not try to use heading: we simply use 10 m as the threshold for hard positives during training, and use the network to extract features and pick the closest candidates in feature space, which are thus most likely to have high visual overlap.
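The two thresholds described above can be sketched as follows. This is a simplified illustration with hypothetical names (the actual dataset code computes distances with kNN over UTM coordinates), but it shows the three groups: hard positives (mining candidates), soft positives (never used as negatives), and negatives.

```python
import numpy as np

def split_by_distance(query_xy, db_xy, hard_thresh=10.0, soft_thresh=25.0):
    """Partition database indices for one query by geographic distance (meters).

    hard positives: within 10 m  -> candidates for positive mining
    soft positives: within 25 m  -> excluded from negatives; correct at inference
    negatives:      farther than 25 m
    """
    dists = np.linalg.norm(db_xy - query_xy, axis=1)
    hard_positives = np.where(dists <= hard_thresh)[0]
    soft_positives = np.where(dists <= soft_thresh)[0]
    negatives = np.where(dists > soft_thresh)[0]
    return hard_positives, soft_positives, negatives

# Toy UTM-like coordinates: database images at 5 m, 20 m, and 100 m from the query.
db = np.array([[0.0, 5.0], [0.0, 20.0], [0.0, 100.0]])
hard, soft, neg = split_by_distance(np.array([0.0, 0.0]), db)
```

Note that the image at 20 m lands in soft positives only: it is never mined as a positive, but it is also never pushed away as a negative.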
It can happen (and it does in MSLS) that a query sequence does not have any database sequence within 25 m.
The parameter 'cut_last_frame' is needed because the code that builds sequences only works with odd sequence lengths (3, 5, 7, ...). If you want to use even-length sequences, you can set cut_last_frame to true, which cuts the last frame. In practice a length of 5 is usually used.
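The idea can be sketched like this (a hypothetical simplification, not the repo's exact builder): an even length is obtained by building one frame longer around the center and dropping the last frame.

```python
def build_sequence(center_idx, seq_len, cut_last_frame=False):
    """Return frame indices centered on center_idx.

    The builder itself only handles odd lengths; for an even seq_len,
    build seq_len + 1 frames and pass cut_last_frame=True.
    """
    build_len = seq_len + 1 if cut_last_frame else seq_len
    assert build_len % 2 == 1, "sequence builder only supports odd lengths"
    half = build_len // 2
    frames = list(range(center_idx - half, center_idx + half + 1))
    return frames[:-1] if cut_last_frame else frames

odd_seq = build_sequence(10, 5)                        # length 5 around frame 10
even_seq = build_sequence(10, 4, cut_last_frame=True)  # length 4: build 5, drop last
```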
That comment is slightly imprecise. It means that at training time we need to return triplets of (query, positive, negatives), whereas at inference time, or when we need to initialize NetVLAD, we return a single sequence ('single image' in the comment).
If you look at the code, the function after line 328 (the one you mentioned) computes the mining. In the end, it fills triplets_global_indexes.
https://github.com/vandal-vpr/vg-transformers/blob/3947df2469d54aca7dfe3b6f6b5b22c242c1c41b/tvg/datasets/dataset.py#L351
So in the end it contains the indexes of the query, positive, and negative sequences.
The structure self.pIdx contains, for each query sequence, the indexes of its hard-positive sequences.
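To illustrate the layout described above (the indices and the number of negatives per row are made up for the example; the repo's actual number of mined negatives is configurable):

```python
# Each row of triplets_global_indexes holds, for one training tuple:
# [query_idx, positive_idx, neg_idx_1, ..., neg_idx_n]
triplets_global_indexes = [
    [3, 15, 42, 77],  # query 3, its mined positive 15, two hard negatives
    [8, 21, 64, 90],
]
q, p, *negs = triplets_global_indexes[0]
```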
hope I was clear
Thank you so much for your response. All my doubts are cleared.
I have a few more queries and was hoping you could elaborate on these.
Why do q and self.qIdx[q] have different values?
https://github.com/vandal-vpr/vg-transformers/blob/3947df2469d54aca7dfe3b6f6b5b22c242c1c41b/tvg/datasets/dataset.py#L340
How are those negatives found that most violate the triplet constraint through this function? https://github.com/vandal-vpr/vg-transformers/blob/3947df2469d54aca7dfe3b6f6b5b22c242c1c41b/tvg/datasets/dataset.py#L315
What are descriptors_num and descs_num_per_image in initialize_seqvlad_layer()?
What is this, and why is it changed in line 239? https://github.com/vandal-vpr/vg-transformers/blob/3947df2469d54aca7dfe3b6f6b5b22c242c1c41b/tvg/models/pooling.py#L183
Thanks!
Hello,
Among all the sequences in 'queries_paths', there are some that do not have any positive match in the database. For this reason, qIdx contains the indices of the query sequences that actually have matches. In the line you cited, for the mining we sample queries among those with positives; so once we have 'q', we use qIdx to retrieve 'qidx', which is the index used to access the actual image paths.
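The indirection can be sketched with toy data (all names and values here are hypothetical, just mirroring the shapes described above):

```python
import random

# 5 query sequences, but only 3 of them have positives in the database.
queries_paths = ["q0", "q1", "q2", "q3", "q4"]
qIdx = [0, 2, 4]  # indices of queries that have at least one hard positive
pIdx = {0: [10, 11], 2: [30], 4: [51, 52]}  # hard-positive indices per query

q = random.randrange(len(qIdx))  # sample only among queries WITH positives
qidx = qIdx[q]                   # map back to the index into queries_paths
query_path = queries_paths[qidx]
positives = pIdx[qidx]
```

So q indexes into qIdx, while qIdx[q] indexes into the full list of query paths, which is why the two values differ.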
That function receives the features of the candidate negative samples, which are all samples farther than 25 m from the query. Then, using faiss, we compute a kNN to find the samples whose descriptors are closest to the query in feature space. In this way we select samples that, although far away in geographical space, are close in feature space and thus visually similar. These are called 'hard negatives', and they are the ones that most violate the triplet constraint.
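The mining step can be sketched with a plain numpy nearest-neighbour search standing in for faiss (the function name is hypothetical; the repo uses a faiss index for speed, but the logic is the same):

```python
import numpy as np

def mine_hard_negatives(query_feat, candidate_feats, n_negatives=10):
    """Among geographically far candidates (> 25 m from the query), pick the
    ones whose descriptors are CLOSEST to the query in feature space:
    visually similar but in the wrong place, i.e. the negatives that most
    violate the triplet constraint."""
    dists = np.linalg.norm(candidate_feats - query_feat, axis=1)
    return np.argsort(dists)[:n_negatives]

rng = np.random.default_rng(0)
query = rng.normal(size=8)
candidates = rng.normal(size=(100, 8))  # features of far-away samples
hard_neg_idxs = mine_hard_negatives(query, candidates, n_negatives=5)
```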
SeqVLAD, in the same way as NetVLAD, needs to initialize its cluster centroids. That function is copied from the NetVLAD initialization: we take 50k descriptors, 100 per image (these numbers are arbitrary). So we take 500 images, extract their descriptors, and run clustering on them to initialize the centroids.
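As a minimal sketch of that initialization (the repo delegates the clustering to faiss k-means; here a few plain Lloyd iterations in numpy stand in for it, and the toy sizes are scaled down from the 500-images / 50k-descriptors setup described above):

```python
import numpy as np

def init_centroids(images_descs, descs_per_image=100, n_clusters=64,
                   iters=10, seed=0):
    """Sample descs_per_image descriptors from each image, then run a few
    k-means iterations; the resulting centroids initialize (Seq/Net)VLAD."""
    rng = np.random.default_rng(seed)
    sampled = [d[rng.choice(len(d), size=descs_per_image, replace=False)]
               for d in images_descs]
    X = np.concatenate(sampled)  # e.g. 500 images * 100 = 50k descriptors
    centroids = X[rng.choice(len(X), size=n_clusters, replace=False)].copy()
    for _ in range(iters):  # plain Lloyd iterations
        assign = ((X[:, None, :] - centroids[None]) ** 2).sum(-1).argmin(1)
        for k in range(n_clusters):
            if (assign == k).any():
                centroids[k] = X[assign == k].mean(0)
    return centroids

# Toy scale: 5 images with 200 16-dim descriptors each, 8 clusters.
toy_images = [np.random.default_rng(i).normal(size=(200, 16)) for i in range(5)]
centroids = init_centroids(toy_images, descs_per_image=100, n_clusters=8)
```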
self.features_dim is the size of the final descriptors, which is given by the dimension of a single feature times the number of centroids. In line 239 it is not actually changed; it is used to compute a separate variable. It is divided by 64 because 64 is the number of centroids, which yields the dimension of a single feature. I do recognize that this is poorly written, with that hardcoded 64.
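Concretely, the arithmetic looks like this (384 is an example per-feature dimension, matching the "384" in CCT384; only the 64 centroids are stated above):

```python
n_centroids = 64
single_feature_dim = 384                         # example: CCT384 token dim
features_dim = single_feature_dim * n_centroids  # final descriptor size
per_feature_dim = features_dim // 64             # what line 239 recovers
```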
We take the first frame of each sequence because we kept the same initialization as NetVLAD, which uses single images.
hope I was clear
Thanks a lot for the clarification. I have a few follow up questions on the initialization of cluster centroids.
From the link that you cited, they do not use this method to compute alpha (it is assigned manually) or the centroids (they are initialized randomly).
Could you please explain how you arrived at this?
https://github.com/vandal-vpr/vg-transformers/blob/3947df2469d54aca7dfe3b6f6b5b22c242c1c41b/tvg/models/pooling.py#L198
If I understand correctly, dots here represents the dot product of each cluster centroid with each of the 50k descriptors. Why is it then sorted, and the mean of the difference between the first two rows taken?
Why have you used the normalized centroids for initializing w_k https://github.com/vandal-vpr/vg-transformers/blob/3947df2469d54aca7dfe3b6f6b5b22c242c1c41b/tvg/models/pooling.py#L201 but the unnormalized centroids for initializing c_k? https://github.com/vandal-vpr/vg-transformers/blob/3947df2469d54aca7dfe3b6f6b5b22c242c1c41b/tvg/models/pooling.py#L199
Hello, sorry for the delay.
The answer is the same as for the other questions, I guess.
Thank you for your response! Both the links that you have given are the same. Is there a typo here?
Yes, sorry; this is the right link: https://github.com/Nanne/pytorch-NetVlad/blob/master/netvlad.py
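For context on the dots question above, the idea behind that alpha initialization can be sketched as follows (a simplified numpy version, not the exact repo code): sorting each column of dots descending gives, per descriptor, its best and second-best centroid similarity, and alpha is chosen so that the soft assignment of the best centroid is roughly 100x that of the runner-up (hence the -log(0.01)).

```python
import numpy as np

def netvlad_alpha(centroids, descriptors):
    """Pick alpha so that exp(alpha * best) / exp(alpha * second_best) ~ 100
    on average over the training descriptors."""
    clsts_assign = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    dots = clsts_assign @ descriptors.T  # (n_clusters, n_descriptors)
    dots = -np.sort(-dots, axis=0)       # each column sorted descending
    # dots[0] / dots[1]: best / second-best similarity per descriptor
    return -np.log(0.01) / np.mean(dots[0] - dots[1])

rng = np.random.default_rng(1)
clsts = rng.normal(size=(8, 16))
descs = rng.normal(size=(100, 16))
alpha = netvlad_alpha(clsts, descs)
```

The normalized centroids are what the assignment weights w_k need (alpha * clsts_assign becomes the conv weights), while the residual centroids c_k stay unnormalized.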
Thank you so much for your help! Closing the thread
Hi, I want to understand the training of CCT384 + SeqVLAD model on MSLS. I have some doubts in dataset loading.
self.triplets_global_indexes is empty? https://github.com/vandal-vpr/vg-transformers/blob/3947df2469d54aca7dfe3b6f6b5b22c242c1c41b/tvg/datasets/dataset.py#L328
Could you please help me with these? Thanks in advance!