The image GlobalDescriptor was not implemented.
Is there a specific question? The GlobalDescriptor is currently a placeholder; there is no built-in mechanism to create one or even to use it to find loop closures. It was introduced to help integration with https://github.com/MarvinStuede/cmr_lidarloop, so that a GlobalDescriptor can be saved in the database and re-extracted afterwards for external loop closure detection.
I recently re-read previous papers to consider how to utilize the global descriptor, especially VLAD. It's mainly useful in two ways: evaluating the similarity between nodes, and fast node retrieval.
Currently, the similarity is evaluated by https://github.com/introlab/rtabmap/blob/a0476af896f43361124be875925114305766b415/corelib/src/Signature.cpp#L234 and the likelihood is calculated via similarity or TF-IDF. With the global descriptors of two nodes, similarity can be obtained by directly calculating their scalar product. In particular, when the VLAD descriptors are normalized, the result is actually the cosine similarity.
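For illustration, here is a minimal sketch of that idea (hypothetical code, not what rtabmap actually does); with L2-normalized descriptors, the scalar product is exactly the cosine similarity:

```cpp
#include <cassert>
#include <vector>

// Scalar product of two global descriptors. If both are L2-normalized
// (as VLAD/NetVLAD outputs usually are), this equals cos(theta).
float descriptorSimilarity(const std::vector<float> & a,
                           const std::vector<float> & b)
{
    assert(a.size() == b.size());
    float dot = 0.0f;
    for(size_t i = 0; i < a.size(); ++i)
    {
        dot += a[i] * b[i];
    }
    return dot; // in [-1, 1] for normalized descriptors
}
```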
Because VLAD encodes an image into very little data, the global descriptors for a large number of nodes can be kept in memory for fast retrieval. For the 4096-D VLAD output by NetVLAD or HF-Net, each global descriptor occupies 16 KB (4096 x 4-byte floats), so 10,000 nodes would take about 164 MB. KNN search can then be used to find candidate loop closure nodes in LTM, or to filter the nodes used to calculate the likelihood in WM, similar to LoopGPS.
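As a rough sketch of what such in-memory retrieval could look like (hypothetical code, not part of rtabmap), a brute-force top-k search is already cheap at this scale:

```cpp
#include <algorithm>
#include <functional>
#include <numeric>
#include <utility>
#include <vector>

// Brute-force k-NN over L2-normalized global descriptors kept in memory.
// descriptors: one 4096-D vector per node (10,000 nodes -> ~164 MB of floats).
// Returns (similarity, nodeIndex) pairs of the k most similar nodes.
std::vector<std::pair<float, size_t> > knnSearch(
        const std::vector<std::vector<float> > & descriptors,
        const std::vector<float> & query,
        size_t k)
{
    std::vector<std::pair<float, size_t> > scored;
    scored.reserve(descriptors.size());
    for(size_t i = 0; i < descriptors.size(); ++i)
    {
        float dot = std::inner_product(descriptors[i].begin(),
                                       descriptors[i].end(),
                                       query.begin(), 0.0f);
        scored.push_back(std::make_pair(dot, i));
    }
    k = std::min(k, scored.size());
    // Keep only the k highest similarities.
    std::partial_sort(scored.begin(), scored.begin() + k, scored.end(),
                      std::greater<std::pair<float, size_t> >());
    scored.resize(k);
    return scored;
}
```

For much larger maps, an approximate index (e.g., FLANN) could replace the linear scan.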
Another question is whether the Bayesian filter is still necessary when using a global descriptor. I currently believe that evaluating the likelihood is more reasonable than using the similarity directly; after all, it is a bit difficult to determine thresholds for different types of global descriptors. The likelihood can tell whether a similarity is truly significant, and eliminates unnecessary comparisons.
For now we can keep all global descriptors in memory. However, RTAB-Map's memory management combined with VLAD's compactness could make it possible to tackle extremely large-scale scenarios (such as city scale) in the future; at that point we may also need to consider memory management for the global descriptors themselves.
The Bayes filter is still useful to filter spurious high likelihoods (false positives), i.e., we need some consecutive high likelihoods in an area to trigger a loop closure/re-localization.
I saw your pull request https://github.com/introlab/rtabmap/pull/1255; I integrated the changes into https://github.com/introlab/rtabmap/pull/1163 (which I just updated with the latest master). You can give it a try. It seems to work, so I will merge it like this for now once CI is happy. The current issue is that the resulting loop closure hypotheses are low, so they do not trigger loop closures. However, the highest hypothesis seems to be the right one. Here is an example (likelihood computed with NetVLAD using your similarity approach):
Here is the full result using the sample dataset (on the left with NetVLAD, on the right with the TF-IDF BOW approach):
We can see that the highest hypotheses are pretty much the same between the two approaches, though the actual hypothesis value is lower with NetVLAD. I would need to spend more time to see why; maybe adding a scaling factor on the similarity could make the best hypothesis stand out more against the others. This could be related to how we compute the "no loop closure" likelihood (see this).
Command used:
```
$ cd workspace/rtabmap/build/bin
$ ./rtabmap-console \
    --Mem/GlobalDescriptorStrategy 1 \
    --Kp/TfIdfLikelihoodUsed false \
    --Mem/RehearsalSimilarity 1 \
    --PyDescriptor/Dim 4096 \
    --PyDescriptor/Path ~/workspace/netvlad_tf_open/python/rtabmap_netvlad.py \
    ../../data/samples
```
Python script to extract NetVLAD descriptors: https://github.com/introlab/rtabmap/blob/pydescriptor/corelib/src/python/rtabmap_netvlad.py
Glad to see the result. I marked #1255 as draft because I haven't tested it yet; the VLAD output of HF-Net on the OAK camera is still incorrect, so I originally wanted to use the PR to verify the model's output first. Since NetVLAD is also available, I will test it on Jetson later.
The current issue may be due to the fact that we are using cosine similarity to evaluate similarity (All about VLAD, Section 3), while the loss during NetVLAD training uses the Euclidean distance. The Euclidean distance should range from 0 to 2 when the descriptors are normalized. In addition, the range of the cosine similarity is actually -1 to 1, while Signature::compareTo() expects 0 to 1: 0 means no correlation between locations, while -1 means negative correlation, and I can't imagine what kind of locations would be negatively correlated :> But the cosine similarity is also linearly related to the squared Euclidean distance, which seems easier to understand. Obviously the similarity evaluation here needs to be adjusted. You can try to see which formulation is more reasonable. Perhaps we should also check the distribution of the calculated similarities.
If cosine similarity works, it can then be rewritten as a matrix-vector multiplication, which allows efficient one-to-many similarity evaluation.
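A sketch of that one-to-many formulation (hypothetical code, assuming Eigen): stack all stored descriptors as rows of a matrix, and a single matrix-vector product then gives the cosine similarity to every node at once.

```cpp
#include <Eigen/Core>

// D: N x 4096 matrix with one L2-normalized descriptor per row.
// q: 4096-D L2-normalized query descriptor.
// Each entry of the result is dot(D.row(i), q) = cos(theta_i).
Eigen::VectorXf allSimilarities(const Eigen::MatrixXf & D,
                                const Eigen::VectorXf & q)
{
    return D * q;
}
```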
I'll give it a try using the Euclidean distance as a similarity measure and see if there is a big difference. From the original paper:
At test time, the visual search is performed by finding the nearest database image to the query, either exactly or through fast approximate nearest neighbour search, by sorting images based on the Euclidean distance d(q, Ii) between f(q) and f(Ii).
Reading the paper you linked, I see why you used the scalar product:
The similarity between VLAD descriptors is measured as the scalar product between them, and this decomposes as the sum of scalar products of aggregated residuals for each coarse cluster independently. [...] Thus, the similarity measure induced by the VLAD descriptors is increased if the scalar product between the residuals is positive, and decreased otherwise.
But yeah, we indeed cannot use the scalar product directly; we would need to rescale it between 0 and 1.
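One possible rescaling (an assumption on my side, not necessarily what the PR ends up doing) is a linear map from [-1, 1] to [0, 1]:

```cpp
// Map the cosine similarity from [-1, 1] to the [0, 1] range expected by
// Signature::compareTo(). For normalized descriptors this is equivalent to
// 1 - d^2/4, since the squared Euclidean distance d^2 = 2 - 2*cos(theta).
float rescaledSimilarity(float dot)
{
    return (dot + 1.0f) / 2.0f;
}
```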
But the cosine similarity is also linearly related to the squared Euclidean distance
Σ((Xna - Xnb)^2) = Σ(Xna^2) + Σ(Xnb^2) - 2Σ(Xna * Xnb) = 1 + 1 - 2 cosθ = 2 - 2 cosθ
Thanks for the equations. I compared the L2 distance (rescaled between 0 and 1) versus the dot product (rescaled between 0 and 1) and they are indeed linearly related:
I updated the PR with the rescaled dot product version. The advantage of the dot product over L2 is that it is much faster to compute (no sqrt). In both cases the highest hypothesis looks fine; however, because the mean is so much larger than the std, we get a very large likelihood for the "new place" hypothesis, computed by this equation:
https://github.com/introlab/rtabmap/blob/11adbdcc9f4edcf047e2f0c147ea6f888d780074/corelib/src/Rtabmap.cpp#L5291
From this paper:
For the new location probability St = −1, the likelihood is evaluated using (4): p(Lt|St = −1) = L(St = −1|Lt) = μ/σ + 1. If L(St = −1|Lt) is high (i.e., Lt is not similar to a particular location in WM, as σ < μ), then Lt is more likely to be a new location.
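As a sketch of that normalization (my reading of the paper's description; the actual implementation is in the Rtabmap.cpp link above):

```cpp
#include <map>

// Adjust raw likelihoods as described in the paper: scores significantly
// above the mean are normalized by (l - sigma) / mu, the others are set
// to 1, and the virtual "new place" location (the paper's St = -1) gets
// mu / sigma + 1. mu and sigma are the mean and standard deviation of
// the raw likelihood values.
void adjustLikelihood(std::map<int, float> & likelihood, float mu, float sigma)
{
    for(std::map<int, float>::iterator iter = likelihood.begin();
        iter != likelihood.end(); ++iter)
    {
        if(iter->first == -1)              // virtual "new place" location
        {
            iter->second = sigma != 0.0f ? mu / sigma + 1.0f : 1.0f;
        }
        else if(iter->second > mu + sigma) // significantly similar place
        {
            iter->second = (iter->second - sigma) / mu;
        }
        else                               // not significant
        {
            iter->second = 1.0f;
        }
    }
}
```

With mean much larger than std (as with the NetVLAD similarities here), mu/sigma + 1 blows up, which is exactly the large "new place" likelihood observed above.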
Here is the comparison of the raw likelihood (and its effect on the adjusted likelihood and the Bayes filter posterior) between the NetVLAD dot product similarity, the direct local features similarity (pairs/totalWords), and the TF-IDF approach on the NewCollege dataset, for a specific image.
NetVLAD dot product similarity:
Direct local features similarity (pairs/totalWords):
TF-IDF:
In conclusion, with global descriptors, we would need to make the similarity lower when images are not taken at the same place. This would decrease the mean and increase the std, so that the adjusted likelihood for the "new place" would be around 2 to 6 (like the local descriptors approach) instead of 15 to 20 (suppressing the loop closure scores significantly). The current assumption for the likelihood is that the std would be higher than the mean if the place is similar only to one particular place. For NetVLAD similarity with mean > std, it means the current image looks a lot like every other image in the dataset, so it is probably best not to loop on it, as it is not discriminating enough. I guess at this point I would need to read more about how people have used it for loop closure detection.
In conclusion, with global descriptors, we would need to make the similarity lower when images are not taken at the same place.
This is also the expected behavior resulting from the training objective. According to the NetVLAD paper, Section 4, the goal is to make the distance between the query image and the positive smaller than the distances to the negatives. But we also care about how different these distances are, which is controlled by the margin m in the loss function; Appendix A mentions m = 0.1. This makes the differences in similarity less significant than those computed by other methods.
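For reference, this is the weakly supervised triplet ranking loss from the NetVLAD paper (Section 4), as I understand it; only a gap of m between the best positive and each negative is enforced:

```latex
% NetVLAD weakly supervised ranking loss: for each query q, the closest
% potential positive p_i^q must be closer than every definite negative
% n_j^q by at least the margin m (hinge at 0).
L_\theta = \sum_j \max\left(0,\ \min_i d_\theta^2\!\left(q, p_i^q\right) + m - d_\theta^2\!\left(q, n_j^q\right)\right)
```

Since d^2 can span [0, 4] for normalized descriptors, a margin of m = 0.1 only asks for a small separation, consistent with the compressed similarity range we are seeing.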
I was checking this other paper, and they observed the same results I saw earlier (a descriptor too similar to all other descriptors), and they also found that to be a problem. They did a PCA on some images similar to the dataset to find which dimensions of the descriptor are more discriminative, then compute the similarity using only those dimensions.
From:
To:
On the left is the similarity matrix between all images of the dataset, and on the right a sample of one image compared to all the others. There is a bigger difference after applying their PCA approach.
From what I understand, the NetVLAD descriptor is already the PCA result (normalized), so the first value is the most discriminative dimension. This is why we can take only the first 128 of the 4096 values and get similar performance. In that paper, they used a particle filter to smooth the detections, but they kind of hard-coded the minimum distance between descriptors to be considered as possible loop closures.
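A sketch of that truncation (hypothetical code; it assumes, as described above, that the descriptor dimensions are PCA-ordered by decreasing discriminability). After truncation the vector has to be re-normalized so that the dot product remains a cosine similarity:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Keep only the first dims values of a PCA-ordered descriptor and
// re-normalize, so that the dot product is still a cosine similarity.
std::vector<float> truncateDescriptor(const std::vector<float> & desc,
                                      size_t dims)
{
    std::vector<float> out(desc.begin(),
                           desc.begin() + std::min(dims, desc.size()));
    float norm = 0.0f;
    for(size_t i = 0; i < out.size(); ++i)
    {
        norm += out[i] * out[i];
    }
    norm = std::sqrt(norm);
    if(norm > 0.0f)
    {
        for(size_t i = 0; i < out.size(); ++i)
        {
            out[i] /= norm;
        }
    }
    return out; // e.g., truncateDescriptor(netvladDesc, 128)
}
```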
In most loop closure detection papers I've seen using global descriptors, they generally check the distance against a fixed threshold, then geometrically verify that the best match works; if it does, a loop closure is detected. For example, in Kimera-VIO-NetVLAD, they seem to do the same even with their BOW approach.
Another paper seems to have decent loop closure results, though its hypothesis selection is quite different from what we do with BOW. The results seem to vary more between datasets; they suggest having a robust back-end to ignore false positives.
I'll merge the PR. While not yet super useful in rtabmap like this for loop closure detection, one could still enable NetVLAD for localization: if we assume that the robot is always in the map, we can just always test the best hypothesis even if it is low.