Regarding DELF, we have used the full 1024D descriptors (without PCA) throughout our evaluation since we noticed it yielded significantly better results on the difficult day-to-night situations from the Aachen dataset. For the other methods, all of them are 128D except for SuperPoint, which is 256D. We do agree that the difference in dimensionality can give an edge to certain methods over the others, but so far we didn't look at ways to reduce the dimensionality of D2-Net descriptors (and some other methods don't provide it either). Thus, we evaluated the 'highest performance' each method (in its current state) can achieve by using the full descriptors. I will add the different dimensions when releasing the updated arXiv version of the paper!
As for the benchmarks, we have already released the code for the Aachen Day-Night evaluation at the following link: https://github.com/tsattler/visuallocalizationbenchmark/tree/master/local_feature_evaluation. Over the following weeks, I also plan on releasing a slightly modified version of the InLoc demo code (https://github.com/HajimeTaira/InLoc_demo) which can be used to run the benchmark with custom local features.
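For reference, here is a minimal sketch of how custom features for one image could be exported for that kind of evaluation. The filename pattern and the array keys ('keypoints', 'descriptors') are assumptions on my side and should be checked against the README of the repository above:

```python
import numpy as np

def save_features(image_path, method_name, keypoints, descriptors):
    # One feature file per image; file naming and keys are assumed, not guaranteed.
    np.savez(
        '{}.{}'.format(image_path, method_name),
        keypoints=keypoints.astype(np.float32),     # N x 2 array of (x, y) pixel coordinates
        descriptors=descriptors.astype(np.float32)  # N x D descriptor matrix
    )
```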
Ah, very interesting, thanks for clarifying! Since the dimensionality was not mentioned in the paper, I was assuming it was 40D for DELF since it's the default. It definitely sounds like something important to mention in your arXiv paper. Thanks for releasing the evaluation and demo code as well, these look quite helpful.
Regarding DELF, in our previous experiments we actually observed improved retrieval performance when reducing the dimensionality by PCA, compared to the original 1024D descriptors. This can be seen in an early draft of the paper we had published some time ago: Table 1 here. So now I am actually wondering if setting the DELF dimensionality to 512D would improve performance compared to the paper's results :) -- but this is unclear. You mentioned that you noticed that 1024D "yielded significantly better results". I am curious, do you have more details on that? (e.g., what dimensionalities you tried for DELF, etc.)
Thanks a lot for your insights! I only tried with and without PCA (40D and 1024D), but I don't have the numbers at hand anymore. I will re-run the experiments with different dimensionalities for DELF and get back to you tomorrow.
Excellent, thanks! Looking forward to the results.
BTW, forgot to mention: we released a new pre-trained version of DELF last week: https://github.com/tensorflow/models/tree/master/research/delf#pre-trained-models (the one pre-trained on the Google Landmarks dataset). We see a ~4% mAP boost in our CVPR'19 paper compared to the previous model (which is the one you used). So if you are re-running DELF experiments, maybe you can directly use the newly released model.
It is possible that whitening improves the retrieval performance but dimensionality reduction impairs it. This was already observed for image retrieval by Radenović et al. (cc @filipradenovic) in Fine-tuning CNN Image Retrieval with No Human Annotation. As such, how and on which data the whitening is computed do matter too.
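To make that concrete, here is a minimal sketch of learning PCA whitening (with optional dimensionality reduction) using scikit-learn. Which descriptor set the transform is fit on is left open in the sketch and, as noted above, matters in practice:

```python
import numpy as np
from sklearn.decomposition import PCA

def learn_whitening(train_descriptors, out_dim):
    # Fit PCA whitening on a held-out set of descriptors (n x D).
    # out_dim < D additionally reduces the dimensionality;
    # out_dim == D keeps a pure rotation + whitening.
    pca = PCA(n_components=out_dim, whiten=True)
    pca.fit(train_descriptors)
    return pca

def apply_whitening(pca, descriptors):
    projected = pca.transform(descriptors)
    # L2-normalize afterwards, as is common for retrieval descriptors.
    return projected / np.linalg.norm(projected, axis=1, keepdims=True)
```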
Yes, PCA-learned whitening (without dimensionality reduction) might improve performance, but it doesn't always happen. In our Revisiting Oxford and Paris paper we used 128D for DELF to be directly comparable to RootSIFT, and some initial experiments showed it to be better than the default 40D. Unfortunately, I don't have those results at hand.
Here, it is said:
Regarding DELF, we have used the full 1024D descriptors (without PCA)
If you do have experiments at hand, I am curious to know how PCA-whitened 1024D would perform. And, as @andrefaraujo said, there might be some benefits from dimensionality reduction to 512D or lower as well. I have seen cases where dimensionality reduction improves performance.
Sorry for the delay, I ran into some issues when benchmarking the different versions of DELF on the Aachen Day-Night dataset. My conclusion so far is that increasing the dimension correlates with better performance overall and a more stable 3D model in this scenario (matching of DELF features using mutual NN => 3D reconstruction from day-time images => registration of the night-time images in the 3D model). Here are the results for different descriptor sizes with the old weights and no whitening (the method name gives the descriptor dimension and the detection threshold):
[method=delf-40-25][0.5m,2deg]: 0.265
[method=delf-40-25][1m,5deg]: 0.490
[method=delf-40-25][5m,10deg]: 0.725
[method=delf-40-25][10m,25deg]: 0.827
[method=delf-128-25][0.5m,2deg]: 0.337
[method=delf-128-25][1m,5deg]: 0.582
[method=delf-128-25][5m,10deg]: 0.816
[method=delf-128-25][10m,25deg]: 0.939
[method=delf-256-25][0.5m,2deg]: 0.276
[method=delf-256-25][1m,5deg]: 0.602
[method=delf-256-25][5m,10deg]: 0.837
[method=delf-256-25][10m,25deg]: 0.949
[method=delf-512-25][0.5m,2deg]: 0.357
[method=delf-512-25][1m,5deg]: 0.571
[method=delf-512-25][5m,10deg]: 0.837
[method=delf-512-25][10m,25deg]: 0.959
# Reported in the paper
[method=delf-1024-25][0.5m,2deg]: 0.388
[method=delf-1024-25][1m,5deg]: 0.622
[method=delf-1024-25][5m,10deg]: 0.857
[method=delf-1024-25][10m,25deg]: 0.980
As I mentioned earlier, increasing from 40D to 1024D descriptors offers a boost of more than 10% in performance. 512D descriptors performed close to 1024D, but the obtained models were significantly more unstable - the results varied by as much as 5% between different runs. I suspect that this is due to a higher false-positive rate, but I'll have to investigate further. Please note that this is not the case for any other method; most of them vary by less than 1-2% due to the randomness of the RANSAC process.
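For clarity, the mutual nearest-neighbour matching step from the pipeline above can be written down as a plain NumPy sketch (an illustration, not the exact benchmark code):

```python
import numpy as np

def mutual_nn_matches(desc_a, desc_b):
    # Pairwise squared L2 distances between the two descriptor sets.
    dists = (
        np.sum(desc_a ** 2, axis=1)[:, None]
        - 2.0 * desc_a @ desc_b.T
        + np.sum(desc_b ** 2, axis=1)[None, :]
    )
    nn_ab = np.argmin(dists, axis=1)  # best match in B for each descriptor in A
    nn_ba = np.argmin(dists, axis=0)  # best match in A for each descriptor in B
    ids_a = np.arange(desc_a.shape[0])
    mutual = nn_ba[nn_ab] == ids_a    # keep pairs that agree in both directions
    return np.stack([ids_a[mutual], nn_ab[mutual]], axis=1)
```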
I will evaluate different versions of DELF (including the new weights) on the HPatches Image Pairs dataset when I find the time and update you with the results! That one is deterministic so I expect the difference to be more clear.
Excellent, thanks so much for these experiments! Really good to know the performance for these variants.
Just to confirm: these results are all obtained with the first version of the released DELF model, right? (i.e., the one you used in the paper, not the one we released recently)
Yes - these results (as well as the ones from the paper) are with the first version of DELF. I will download the new model and try it out. Did you notice any changes regarding the number of keypoints / detection scores with the new weights? From what I can see in delf_config_example.pbtxt the default parameters were kept the same.
Yes, we tend to see a larger number of features detected with the most recent model. Config parameters are unchanged indeed.
@mihaidusmanu Would it be possible for you to add results for the HPatches Sequence pairs test where you keep the number of detected features comparable for all algorithms? For instance, I'm very curious to see what the results look like when the number of detected keypoints is about 2K or so for all algorithms.
@vimalthilak I slightly modified the code of the benchmark to allow keeping only the top K features. I only tested trained D2-Net and Hessian Affine + RootSIFT for the moment. The performance of D2-Net (single-scale) is roughly the same even when considering the top 2K features only, but it gets worse for 1K (by ~5% MMA). I will try to release cached results for this as well, but I can't give you an ETA for that right now.
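For illustration, the top-K filtering amounts to something like the following generic sketch (ranking by detection score is an assumption; the actual modification to the benchmark code may differ):

```python
import numpy as np

def top_k_features(keypoints, scores, descriptors, k=2000):
    # Keep the k keypoints with the highest detection scores.
    order = np.argsort(-scores)[:k]
    return keypoints[order], scores[order], descriptors[order]
```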
@mihaidusmanu Thanks for the update. Very useful information and I'm looking forward to the cached updates.
I am sorry for the delay.
@vimalthilak I have uploaded the cached results with top 2K features for most methods (except LF-Net - where selecting more than 500 keypoints worsens the results - and D2-Net multiscale - where picking top 2K features based on activations doesn't work properly due to the summing of feature maps).
@andrefaraujo I added the cached results for the new DELF model on HPatches Sequences. For this task the results didn't change with the new model. All experiments were run with full descriptors and I tried to tune the detection threshold in order to get a similar number of features as before.
Here are the final results:
I will close this issue since all points have already been addressed. Feel free to open a new issue in case there are any other suggestions / problems.
@andrefaraujo, please forgive me for not opening a new issue, but I think many researchers are very concerned about the dimensions you mentioned. I am very interested in the results on the Illumination and Viewpoint subsets for DELF at different dimensions, such as 40, 128, 256 and 512. If that is not possible, could you provide more details?
Thanks for this very interesting paper, and congrats on the CVPR acceptance!
I have a follow-up question: I believe in your experiments the different compared feature descriptors have different dimensionalities. For example, DELF's default is 40D, while the D2-Net descriptor is 512D (and 128D for most of the other approaches), so this can have a large impact on the memory stored by the system. Did you perform experiments using the same dimensionality for the different descriptors? Since the performance of the different techniques is quite close across the experiments, I am wondering if this could be playing an important role here.
One quick way to try this out for DELF features is to tune the dimensionality, which can be done by simply changing the pca_dim parameter in the DELF config: https://github.com/tensorflow/models/blob/master/research/delf/delf/python/examples/delf_config_example.pbtxt#L22
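For convenience, a rough sketch of doing this from Python rather than editing the file by hand is below; the proto field names (delf_local_config, use_pca, pca_parameters.pca_dim) are given for illustration and may differ between versions of the library:

```python
from google.protobuf import text_format
from delf import delf_config_pb2

# Load the example config and change the PCA output dimensionality.
config = delf_config_pb2.DelfConfig()
with open('delf_config_example.pbtxt', 'r') as f:
    text_format.Merge(f.read(), config)

config.delf_local_config.use_pca = True                 # assumed field name
config.delf_local_config.pca_parameters.pca_dim = 512   # e.g. 512D instead of the default 40D
```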