serizba / salad

Optimal Transport Aggregation for Visual Place Recognition

Discrepancy Between Reported Results and Reproduction Attempts for DinoV2+SALAD #6

Status: Closed (Ahmedest61 closed this 6 months ago)

Ahmedest61 commented 9 months ago

Hello,

I am reaching out for assistance regarding the reproducibility of the DinoV2+SALAD model as detailed in your recent publication. I followed the training and evaluation pipeline provided in the repository and used the conda environment you supplied. However, my results do not align with those reported in the paper, specifically in Table 3 for the DinoV2+SALAD model.

Steps to Reproduce:

  1. Repository cloned from the provided link.
  2. Environment set up using the provided environment.yml.
  3. Followed the training pipeline instructions in the README, with no modifications to the default parameters.
  4. Ran the main script to perform the training.
  5. Ran the evaluation script to retrieve the results (see the Recall@K sketch below).

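For clarity, this is the metric I am comparing against Table 3. A minimal sketch of Recall@K as I understand the evaluation (function and variable names are mine, not the repository's):

```python
import torch

def recall_at_k(q_desc, db_desc, ground_truth, ks=(1, 5, 10)):
    """Recall@K for place recognition.

    q_desc:  (Q, D) L2-normalized query descriptors
    db_desc: (N, D) L2-normalized database descriptors
    ground_truth: list of sets; ground_truth[i] holds the database
                  indices counted as correct matches for query i
    """
    sims = q_desc @ db_desc.T                          # cosine similarity
    ranking = sims.argsort(dim=1, descending=True)     # best match first
    hits = torch.zeros(len(ks))
    for i, order in enumerate(ranking):
        for j, k in enumerate(ks):
            if any(int(idx) in ground_truth[i] for idx in order[:k]):
                hits[j] += 1
    return hits / q_desc.shape[0]                      # index 0 is Recall@1
```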
Expected Results: The reported results in your paper for DinoV2+SALAD:

Actual Results: The results I obtained were as follows:

Considering the differences in outcomes, I would like to ask whether any additional configurations or parameters, not documented in the repository, were applied to achieve the results in the paper.

Your assistance in resolving these reproducibility concerns would be invaluable, not only for my understanding but also for the benefit of the community at large. I am looking forward to your response and any guidance you can provide.

Thank you for your time and consideration.

Attachments: hparams.txt, metrics.csv, results.txt

serizba commented 9 months ago

Hi @Ahmedest61

Hope this helps!

Ahmedest61 commented 8 months ago

Hey @serizba,

Thanks for your prompt and informative response. Your insights are greatly appreciated.

I have indeed utilized the provided checkpoint weights and can confirm that they produce results closely aligning with those reported. This step was instrumental in verifying the baseline performance of the system.
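For reference, this is roughly how I ran the released weights (a sketch; the `VPRModel` import and the checkpoint path reflect my local setup and may not match the repository exactly):

```python
import torch
from main import VPRModel  # assumed: the LightningModule defined in the repo's main.py

# Path is a placeholder for wherever the released weights were downloaded.
model = VPRModel.load_from_checkpoint("weights/dino_salad.ckpt")
model.eval()

# Dummy batch just to show the expected input shape (B, 3, H, W).
images = torch.rand(4, 3, 322, 322)
with torch.no_grad():
    descriptors = model(images)  # one global descriptor per image
```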

However, in my subsequent attempts to train the SALAD system from scratch using the default configuration, I observed a noticeable discrepancy in performance: an absolute reduction of roughly 5% and 2% in Recall@1 on the Nordland and SPED datasets, respectively. This variance suggests a divergence from the expected outcomes based on the initial benchmarks.

Following your advice, I also experimented with training and evaluating the system at a higher image resolution (322x322) and with full-precision settings. Despite these adjustments, the results mirrored the earlier findings, with lower recall persisting on both the SPED and Nordland datasets. The sketch below shows the kind of changes I made.
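Concretely, the variations looked like this (a sketch using common torchvision/Lightning knobs; these are not necessarily the repository's exact config names):

```python
import pytorch_lightning as pl
from torchvision import transforms as T

# Higher-resolution input pipeline (322x322 instead of the default).
transform = T.Compose([
    T.Resize((322, 322)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# Full precision instead of mixed precision.
trainer = pl.Trainer(accelerator="auto", precision=32)
```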

These observations make me wonder whether there are additional nuances or configurations, beyond the aggressive learning-rate strategy and the resolution adjustments, that could bridge the gap between the expected and actual performance metrics.

BinuxLiu commented 7 months ago

Hello @serizba,

The performance of the provided checkpoint is very impressive, which confirms the results of the paper. But I encountered a similar problem to @Ahmedest61 during the replication training, with almost the same experimental results as @Ahmedest61:

| | SPED | Nordland |
| --- | --- | --- |
| The pre-trained model | 92.09 | 76.49 |
| The replicated model | 90.94 | 70.07 |
| DINO-NetVLAD (8192) | 90.60 | 70.10 |

I am more concerned about two issues that arise from this:

1) The experimental results of our reproduced SALAD are similar to those of the NetVLAD method in the ablation experiment, making it impossible to determine which of SALAD and NetVLAD is better.

2) If an aggressive learning-rate setting can lead to such experimental differences, is the conclusion about how many network layers to freeze also questionable? (The freezing scheme, as I understand it, is sketched below.)
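For concreteness, this is the freezing scheme as I understand it (a sketch; the block count is a placeholder, since that choice is exactly what I am asking about):

```python
import torch

# Load a DINOv2 backbone from torch hub (ViT-B/14, as used by SALAD).
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14")

# Freeze everything, then unfreeze only the last few transformer blocks.
for p in backbone.parameters():
    p.requires_grad = False

num_trainable_blocks = 4  # placeholder for the value under discussion
for block in backbone.blocks[-num_trainable_blocks:]:
    for p in block.parameters():
        p.requires_grad = True
```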

Looking forward to your answer.

serizba commented 7 months ago

Hi @BinuxLiu

Indeed, there is a bit of noise in the training, which may produce slightly different results across runs. This is especially noticeable on Nordland, and it also happens with other models like MixVPR (as confirmed by the authors).
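As a rough sketch of what controlling that noise involves (generic PyTorch Lightning settings, not our exact training script; some CUDA kernels stay nondeterministic regardless, so small Recall@1 differences between runs can still appear):

```python
import torch
import pytorch_lightning as pl

# Fix the seeds for python, numpy and torch (including dataloader workers).
pl.seed_everything(42, workers=True)

# Ask for deterministic kernels and disable nondeterministic autotuning.
torch.backends.cudnn.benchmark = False
trainer = pl.Trainer(deterministic=True)
```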

Regarding your points:

1) One of the advantages of SALAD is that it easily allows for significantly smaller descriptors. We will soon update the camera-ready version of the paper with results for smaller descriptors (512, 2048). As shown in the Ablations table, NetVLAD quickly loses performance when dimensionality reduction is applied (a sketch of such a reduction is below).

2) We base our conclusions on our own empirical observations, although multiple runs of the methods may show subtle differences.
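For illustration, this is the kind of reduction that comparison refers to (a sketch assuming a PCA-style projection, which is my assumption about the ablation; descriptor sizes are placeholders):

```python
import numpy as np
from sklearn.decomposition import PCA

# Placeholder descriptors; the full descriptor is much larger than the
# reduced targets (512, 2048) mentioned above.
db_desc = np.random.randn(2000, 8448).astype(np.float32)
q_desc = np.random.randn(200, 8448).astype(np.float32)

# Fit the projection on database descriptors only, then apply it to both.
pca = PCA(n_components=512, whiten=True).fit(db_desc)
db_small = pca.transform(db_desc)
q_small = pca.transform(q_desc)

# L2-normalize before nearest-neighbour retrieval.
db_small /= np.linalg.norm(db_small, axis=1, keepdims=True)
q_small /= np.linalg.norm(q_small, axis=1, keepdims=True)
```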

Best

BinuxLiu commented 7 months ago

Hi @serizba

Thanks for your prompt and informative response.