Congrats on noticing the issue! You are right, the difference in results is due to the authors of SFRS picking the best model on the test set, as they admitted here as well. To get the results in our paper we used the authors' code, following all their recommendations (i.e. starting from the MatConvNet model weights instead of the Torchvision ones, and using the initial cluster centers for the VLAD layer), and we performed a fair set of experiments (3 runs per experiment), without selecting the model on the test set.
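For concreteness, here is a minimal sketch (not the paper's or the SFRS authors' actual code) of the protocol described above: the checkpoint is chosen by recall on the validation set, never on the test set, and the reported number is the mean over several independent runs. The `train_one_epoch`, `val_recall`, and `test_recall` callables are hypothetical stand-ins for the benchmark's training and evaluation routines.

```python
import statistics
from typing import Callable, List


def run_experiment(train_one_epoch: Callable[[], None],
                   val_recall: Callable[[], float],
                   test_recall: Callable[[], float],
                   num_epochs: int = 30) -> float:
    """Train for num_epochs, select the checkpoint by validation recall,
    and return the test recall of that val-selected checkpoint."""
    best_val, test_at_best_val = -1.0, 0.0
    for _ in range(num_epochs):
        train_one_epoch()
        v = val_recall()
        if v > best_val:  # model selection / early stopping on the val set only
            best_val, test_at_best_val = v, test_recall()
    return test_at_best_val


def report(run_scores: List[float]) -> str:
    """Main-table numbers: mean (and std) over independent runs, e.g. 3 seeds."""
    return f"R@1: {statistics.mean(run_scores):.1f} +/- {statistics.stdev(run_scores):.1f}"
```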
Thanks for the clarification, it is clear now. I hope more people working in this field will follow your recommendations on reproducibility. It will show us what REALLY works and help the field advance.
I know the issue is closed, but I want to add that I have trained SFRS on my own and obtained results similar to those in your paper.
First, I should say: really nice work. I have some questions about Table 3 in your paper.
For the SARE and SFRS model training, did you use the authors' repos or your own re-implemented versions (which I did not find in the codebase)? The model in the SFRS authors' paper is also trained on Pitts30k and tested on the Tokyo 24/7 dataset, but the results are better. I am wondering whether this is a re-implementation issue or due to the fact that "we followed deep learning's best practices (average over multiple runs for the main results, validation/early stopping and hyperparameter search on the val set)". It seems the SFRS authors did pick the best model, as shown in https://github.com/yxgeee/OpenIBL/issues/2#issuecomment-686952506
Thanks.