niki-amini-naieni / CountGD

Includes the code for training and testing the CountGD model from the paper CountGD: Multi-Modal Open-World Counting.
MIT License

Performance Reproduction Problem #16

Open eeeric-code opened 11 hours ago

eeeric-code commented 11 hours ago

Hi, this is great work! When reproducing this project, I am able to achieve performance close to that reported in the paper on the FSC147 val set. On the FSC147 test set, using the same checkpoint, I obtain MAE ≈ 11 and RMSE ≈ 100 with sam tt-norm=False, crop=False, remove-bad-exemplar=False, which is close to the paper's results (MAE=10.92, RMSE=99.58). However, with sam tt-norm=True, crop=True, remove-bad-exemplar=True, I only observe MAE=7+ and RMSE=80+, which differs from the paper's results (MAE=5.74, RMSE=24.09). Could you please advise on the potential reasons for this discrepancy, given that some of the reproduction results are consistent?

niki-amini-naieni commented 10 hours ago

Hi, thank you for your question. Is this for the pretrained checkpoint, or are you retraining the model?

niki-amini-naieni commented 10 hours ago

The results for the pretrained checkpoint should be identical to the paper results

eeeric-code commented 10 hours ago

I have retrained the model

eeeric-code commented 10 hours ago

With the retrained checkpoint, the results are close to the paper's results when sam tt-norm=True, crop=True, remove-bad-exemplar=True, but they differ from the paper's results when sam tt-norm=False, crop=False, remove-bad-exemplar=False. That's weird.

niki-amini-naieni commented 10 hours ago

Ah okay, got it. I have not released the training code yet, so I am not able to reproduce your results on my side, but I can still speculate. This may be due to high variance caused by a couple of examples in the test set with very high counts driving up the RMSE. You can check this by omitting examples with more than 900 objects when you calculate the test error. If this results in a significant improvement to the error, then this is probably the issue. To make the training code more robust to this issue, you could apply the adaptive cropping to the early stopping code, so that adaptive cropping is applied when the validation set error is evaluated during early stopping. Right now, there is some other source of non-determinism, other than the seed, in the posted code, and adaptive cropping is not applied during early stopping.
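For reference, here is a minimal sketch of the high-count check described above. It is not taken from the CountGD codebase; `counting_errors`, `pred_counts`, and `gt_counts` are hypothetical names standing in for per-image counts collected from your own evaluation loop.

```python
# Illustrative sketch only (not from the CountGD repo): recompute MAE/RMSE
# while omitting images whose ground-truth count exceeds a threshold,
# e.g. 900 objects.
import math

def counting_errors(pred_counts, gt_counts, max_gt_count=None):
    """Return (MAE, RMSE), optionally ignoring images with very high ground-truth counts."""
    errors = [
        abs(p - g)
        for p, g in zip(pred_counts, gt_counts)
        if max_gt_count is None or g <= max_gt_count
    ]
    mae = sum(errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return mae, rmse

# Toy numbers for illustration; replace with your per-image test-set counts.
pred_counts = [12, 48, 1500, 7]
gt_counts = [10, 50, 2300, 8]

print(counting_errors(pred_counts, gt_counts))                    # full test set
print(counting_errors(pred_counts, gt_counts, max_gt_count=900))  # high-count images removed
```

If the RMSE drops sharply once the images with more than 900 objects are excluded, the gap is likely driven by those few outliers rather than a systematic training problem.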

eeeric-code commented 10 hours ago

Thanks a lot! I will try it.

niki-amini-naieni commented 10 hours ago

RE: With the retrained checkpoint, the results are close to the paper's results when sam tt-norm=True, crop=True, remove-bad-exemplar=True, but they differ from the paper's results when sam tt-norm=False, crop=False, remove-bad-exemplar=False. That's weird.

The main results in the paper have sam_tt_norm=True and remove_bad_exemplar=True, so it is not weird. We report results without these options in the appendix pasted below: [screenshot of the appendix results table]

niki-amini-naieni commented 10 hours ago

The influence of these options is described in the appendix here: [screenshots of the relevant appendix sections]

eeeric-code commented 10 hours ago

RE: With the retrained checkpoint, the results are close to the paper's results when sam tt-norm=True, crop=True, remove-bad-exemplar=True, but they differ from the paper's results when sam tt-norm=False, crop=False, remove-bad-exemplar=False. That's weird.

The main results in the paper have sam_tt_norm=True and remove_bad_exemplar=True, so it is not weird. We report results without these options in the appendix pasted below:

Sorry, I made a mistake in my previous response. It should be: with the retrained checkpoint, the results are close to the paper's results when sam tt-norm=False, crop=False, remove-bad-exemplar=False, but they differ from the paper's results when sam tt-norm=True, crop=True, remove-bad-exemplar=True.

eeeric-code commented 10 hours ago

The influence of these options is described in the appendix here:

I have noticed that and conducted some ablation studies, but I still cannot find the factors affecting the results.

niki-amini-naieni commented 10 hours ago

Yes, so try turning on these parameters during the early stopping procedure to improve the robustness of the method to these settings. To reduce the variance of the method in general, look for other sources of non-determinism in the code (other than the seed) using this guide: https://pytorch.org/docs/stable/notes/randomness.html, and remove them as much as possible.
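As an illustration only, the settings below follow the linked PyTorch randomness notes; they are not part of the posted CountGD code, and `seed_everything` / `seed_worker` are hypothetical helper names. Note that `torch.use_deterministic_algorithms(True)` will raise an error for ops that have no deterministic implementation, so those may need to be handled case by case.

```python
import os
import random
import numpy as np
import torch

def seed_everything(seed: int = 42) -> None:
    # Seed Python, NumPy, and PyTorch (CPU and all CUDA devices).
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

# Required by cuBLAS for deterministic GEMMs on CUDA >= 10.2.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
seed_everything(42)
torch.use_deterministic_algorithms(True)
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True

# DataLoader workers also need seeding so data augmentation is reproducible.
def seed_worker(worker_id: int) -> None:
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

g = torch.Generator()
g.manual_seed(42)
# loader = torch.utils.data.DataLoader(dataset, worker_init_fn=seed_worker, generator=g, ...)
```

DataLoader workers and CUDA kernels are common hidden sources of non-determinism, which is why the worker seeding and the cuBLAS workspace variable are included alongside the global seed.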

eeeric-code commented 9 hours ago

Thanks! Let me check.