openvinotoolkit / anomalib

An anomaly detection library comprising state-of-the-art algorithms and features such as experiment management, hyper-parameter optimization, and edge inference.
https://anomalib.readthedocs.io/en/latest/
Apache License 2.0

Why is the models' AUC lower than the AUC reported on the Papers With Code website? #540

Closed AndSonder closed 2 years ago

AndSonder commented 2 years ago

Why is the models' AUC lower than the AUC reported on the Papers With Code website?

Papers With Code URL: https://paperswithcode.com/sota/anomaly-detection-on-mvtec-ad

I found that most of the models can't reach the AUC reported on the Papers With Code website.

wooramkang commented 2 years ago

Benchmarks are generally useful, but the numbers in papers are not meaningful in isolation; what makes them meaningful is the comparison between methods. So if you really want to reproduce the numbers from the paper behind the model you are using, you have to match their experimental setup exactly: dataset, hyper-parameters, number of epochs, and so on. In practice this is hardly possible, because the authors may inject random noise and typically split the train/validation/test sets randomly.

Conclusion: getting numbers that differ from those reported in the papers is normal.
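To make the random-split point concrete, here is a stdlib-only sketch (not anomalib code, all scores are simulated): the same pool of anomaly scores yields a different measured AUROC depending on which random test subset you evaluate on.

```python
import random


def auroc(normal_scores, anomaly_scores):
    """Rank definition of ROC-AUC: the probability that a random anomaly
    scores higher than a random normal sample (ties count half)."""
    wins = 0.0
    for a in anomaly_scores:
        for n in normal_scores:
            if a > n:
                wins += 1.0
            elif a == n:
                wins += 0.5
    return wins / (len(anomaly_scores) * len(normal_scores))


# Simulated scores: anomalies tend to score higher, with some overlap.
rng = random.Random(0)
normals = [rng.gauss(0.0, 1.0) for _ in range(200)]
anomalies = [rng.gauss(1.5, 1.0) for _ in range(200)]

# Evaluate on different random subsets ("test splits") of the same pool:
# the reported AUROC shifts with the split even though nothing else changed.
aucs = []
for seed in range(3):
    split = random.Random(seed)
    test_n = split.sample(normals, 50)
    test_a = split.sample(anomalies, 50)
    aucs.append(auroc(test_n, test_a))
print([f"{v:.3f}" for v in aucs])
```

Since papers rarely publish their exact split seeds, some spread around the reported number is expected even with a faithful reimplementation.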

djdameln commented 2 years ago

As @wooramkang pointed out, it is often difficult to reproduce the exact numbers reported in a paper.

Differences in model and dataset initialization and the inherently random nature of neural network training make it close to impossible to reproduce the exact outcome of an experiment, especially when there are small implementation differences as well.

In addition, differences in hardware and software configuration may lead to small numerical differences in the outputs of the models. The model design and hyperparameter configuration chosen by the original authors will be optimized towards their specific setup, and this may not necessarily be optimal for other configurations. This may be one reason why we often see slightly lower performance numbers compared to the original paper.
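A tiny illustration of why hardware and software configuration can shift results even when the math is "the same": floating-point addition is not associative, so a different reduction order (a different GPU kernel, thread count, or library version) changes the low-order bits, and those differences can compound over training.

```python
# Same three numbers, two grouping orders, two different results.
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)
print(a, b)  # 0.6000000000000001 vs 0.6
assert a != b
```

This is also why frameworks expose deterministic-mode switches: they trade speed for a fixed reduction order, which narrows but does not always eliminate run-to-run variation.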