thuml / Anomaly-Transformer

About: Code release for "Anomaly Transformer: Time Series Anomaly Detection with Association Discrepancy" (ICLR 2022 Spotlight), https://openreview.net/forum?id=LzQQ89U1qm_

Validation wrongly performed on test_loader. Unfair evaluation. #27

Closed · dqgdqg closed this issue 2 years ago

dqgdqg commented 2 years ago

It seems that validation is performed on test_loader rather than vali_loader, which is unfair to some extent and would make the results slightly different.

https://github.com/thuml/Anomaly-Transformer/blob/72a71e5f0847bd14ba0253de899f7b0d5ba6ee97/solver.py#L196

Moreover, directly using thre_loader to select the threshold causes test-set leakage, since thre_loader is built on test_data rather than the validation data.

https://github.com/thuml/Anomaly-Transformer/blob/72a71e5f0847bd14ba0253de899f7b0d5ba6ee97/solver.py#L254

https://github.com/thuml/Anomaly-Transformer/blob/72a71e5f0847bd14ba0253de899f7b0d5ba6ee97/data_factory/data_loader.py#L66-L69
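
For concreteness, here is a minimal sketch of the change I have in mind. The loader attribute names mirror the repo's conventions, but the `vali()` body below is a placeholder, not the actual implementation:

```python
# Minimal sketch of the proposed fix in Solver.train() (solver.py).
# The loader attributes mirror the repo's naming; vali() is a placeholder
# standing in for the real loss computation.

class Solver:
    def __init__(self, train_loader, vali_loader, test_loader):
        self.train_loader = train_loader
        self.vali_loader = vali_loader    # held-out validation data
        self.test_loader = test_loader    # reserved for final testing only

    def vali(self, loader):
        # placeholder for computing the two association-discrepancy losses
        loss1, loss2 = 0.0, 0.0
        return loss1, loss2

    def train(self):
        # before: vali_loss1, vali_loss2 = self.vali(self.test_loader)
        vali_loss1, vali_loss2 = self.vali(self.vali_loader)  # validate on val split
        return vali_loss1, vali_loss2
```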

wuhaixu2016 commented 2 years ago

Thanks for your question!

Yes, you can split the dataset into three subsets: train, val, and test, and then change this dataloader to valid_loader.

In our code, the validation loss only affects early stopping. (1) All the baselines use the same early-stopping strategy, so the comparison is fair. (2) In fact, none of the experiments trigger early stopping at all; every training run continues to the last epoch. Thus, you can view our code as "we use the same hyper-parameters without any extra information from the validation set". The evaluation is fair.
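
For example, a rough sketch of the three-way split (the `get_loader_segment` helper and its arguments are assumed from data_factory/data_loader.py; depending on the dataset class, the 'val' slice may need to be re-derived from the train data):

```python
# Sketch of a three-way split using the repo's loader helper.
# The signature is assumed from data_factory/data_loader.py; check how each
# dataset class defines its 'val' slice before relying on it.
from data_factory.data_loader import get_loader_segment

data_path, dataset = './dataset/SMD', 'SMD'
win_size, batch_size = 100, 256

train_loader = get_loader_segment(data_path, batch_size=batch_size,
                                   win_size=win_size, mode='train', dataset=dataset)
vali_loader = get_loader_segment(data_path, batch_size=batch_size,
                                  win_size=win_size, mode='val', dataset=dataset)
test_loader = get_loader_segment(data_path, batch_size=batch_size,
                                  win_size=win_size, mode='test', dataset=dataset)
```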

dqgdqg commented 2 years ago

Thanks for your prompt reply!

In your code, you split the dataset into three subsets: train, val, and test. However, only train and test are actually used; the validation set is not used at all.

My concern is that the validation, the testing, and the threshold selection are all performed on the test set in your code. It would be better if the validation and the threshold selection were performed on the validation set, or on any set that does not overlap with the test set.

Please correct me if I misunderstood. Thanks.

wuhaixu2016 commented 2 years ago

Yes, you are right. It would be better if the validation and the threshold selection were performed on the validation set. You can re-split the train set that we used into train and validation sets.

(1) Since our paper focuses on the unsupervised setting, we ultimately merge the train and validation data to enlarge the dataset. (2) You can also select the threshold on the train set. In practice, the final results are the same regardless of which subset is used for threshold selection.
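
For instance, a sketch of selecting the threshold on the train set. The percentile rule mirrors the one in the released code, but `compute_energy` is a hypothetical helper standing in for the anomaly-criterion computation:

```python
import numpy as np

def select_threshold(train_energy: np.ndarray, anomaly_ratio: float) -> float:
    # Flag the top `anomaly_ratio` percent of criterion values as anomalies.
    # Using train-set energies only keeps the test set out of the decision.
    return np.percentile(train_energy, 100 - anomaly_ratio)

# Usage sketch (compute_energy is hypothetical):
# train_energy = compute_energy(model, train_loader)
# test_energy = compute_energy(model, test_loader)
# thresh = select_threshold(train_energy, anomaly_ratio=0.5)
# pred = (test_energy > thresh).astype(int)
```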

dqgdqg commented 2 years ago

Thanks for your explanation.