subeom527 opened 9 months ago
Yes, I agree with your points. Adding to what you listed above, one thing I found is that, with the evaluation method used in this work, even a randomly initialized model (with no training at all!) can yield good performance. You can test this simply by commenting out the torch.load lines that load the checkpoint in the test function.
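To make that claim concrete, here is a hedged, self-contained sketch (not the repo's code: the tiny autoencoder, the synthetic data, the 99th-percentile threshold, and the `point_adjust` helper are all illustrative stand-ins). An untrained PyTorch model whose scores carry no information about the labels still ends up with a high F1 once the usual detection adjustment is applied:

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.metrics import precision_recall_fscore_support

torch.manual_seed(0)
rng = np.random.default_rng(0)

# Synthetic series: the labeled "anomaly" segments are statistically identical to
# the rest of the signal, so no anomaly score can genuinely detect them.
n, d = 100_000, 25
x = rng.standard_normal((n, d)).astype(np.float32)
gt = np.zeros(n, dtype=int)
for _ in range(50):
    s = rng.integers(0, n - 300)
    gt[s:s + rng.integers(100, 300)] = 1

# Randomly initialized (untrained!) reconstruction model; its per-point
# reconstruction error serves as the anomaly score.
model = nn.Sequential(nn.Linear(d, 8), nn.ReLU(), nn.Linear(8, d))
with torch.no_grad():
    xt = torch.from_numpy(x)
    err = ((model(xt) - xt) ** 2).mean(dim=1).numpy()

# Flag the top 1% of scores (the percentile is a placeholder, not the repo's setting).
pred = (err > np.percentile(err, 99.0)).astype(int)

def point_adjust(gt, pred):
    """Mark a whole ground-truth segment as detected if any point in it is flagged."""
    adj, i = pred.copy(), 0
    while i < len(gt):
        if gt[i] == 1:
            j = i
            while j < len(gt) and gt[j] == 1:
                j += 1
            if adj[i:j].any():
                adj[i:j] = 1
            i = j
        else:
            i += 1
    return adj

for name, p in [("raw", pred), ("point-adjusted", point_adjust(gt, pred))]:
    prec, rec, f1, _ = precision_recall_fscore_support(
        gt, p, average="binary", zero_division=0)
    print(f"{name:15s} P={prec:.3f} R={rec:.3f} F1={f1:.3f}")
```

On this synthetic setup the raw point-wise F1 is near zero, while the adjusted F1 comes out far higher (typically above 0.8 for these segment lengths), purely as an artifact of the adjustment step rather than of anything the model learned.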
Moreover, leveraging the anomaly labels from the test set when computing the anomaly threshold seems simply wrong, regardless of whether some previous works did it the same way; precedent cannot justify a method that looks so obviously faulty. Besides, in the provided data_loader.py, the validation set is simply equal to the test set (except for the SMD data), i.e., there is no actual validation set, unlike what the paper describes.
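For contrast, a label-free protocol would carve a genuine validation split out of the training period and pick the threshold from validation scores alone, touching the test labels only when computing the final metrics. A minimal sketch, assuming percentile-based thresholding (the percentile value and the score distributions below are placeholders, not the repo's settings):

```python
import numpy as np

def select_threshold(val_scores: np.ndarray, percentile: float = 99.0) -> float:
    """Choose the anomaly threshold from validation scores only - no test labels involved."""
    return float(np.percentile(val_scores, percentile))

# Stand-ins for the anomaly scores a trained model would produce on a validation
# split carved out of the training period and on the held-out test set.
rng = np.random.default_rng(0)
val_scores = rng.gamma(shape=2.0, scale=1.0, size=10_000)
test_scores = rng.gamma(shape=2.0, scale=1.0, size=10_000)

threshold = select_threshold(val_scores)
test_pred = (test_scores > threshold).astype(int)  # applied once, without peeking at test labels
```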
Unfortunately, this problem is propagating through the community: I found a few other works that adopt this evaluation method for the anomaly detection task, and they exhibit exactly the same issue as this work. I believe the authors of those works have not noticed the incorrectness of the method.
Be aware of the works that followed the same evaluation method (I found the two below):

- "TimesNet: Temporal 2D-Variation Modeling for General Time Series Analysis", ICLR 2023.
- "One Fits All: Power General Time Series Analysis by Pretrained LM", NeurIPS 2023 (same authors).
I agree with your point of view; the arguments in the references cited by the author are not sufficient. Given the rigor expected of scientific research, I believe this evaluation method should no longer be used.
The relevant passage from the original text is as follows:
> In real applications, the human operators generally do not care about the point-wise metrics. It is acceptable for an algorithm to trigger an alert for any point in a contiguous anomaly segment, if the delay is not too long. Some metrics for anomaly detection have been proposed to accommodate this preference, e.g., [22], but most are not widely accepted, likely because they are too complicated. We instead use a simple strategy: if any point in an anomaly segment in the ground truth can be detected by a chosen threshold, we say this segment is detected correctly, and all points in this segment are treated as if they can be detected by this threshold. Meanwhile, the points outside the anomaly segments are treated as usual. The precision, recall, AUC, F-score and best F-score are then computed accordingly. This approach is illustrated in Fig. 7.
This method achieves its excellent performance primarily through the combination of 'detection adjustment' and 'softmax'. In solver.py, the author applies softmax when computing the metric at lines 319 and 280. This produces, within each window, at least one value close to 1, i.e., at least one timestamp whose anomaly score is far larger than those of the other timestamps in the same window. Consequently, in the 'pred' output almost every window ends up flagged as containing at least one anomaly, and once 'detection adjustment' is applied, every contiguous anomaly segment touched by such a flag is labeled anomalous in its entirety. However, softmax is not appropriate here, because it cannot effectively model the relationship between different timestamps within a window; removing the softmax at lines 319 and 280 leads to a significant drop in performance.

As an alternative illustration, consider simply designating one timestamp as an anomaly every 100 timestamps and then applying 'detection adjustment': despite this trivially simple "detector", the results are still highly satisfactory (see the sketch below).
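Here is a hedged sketch of that check (synthetic segment labels and a `point_adjust` helper of my own that mirrors the adjustment strategy quoted above; the segment lengths and counts are chosen purely for illustration). A "detector" that blindly flags every 100th timestamp scores near zero point-wise, yet looks very strong after detection adjustment:

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

rng = np.random.default_rng(0)

# Synthetic ground truth: contiguous anomaly segments, as in typical benchmark datasets.
n = 100_000
gt = np.zeros(n, dtype=int)
for _ in range(50):
    s = rng.integers(0, n - 300)
    gt[s:s + rng.integers(100, 300)] = 1

# Trivial "detector": flag one timestamp every 100 steps, independent of the data.
pred = np.zeros(n, dtype=int)
pred[::100] = 1

def point_adjust(gt, pred):
    """Detection adjustment: if any point in a true segment is flagged, flag the whole segment."""
    adj, i = pred.copy(), 0
    while i < len(gt):
        if gt[i] == 1:
            j = i
            while j < len(gt) and gt[j] == 1:
                j += 1
            if adj[i:j].any():
                adj[i:j] = 1
            i = j
        else:
            i += 1
    return adj

for name, p in [("raw", pred), ("point-adjusted", point_adjust(gt, pred))]:
    prec, rec, f1, _ = precision_recall_fscore_support(
        gt, p, average="binary", zero_division=0)
    print(f"{name:15s} P={prec:.3f} R={rec:.3f} F1={f1:.3f}")
```

Because every ground-truth segment of length at least 100 necessarily contains one of the periodic flags, the adjusted recall is 1.0 by construction, and precision stays high simply because so few points are flagged; none of this reflects any detection ability.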
This problem is glaring and has already been pointed out by other commenters as well: https://github.com/thuml/Anomaly-Transformer/issues/4. In that thread you can easily see the large gap in results before and after the questionable "detection adjustment" code is added.
The author gives the following unclear and unconvincing answers to this:
-> The two papers you link to as evidence are not published in official journals, and the same author is credited on both of them. If so, please provide peer-reviewed papers published in recognized journals that support your claims. Even if you are right, if tuning a model's low performance up to numbers this high is an accepted practice in academia, then that practice should be eliminated.
-> Remember that real-time industrial data has no labels!! Especially when it comes to anomaly detection!