salesforce / Merlion

Merlion: A Machine Learning Framework for Time Series Intelligence
BSD 3-Clause "New" or "Revised" License

[FEATURE REQUEST] Unsupervised Evaluation Metrics #131

Open anton164 opened 2 years ago

anton164 commented 2 years ago

Is your feature request related to a problem? Please describe.

The current evaluation metrics in evaluate/anomaly.py assume that a ground truth is available. However, in many time series anomaly detection problems there is no ground truth.

It would be great if the Merlion evaluation base classes were more general and supportive of this use-case. As of now we effectively have to implement our own evaluation methods.

Describe the solution you'd like

Ideally, methods and classes such as TSADEvaluator.evaluate, TSADScoreAccumulator and accumulate_tsad_score should not assume that there is a ground truth; other interfaces in the Merlion package typically take test labels as an optional argument. Similarly, the evaluation classes should be able to compute unsupervised descriptive statistics if a ground truth is not passed.
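
To make the request concrete, here is a minimal sketch of what "labels as an optional argument" could look like. It is plain Python and does not use or mirror any Merlion API: the function name `describe_alarms`, its parameters, and the choice of statistics are all hypothetical, and the scores are assumed to be a plain list of floats that are thresholded on their absolute value.

```python
from typing import Optional, Sequence


def describe_alarms(
    scores: Sequence[float],
    threshold: float,
    labels: Optional[Sequence[int]] = None,
) -> dict:
    """Summarize detector output; add supervised metrics only if labels exist.

    Hypothetical helper for illustration, not part of Merlion.
    """
    if labels is not None and len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")

    # A point is flagged when the magnitude of its score reaches the threshold.
    flags = [abs(s) >= threshold for s in scores]

    # Each contiguous run of flagged points counts as one alarm window.
    windows = sum(
        1 for i, f in enumerate(flags) if f and (i == 0 or not flags[i - 1])
    )

    # Unsupervised descriptive statistics: always available.
    summary = {
        "num_points": len(flags),
        "num_flagged": sum(flags),
        "flagged_fraction": sum(flags) / len(flags) if flags else 0.0,
        "num_alarm_windows": windows,
    }

    # Supervised pointwise metrics: only computed when labels are passed.
    if labels is not None:
        tp = sum(1 for f, y in zip(flags, labels) if f and y)
        fp = sum(1 for f, y in zip(flags, labels) if f and not y)
        fn = sum(1 for f, y in zip(flags, labels) if not f and y)
        summary["precision"] = tp / (tp + fp) if tp + fp else 0.0
        summary["recall"] = tp / (tp + fn) if tp + fn else 0.0

    return summary
```

Calling `describe_alarms(scores, 3.0)` returns only the descriptive statistics, while passing `labels=` as well adds pointwise precision and recall to the same dictionary.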

aadyotb commented 2 years ago

@anton164 Thanks for the comment. You highlight a fundamental challenge with anomaly detection -- often, ground truth labels are unavailable. But in my experience, the most common metrics people use to evaluate anomaly detection algorithms are the ones supported in Merlion, all of which require ground truth labels. If you have (1) specific unsupervised metrics in mind, and (2) a compelling use case for them, you are welcome to open a pull request adding them to the repo, and I can review it. But for the time being, I'm not sure how useful these unsupervised metrics would really be.

anton164 commented 2 years ago

Thanks for your prompt reply @aadyotb. I might do that to demonstrate what I mean. Which classes would you recommend I extend for that demonstration? From a design perspective, the TSADEvaluator, which does historical analysis, is "coupled" to ground truth label evaluation, so maybe I'll implement another version of it that isn't.

From an unsupervised perspective it would be useful to have a simple way to evaluate the following metrics:

As you point out, GT labels are often unavailable, so it's surprising to me that Merlion, which promises to be a complete framework for TS anomaly detection, does not have any guidance here. Happy to try to incorporate some ideas :)

One flow I would like to support is self-supervision using Merlion:

  1. Run unsupervised detection using a suite of simple models
  2. Compare metrics & inspect detected anomalies to identify the unsupervised detector that is the best starting point
  3. Treat the "best" predictions in step 2 as a fuzzy ground truth and tune an advanced model

aadyotb commented 2 years ago

Thanks for clarifying. From an implementation perspective, I'd suggest leaving TSADEvaluator unchanged. It would probably be much simpler to just extend TSADScoreAccumulator. For the specific metrics you mention, you might even be able to get away with something like accumulate_tsad_score(ground_truth=scores, predict=scores) and just examine the true positive/negative statistics accumulated.

For distribution statistics, one potentially interesting direction would be to characterize how much the test scores deviate from a standard normal distribution, since calibration reshapes the distribution of training scores to look like a standard normal (note that this is more sophisticated than mean/variance normalization). So if the test scores don't look like they were drawn from a standard normal, this could be an indicator of distribution shift over time.
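
One way to put a number on that deviation is the Kolmogorov-Smirnov statistic, i.e. the largest gap between the empirical CDF of the test scores and the standard normal CDF. The sketch below is only an illustration of that idea, not an existing Merlion feature: the function name `ks_distance_from_std_normal` is hypothetical, it uses only the Python standard library, and it assumes the calibrated test scores have already been extracted into a plain list of floats.

```python
from statistics import NormalDist
from typing import Sequence


def ks_distance_from_std_normal(scores: Sequence[float]) -> float:
    """Kolmogorov-Smirnov statistic between the empirical CDF and N(0, 1).

    Hypothetical helper for illustration, not part of Merlion.
    Returns a value in [0, 1]; larger means further from standard normal.
    """
    if not scores:
        raise ValueError("need at least one score")

    std_normal = NormalDist(mu=0.0, sigma=1.0)
    ordered = sorted(scores)
    n = len(ordered)

    distance = 0.0
    for i, x in enumerate(ordered):
        cdf = std_normal.cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x, so the gap to the
        # reference CDF has to be checked on both sides of the jump.
        distance = max(distance, cdf - i / n, (i + 1) / n - cdf)
    return distance
```

A value near 0 means the test scores still look standard normal; a value that grows over successive test windows would be consistent with the distribution shift described above. Any alerting threshold on this statistic would be a judgment call and is not specified here.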

I'm much more hesitant to support the self-supervised labeling approach. In practice, time series anomalies vary widely (raw spikes/dips, changes in trend, deviations from standard seasonal patterns, ...). When dealing with multivariate time series, things get even more complex. Simple models often either fail to detect these more complex anomalies, or have low precision when doing so. And in many cases, users care about detecting one type of anomaly but not another. Beyond getting actual labels (and even that can be controversial), I unfortunately don't have a great answer for this problem, and I haven't seen one in the literature either.

anton164 commented 2 years ago

Thanks for sharing your thoughts @aadyotb! I will give it a try and report back once I have a demo in Merlion.