opensearch-project / anomaly-detection

Identify atypical data and receive automatic notifications
https://opensearch.org/docs/latest/monitoring-plugins/ad/index/
Apache License 2.0

[RFC] Semi-supervised anomaly detection #562

Open kaituo opened 2 years ago

kaituo commented 2 years ago

What problems are you trying to solve?

The anomaly detection (AD) plugin creates machine learning (ML) models that can adapt to changes in data distribution over time without code changes. But few ML models are perfect. Misclassifications either falsely alert customers to non-issues or miss critical events. Anomalies are also personal: what is anomalous for one user might be normal for another. It would help if users could mark anomalies as false positives or as undetected, equipping the AD plugin with fine-grained domain knowledge. With such feedback, misclassifications can be communicated to the anomaly detection system, which can then adjust so that they are less likely to recur. Today, OpenSearch customers cannot provide feedback to the AD plugin while a detector is running.

What are you proposing?

Semi-supervised anomaly detection (SSAD) sets up a domain-specific feedback loop. First, a user notices an error and tells AD what is wrong. AD then addresses the problem in multiple ways to improve model performance. Next, the user reviews the results of different changes to hyperparameters or feature weights. Finally, the loop iteration finishes when the user adopts a change or when no change helps.

What is the user experience going to be?

Users may not be sure that a particular detector configuration is what they want. They make sense of the underlying data space by experimenting with configurations, backtracking based on anomaly results, and rewriting their detector configuration, aiming to discover relevant anomalies. It is fundamentally a multi-step process in which the user's interests are specified as feedback, including:

- The model produces too many spurious warnings.
- The model misses too many anomalies.
- The data point is not an anomaly (false positive).
- One anomaly is undetected (false negative).
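To make these feedback types concrete, here is a minimal sketch of how they could be represented. The names (`FeedbackType`, `FeedbackRecord`) and fields are illustrative assumptions, not part of the current plugin API:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class FeedbackType(Enum):
    """Coarse- and fine-grained feedback a user could give on anomaly results."""
    TOO_MANY_FALSE_ANOMALIES = "too_many_false_anomalies"    # spurious warnings
    TOO_MANY_MISSED_ANOMALIES = "too_many_missed_anomalies"  # undetected anomalies
    FALSE_POSITIVE = "false_positive"    # this data point is not an anomaly
    FALSE_NEGATIVE = "false_negative"    # this anomaly was undetected


@dataclass
class FeedbackRecord:
    """One piece of user feedback tied to a detector and, optionally, a result."""
    detector_id: str
    feedback_type: FeedbackType
    result_id: Optional[str] = None   # set for per-result feedback (the Q1.B path below)
    entity: Optional[dict] = None     # entity key/values for an HCAD detector
```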

The workflow of our feedback loop is as follows.

After running a detector through preview or real-time/historical analysis, AD presents anomaly results to the user, who can review the results' relevance. For example, we can ask the user to quantify the number of false and undetected anomalies.

Q.1: How many false anomalies do you see? A: 0 B: 1~3 C: more than 3

If the user selects Q1.A, we then ask

Q.2: How many anomalies are undetected? A: 0 B: 1~3 C: more than 3

If the user selects Q1.B, we direct them to vote individual detection results up or down and show revised result graphs. The user can modify their feedback on previously seen samples.

If the user selects Q1.C, it is not easy for them to label each result they disagree with, and the situation indicates that something fundamental is wrong with the whole detector. Therefore, we direct them to multiple result graphs derived from different retrained models. For instance, graph one is derived by increasing the detector interval, while graph two is derived by changing each feature's weight. We ask users which chart looks best and change the detector configuration accordingly. If done right, users can see more actionable anomaly detection results from the improved models. Unfortunately, sometimes there is no good fix (e.g., noisy input data), and users need to reconfigure the detector by adding more filters on the data or using different features.
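The branching above can be summarized as a small routing step. The sketch below is hypothetical triage logic for illustration, not plugin code:

```python
from typing import Optional


def route_feedback(q1_answer: str, q2_answer: Optional[str] = None) -> str:
    """Map the questionnaire answers to a follow-up action.
    Encoding (hypothetical): "A" = 0, "B" = 1~3, "C" = more than 3."""
    if q1_answer == "A":
        # No false anomalies reported; Q2 asks about undetected anomalies.
        if q2_answer in (None, "A"):
            return "no_action"               # the user is satisfied with the results
        if q2_answer == "B":
            return "label_missed_anomalies"  # point-wise false-negative labeling
        return "compare_retrained_models"    # many misses: show graphs from retrained models
    if q1_answer == "B":
        return "label_false_positives"       # up/down votes on individual results
    # Q1.C: too many false anomalies to label one by one; something fundamental is off.
    return "compare_retrained_models"        # e.g., longer interval or new feature weights
```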

Why should it be built? Any reason not to?

The quality of an AD model depends on the data the model is trained on and on the model hyperparameters. As a general anomaly detection pipeline, we make no assumptions about the domain of the input data and use parameter defaults that fit everyday scenarios. Users can improve the quality of an AD model by exploring features or tuning parameters, but that requires a lot of manual effort and ML knowledge. SSAD is a more economical way (and, for users without ML knowledge, a necessary one) for users to participate in model improvement.

What will it take to execute?

We need to set up a feedback loop that assimilates customer feedback as described in the user experience section above. This project entails new infrastructure to collect user inputs, run quality checks, and retrain models.
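As one example of what a quality check could mean, the sketch below (reusing the hypothetical `FeedbackRecord` and `FeedbackType` from earlier) drops per-result labels that contradict earlier labels on the same result. The real checks would likely be richer:

```python
from typing import List


def passes_quality_check(record: FeedbackRecord, existing: List[FeedbackRecord]) -> bool:
    """Toy quality gate: reject per-result feedback that directly contradicts an
    earlier label on the same result of the same detector."""
    contradictions = {
        FeedbackType.FALSE_POSITIVE: FeedbackType.FALSE_NEGATIVE,
        FeedbackType.FALSE_NEGATIVE: FeedbackType.FALSE_POSITIVE,
    }
    opposite = contradictions.get(record.feedback_type)
    if opposite is None or record.result_id is None:
        return True  # coarse-grained feedback would be checked elsewhere
    return not any(
        f.detector_id == record.detector_id
        and f.result_id == record.result_id
        and f.feedback_type == opposite
        for f in existing
    )
```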

Addressing the too many/few anomalies feedback would entail the following changes.

It is equally important to allow customers to label examples the model does poorly on. We then mark the positive or negative example and its K nearest neighbors. The anomaly score of a query point will take the scores of labeled neighbors into account.
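One way the labeled neighbors could influence scoring is sketched below. The blending scheme, the [0, 1] score range, and the +1/-1 label encoding are assumptions for illustration, not the plugin's actual algorithm:

```python
import numpy as np


def adjusted_score(
    query: np.ndarray,           # feature vector of the point being scored
    model_score: float,          # raw score from the unsupervised model, assumed in [0, 1]
    labeled_points: np.ndarray,  # shape (n, d): feature vectors the user labeled
    labels: np.ndarray,          # shape (n,): +1 = confirmed anomaly, -1 = false positive
    k: int = 5,
    weight: float = 0.3,
) -> float:
    """Blend the model's score with the consensus of the K nearest labeled
    examples: nearby confirmed anomalies push the score up, nearby
    user-marked false positives push it down."""
    if len(labeled_points) == 0:
        return model_score
    dists = np.linalg.norm(labeled_points - query, axis=1)
    nearest = np.argsort(dists)[: min(k, len(dists))]
    neighbor_vote = float(labels[nearest].mean())  # in [-1, 1]
    return float(np.clip(model_score + weight * neighbor_vote, 0.0, 1.0))
```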

We also provide customization for high-cardinality anomaly detection (HCAD) detectors, which maintain multiple models. We show the top anomalies in terms of grade or number of anomalies, but tracking anomalies across a large number of entities and across time via user inspection is impractical. So it makes sense to consolidate related anomalies based on feature values and attributions, letting users view a live clustered picture of all anomalies across time. For example, a CPU usage of 98% on entity one and a CPU usage of 96% on entity two can land in the same cluster and be graphed as one anomaly. The clustering changes in real time, and the number of clusters is not fixed: a distance cutoff determines whether points are neighbors.

In the clustered view, a user can select an option like "don't show me results like this" or "show me more like this". These options indicate that the user considers the cluster a false positive or a true positive. With customers' consent, we can relay that feedback to entities in the cluster with similar feature values or attributions. We also post-process the patterns within a cluster and surface insights in the UX, for example, that most entities within the cluster have the prefix a.b in their names or share attribute x.
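To illustrate the distance-cutoff idea (the real consolidation would likely also weigh attributions and run incrementally over streaming results), here is a minimal greedy clustering sketch:

```python
import numpy as np


def cluster_anomalies(points: np.ndarray, cutoff: float) -> list:
    """Greedy online clustering: assign each anomaly's feature vector to the
    nearest existing cluster whose centroid is within `cutoff`; otherwise start
    a new cluster. The number of clusters is not fixed in advance."""
    centroids = []     # running cluster centroids
    counts = []        # points per cluster, for incremental centroid updates
    assignments = []   # cluster index for each input point
    for p in points.astype(float):
        best, best_dist = -1, cutoff
        for i, c in enumerate(centroids):
            d = np.linalg.norm(p - c)
            if d <= best_dist:
                best, best_dist = i, d
        if best == -1:
            centroids.append(p.copy())
            counts.append(1)
            assignments.append(len(centroids) - 1)
        else:
            counts[best] += 1
            centroids[best] += (p - centroids[best]) / counts[best]  # running mean
            assignments.append(best)
    return assignments
```

With a cutoff of, say, 0.05 on normalized CPU usage, the 98% and 96% readings from the example above would land in the same cluster.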

What are the remaining open questions?

There are three open questions.

Request for comments

We are looking for all forms of feedback! Some questions we would like to get your input on include.

elfisher commented 2 years ago

> The model produces too many spurious warnings. The model misses too many anomalies. The data point is not an anomaly (false positive). One anomaly is undetected (false negative).

Are these hard to spot with high dimensionality?

> After running a detector through preview or real-time/historical analysis, AD presents anomaly results to the user, who can review the results' relevance. For example, we can ask the user to quantify the number of false and undetected anomalies.

Would it make sense to have the user select the false positives? Maybe also have a spot to highlight where a false negative happened?

xinlamzn commented 2 years ago

> Next, the user reviews the results of different changes to hyperparameters or feature weights.

Will the customer review the parameters or the AD results from the updated parameters?

kaituo commented 2 years ago

> The model produces too many spurious warnings. The model misses too many anomalies. The data point is not an anomaly (false positive). One anomaly is undetected (false negative).

> Are these hard to spot with high dimensionality?

You meant our high-cardinality anomaly detector results, right? If yes, yeah, those are harder to spot since the customer is dealing with many errors.

> After running a detector through preview or real-time/historical analysis, AD presents anomaly results to the user, who can review the results' relevance. For example, we can ask the user to quantify the number of false and undetected anomalies.

> Would it make sense to have the user select the false positives? Maybe also have a spot to highlight where a false negative happened?

Yes, we will have the user select the false positive/negative in the next step.

kaituo commented 2 years ago

> Next, the user reviews the results of different changes to hyperparameters or feature weights.

> Will the customer review the parameters or the AD results from the updated parameters?

They will review AD results from the updated parameters.