ZhihuangLi1221 opened this issue 2 years ago
I also have similar questions. Is there a more reasonable way to set the `n_remove` parameter? Looking forward to the author's reply.
Hello @ZhihuangLi1221 and @chaochaobar,
Thanks for your interest.
`n_remove` is used to remove some outliers in the deviation scores to get a reasonable cutoff point. This cutoff point is then used to partition the data into an abnormal and a normal part. This way of finding the cutoff point assumes that the normal data is relatively evenly distributed around 0, with a few possible outliers that `n_remove` handles.
Illustrative example (blue dots are normal data while the three colors are concurrent anomalies with different root causes):
In the figure above, the dashed green line represents the minimum deviation score with 5 outliers removed (i.e., using `n_remove=5`), while the dashed red line is the maximum deviation score with 5 outliers removed. Since the minimum's absolute value is smaller than the maximum's absolute value, we determine that the anomalies are on the right-hand side of the plot (i.e., the real values of the anomalies are below the predicted values). The cutoff point is then the negation of the minimum value (the solid green line in the figure). You can refer to Algorithm 1 in the paper.
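For concreteness, here is a minimal sketch of this cutoff selection, assuming the deviation scores are in a flat array and that `n_remove` outliers are trimmed from each end; the function name is illustrative, and Algorithm 1 in the paper remains the authoritative definition:

```python
import numpy as np

def estimate_cutoff(deviation_scores, n_remove=5):
    """Sketch of the cutoff selection described above; see Algorithm 1
    in the paper for the exact procedure used by riskloc."""
    scores = np.sort(np.asarray(deviation_scores))
    # Trim n_remove potential outliers from each end before taking the extremes.
    trimmed_min = scores[n_remove]
    trimmed_max = scores[-(n_remove + 1)]

    if abs(trimmed_min) < abs(trimmed_max):
        # Anomalies lie on the positive (right-hand) side: mirror the trimmed minimum.
        return -trimmed_min
    # Anomalies lie on the negative side: mirror the trimmed maximum.
    return -trimmed_max
```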
For your questions:
Adjusting `n_remove` will remove outliers when computing the cutoff point, but those points are not excluded from the partition deemed abnormal (so all data points to the right of the solid green line in the figure above are still considered when localizing the root cause). So you can't use `n_remove` to remove rare anomalies/data points. Instead, you can try increasing the `pep_threshold` (proportional `ep_threshold`): a higher value will only return anomalies with higher explanatory power, which should remove smaller anomalies.
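As a rough illustration of the effect (the names and the exact formula below are placeholders, not riskloc's internals): only candidate root causes whose explanatory power reaches the proportional threshold are kept, so raising the threshold drops the smaller anomalies.

```python
def filter_by_pep(candidates, total_deviation, pep_threshold=0.02):
    """Illustrative only: keep candidates whose explanatory power is at least
    pep_threshold of the total deviation."""
    return [c for c in candidates
            if abs(c['explanatory_power']) >= pep_threshold * abs(total_deviation)]
```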
Alternatively, if you only want to consider larger aggregated elements (and not very fine-grained/specific anomalies), you could adjust the code to only run n layers deep by setting a maximum value here: https://github.com/shaido987/riskloc/blob/cf1531e28c9a978b6e7b119325cb5cf1c3563dd8/algorithms/riskloc.py#L97-L98
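A minimal sketch of such a depth cap, assuming the search enumerates attribute combinations layer by layer (`dimensions`, `max_layer`, and the loop below are placeholders to adapt to the linked lines, not the actual code):

```python
from itertools import combinations

dimensions = ['dc', 'isp', 'device']  # example attribute columns
max_layer = 2  # only consider aggregated elements up to 2 attributes deep

# Enumerate attribute combinations (cuboids) layer by layer, capped at max_layer.
cuboids = [dims
           for layer in range(1, min(max_layer, len(dimensions)) + 1)
           for dims in combinations(dimensions, layer)]
```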
Or, if you have some knowledge of what points should be removed, you can remove these as a preprocessing step before running riskloc.
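For example, a simple pandas preprocessing step (the column name, values, and file paths are placeholders for your own data):

```python
import pandas as pd

# Drop known rare/unwanted elements before running riskloc.
df = pd.read_csv('input.csv')
df = df[~df['region'].isin(['rare_region_a', 'rare_region_b'])]
df.to_csv('filtered_input.csv', index=False)
```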
Yes, the distribution of the normal data's deviation scores will affect the result. The cutoff point is computed under the assumption that the deviation scores are relatively evenly spread around 0. You could plot a figure similar to the one above to investigate whether the obtained cutoff point is reasonable and, if not, how it needs to be adjusted.
I created an example where the normal data has a long tail in the positive direction:
As you can see, the cutoff point is too conservative, and a lot of the normal data points will be considered when computing the potential root cause, which may affect the accuracy. You could look at the clustering methods used in Autoroot and Squeeze (KDE clustering / using a histogram) and try to adapt them to return a single cutoff point to see if they work better for your data. Or, as a first step, you can set a fixed value (e.g., 1 in the long-tailed data figure above).
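For reference, here is a rough valley-finding sketch of a KDE-based cutoff (a heuristic starting point only, not Squeeze's or Autoroot's actual clustering):

```python
import numpy as np
from scipy.stats import gaussian_kde

def kde_cutoff(deviation_scores, grid_size=1000):
    """Heuristic: place the cutoff at the first density valley to the right
    of the main (normal) mode of the deviation scores."""
    scores = np.asarray(deviation_scores)
    density = gaussian_kde(scores)
    grid = np.linspace(scores.min(), scores.max(), grid_size)
    values = density(grid)

    main_mode = int(np.argmax(values))
    # Walk right from the main mode until the density starts rising again,
    # i.e., the valley separating normal scores from the anomalous cluster.
    for i in range(main_mode + 1, grid_size - 1):
        if values[i] < values[i + 1]:
            return float(grid[i])
    return float(grid[-1])  # no valley found; fall back to the maximum score
```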
Hi @shaido987 ,
Really appreciate your reply, it helps a lot.
Hi,
I hope you are well.
When I used riskloc on my dataset, I noticed that it can find the root cause precisely. However, my goal is to find the anomalies that occur more frequently, so I consider the rare root causes it finds to be outliers. I tried increasing the value of `n_remove`, but still did not get the result I expected.
Also, when I decreased `n_remove` to 1, the cutoff value shifted a lot and the output returned null. When I did the same thing on another dataset, the result was not affected. Comparing the distributions of the measurements in the two datasets, the first looks more like a normal distribution, while the second looks like a long-tailed distribution.
Here are my questions:
I am looking forward to your reply.