shaido987 / riskloc

Implementation of RiskLoc, a method for localizing multi-dimensional root causes.
MIT License

Question about the value of "n_remove" in riskloc #11

Open ZhihuangLi1221 opened 2 years ago

ZhihuangLi1221 commented 2 years ago

Hi,

I hope you are well.

When I used riskloc on my dataset, I noticed that it can precisely find the root cause. However, my goal is to find anomalies that occur more frequently, so I consider the rare root causes it finds to be outliers. I tried increasing the value of "n_remove", but still did not get the result I expected.

Also, when I decreased "n_remove" to 1, the "cutoff" value shifted a lot and the output returned null. When I did the same thing on another dataset, the result was not affected. I compared the distributions of the measurements of the two datasets: the first one is closer to a normal distribution, while the second one is more like a long-tailed distribution.

Here are my questions:

  1. Is adjusting n_remove the right way to achieve what I want? If yes, is there a more reliable way than setting the constant arbitrarily?
  2. Does the distribution of the measurements affect the performance of the algorithm?

I am looking forward to your reply.

chaochaobar commented 2 years ago

I have similar questions. Is there a more reasonable way to set the 'n_remove' parameter? Looking forward to the author's reply.

shaido987 commented 1 year ago

Hello @ZhihuangLi1221 and @chaochaobar ,

Thanks for your interest.

n_remove is used to remove some outliers from the deviation scores in order to get a reasonable cutoff point. This cutoff point is then used to partition the data into an abnormal and a normal part. This way of finding the cutoff point assumes that the normal data is distributed relatively evenly around 0, with a few possible outliers that n_remove handles. Here is an illustrative example (the blue dots are normal data, while the three other colors are concurrent anomalies with different root causes):

[figure: hard_example_edited_multi_line — deviation scores of normal data (blue) and three concurrent anomalies, with the trimmed min/max (dashed lines) and the cutoff (solid green line)]

In the figure above, the dashed green line represents the minimum deviation score with 5 outliers removed (i.e., using n_remove=5), while the dashed red line is the maximum deviation score with the same number of outliers removed (also 5). Since the minimum absolute value is smaller than the maximum absolute value, we determine that the anomalies are on the right-hand side of the plot (i.e., the real values of the anomalies are below the predicted values). The cutoff point is then the negation of the minimum value (the solid green line in the figure). You can refer to Algorithm 1 in the paper.
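In code, a minimal sketch of this cutoff selection could look like the following (a simplification of Algorithm 1; the actual implementation in riskloc.py differs in details):

```python
import numpy as np

def estimate_cutoff(deviation_scores, n_remove=5):
    """Simplified sketch of the cutoff selection described above.
    Requires more than 2 * n_remove scores."""
    scores = np.sort(np.asarray(deviation_scores))
    # Trim the n_remove most extreme scores on each side so that
    # isolated outliers do not dominate the min/max (dashed lines).
    trimmed_min = scores[n_remove]
    trimmed_max = scores[-(n_remove + 1)]
    if abs(trimmed_min) < abs(trimmed_max):
        # Anomalies lie on the positive side: mirror the trimmed
        # minimum around 0 to get the cutoff (solid green line).
        return -trimmed_min
    # Otherwise the anomalies lie on the negative side.
    return -trimmed_max
```

This also shows why a small n_remove on long-tailed data is fragile: with n_remove=1, a single extreme score can shift trimmed_min/trimmed_max, and thus the cutoff, by a lot, which matches the behaviour you observed.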

For your questions:

  1. Adjusting n_remove removes outliers when computing the cutoff point but does not affect which data points end up in the partition deemed abnormal (it will still consider all data points to the right of the solid green line in the figure above when localizing the root cause). So you can't use n_remove to remove rare anomalies/data points. Instead, you can try increasing the pep_threshold (proportional ep_threshold); this will only return anomalies with a higher explanatory power, which should filter out smaller anomalies (see the first sketch after this list).

    Alternatively, if you only want to consider larger aggregated elements (and not very fine-grained/specific anomalies), you could adjust the code to only run n layers deep by setting a maximum value here: https://github.com/shaido987/riskloc/blob/cf1531e28c9a978b6e7b119325cb5cf1c3563dd8/algorithms/riskloc.py#L97-L98

    Or, if you have some knowledge of what points should be removed, you can remove these as a preprocessing step before running riskloc.

  2. Yes, the distribution of the normal data's deviation scores will affect the result. The cutoff point is computed under the assumption that the deviation scores are spread relatively evenly around 0. You could plot a figure similar to the one above to check whether the obtained cutoff point is reasonable and, if not, how it needs to be adjusted.

    I created an example where the normal data has a long tail in the positive direction: [figure: long_tail — deviation scores with a long positive tail and the resulting, too conservative, cutoff]

    As you can see, the cutoff point is too conservative, and many normal data points will be considered when computing the potential root causes, which may affect the accuracy. You could look at the clustering methods used in Autoroot and Squeeze (KDE clustering / using a histogram) and try adapting them to return a single cutoff point to see if they work better for your data (see the second sketch after this list). Or, as a first step, you can set a fixed cutoff value (e.g., 1 in the long-tailed data figure above).
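To illustrate the effect of raising the explanatory-power threshold mentioned in point 1, here is a hypothetical post-filter; the (element, ep) tuple shape is assumed for illustration and is not riskloc's actual return type:

```python
# Keep only root-cause candidates whose explanatory power (the
# fraction of the total anomaly they account for) clears a threshold.
def filter_by_explanatory_power(candidates, ep_threshold=0.3):
    return [(element, ep) for element, ep in candidates if ep >= ep_threshold]

# Small anomalies fall below the threshold and are dropped:
candidates = [('a=a1&b=b2', 0.75), ('a=a3', 0.05)]
print(filter_by_explanatory_power(candidates))  # [('a=a1&b=b2', 0.75)]
```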
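For point 2, here is a rough sketch of how KDE-based clustering (as used in Squeeze) could be adapted to return a single cutoff point; picking the density valley closest to 0 is one possible heuristic, not something from the paper:

```python
import numpy as np
from scipy.stats import gaussian_kde

def kde_cutoff(deviation_scores, grid_size=512):
    scores = np.asarray(deviation_scores)
    # Estimate the density of the deviation scores.
    kde = gaussian_kde(scores)
    grid = np.linspace(scores.min(), scores.max(), grid_size)
    density = kde(grid)
    # Local density minima are candidate boundaries between the
    # normal cluster around 0 and any anomalous cluster.
    valleys = np.where((density[1:-1] < density[:-2]) &
                       (density[1:-1] < density[2:]))[0] + 1
    if len(valleys) == 0:
        return None  # unimodal density: fall back to the default cutoff
    # Use the valley closest to 0 as the single cutoff point.
    return grid[valleys[np.argmin(np.abs(grid[valleys]))]]
```

Unlike the min/max-based rule, this does not assume the normal scores are evenly spread around 0, so it may cope better with the long-tailed case, but you would want to sanity-check the chosen valley on a plot first.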

ZhihuangLi1221 commented 1 year ago

Hi @shaido987 ,

Really appreciate your reply, it helps a lot.