shaido987 / riskloc

Implementation of RiskLoc, a method for localizing multi-dimensional root causes.

HotSpot method: interpretability of the PS score as a measure of root-cause confidence #5

Open mambasmile opened 2 years ago

mambasmile commented 2 years ago

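Hello, the PS method uses the ripple effect (RE) to measure the confidence that something is a cause. How should the principle behind the PS method be understood?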

image

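Many people's guess is something like the following: if an attribute value is the cause, then the change in the attribute value and the changes in its samples follow the ripple effect; and if the change in the attribute value and the changes in its samples follow the ripple effect, then the attribute value is the cause.

Is this understanding correct?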

shaido987 commented 2 years ago

Hello,

Although I know a bit of Chinese, I'm in no way fluent so I will answer in English.

Following the ripple effect property, we know that:

So we know that the above are properties of the true root cause. The problem now is to find which set of elements is the root cause. To do this we need to search through sets of elements and measure their likelihood of being the root cause (HotSpot uses the PS score to do this).

What is done in HotSpot is:

1) Assume a set of elements S is the root cause.
2) Change the real/actual values of all descendant leaf elements, i.e., the forecasting error in S is proportionally applied to the leaf elements. If S has a 20% forecast error then all leaf elements also have a 20% error.
3) If the adjusted values (a in the formula) are close to the actual values of the leaf elements (v), then S has a high potential score (PS). In the case where a == v, the distance between the two, d(v,a), will be 0 and the PS score will be 1.
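The three steps above can be sketched in a few lines of Python. This is only an illustration, not code from the riskloc repository: the function name `potential_score`, the mask argument `in_s`, and the sample numbers are hypothetical, and the exact form `PS = max(1 - d(v,a)/d(v,f), 0)` with Euclidean distance is an assumption based on the usual statement of the HotSpot score (the formula image in this thread is not available).

```python
import math

def potential_score(v, f, in_s):
    """Hypothetical sketch of a HotSpot-style potential score.

    v, f : actual and forecast values for every leaf element
    in_s : booleans, True where the leaf is a descendant of candidate S
    """
    # Step 1: treat S as the root cause and measure its aggregate error.
    f_s = sum(fi for fi, m in zip(f, in_s) if m)
    v_s = sum(vi for vi, m in zip(v, in_s) if m)
    if f_s == 0:
        raise ValueError("forecast of S must be non-zero")
    ratio = v_s / f_s  # e.g. 0.8 when S has a 20% forecast error

    # Step 2: spread that error proportionally over the leaves of S.
    # Leaves outside S keep their forecast value.
    a = [fi * ratio if m else fi for fi, m in zip(f, in_s)]

    # Step 3: compare the adjusted values a with the actual values v.
    d_va = math.dist(v, a)
    d_vf = math.dist(v, f)
    if d_vf == 0:
        return 0.0  # assumption: no deviation from forecast, nothing to explain
    return max(1.0 - d_va / d_vf, 0.0)

# Made-up data: the first two leaves belong to S and both drop by 20%.
f = [100.0, 50.0, 200.0, 80.0]
v = [80.0, 40.0, 200.0, 80.0]
score = potential_score(v, f, [True, True, False, False])
print(score)
```

Because the adjusted values match the actual values here, d(v,a) is (numerically) zero and the score comes out at essentially 1, which is the a == v case described in step 3.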

The key idea is that a root cause in multi-dimensional data like this will affect all the descendant elements evenly. This is what the PS score (and GPS in Squeeze, NPS in AutoRoot, and partly the risk score in RiskLoc) try to measure.

I hope the above helped a bit in understanding. If you have an interest in this work, consider starring the GitHub repository.

mambasmile commented 2 years ago

Thanks.

But in reality there are situations where S decreases by 20% while e does not necessarily decrease by 20%, so the ripple effect has certain limitations. Do you know which scenarios the ripple effect is suitable for?

shaido987 commented 2 years ago

I assume e is a leaf element of S? Since S decreased by 20%, that 20% needs to come from somewhere, and that somewhere is the leaf elements of S (since together they make up S). For S to have a forecast error of 20%, the leaf elements (as an aggregate, i.e., together) must also have a forecast error of 20% due to the nature of the multi-dimensional problem.
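The aggregate argument can be checked with a little arithmetic. The sketch below uses made-up forecast and actual values for three hypothetical leaf elements of S. The individual leaves do not all have a 20% error, but their forecast-weighted average error necessarily equals the error of S, because S is just the sum of its leaves.

```python
# Hypothetical leaf elements of S (values invented for illustration).
forecast = [100.0, 60.0, 40.0]
actual = [70.0, 54.0, 36.0]

def relative_error(f, v):
    """Relative forecast error, e.g. 0.2 for a 20% drop below forecast."""
    return (f - v) / f

# Per-leaf errors differ from one another.
leaf_errors = [relative_error(f, v) for f, v in zip(forecast, actual)]

# Error of S itself, computed on the summed values.
s_error = relative_error(sum(forecast), sum(actual))

# Forecast-weighted average of the leaf errors.
weighted = sum(f * e for f, e in zip(forecast, leaf_errors)) / sum(forecast)

print(leaf_errors, s_error, weighted)
```

The identity holds for any data, so it says nothing by itself about whether S is a root cause. What the ripple effect adds is the stronger expectation that each individual leaf error is close to the aggregate one.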

If S is a root cause of an anomaly, then the leaf elements will have their forecasting error evenly distributed following the ripple effect. If the forecasting error is more randomly distributed among the leaf elements, then it's less likely that S is the root cause. The above is also the assumption of the ripple effect. So it's suitable in situations where you believe that prediction errors in the root cause elements will be evenly distributed (in practice this seems to work quite well).

In practice, I found that the most difficult step is to get accurate forecasting values for all leaf elements. Since these are usually quite fine-grained, they don't actually have much data and any forecasts are often inaccurate. This can skew the results.

mambasmile commented 2 years ago

Thanks for your answer. If an attribute value is the root cause and drops by 20%, the samples corresponding to that attribute value should change evenly by 20%; this is the ripple effect theory. I personally think that the generality of this theory is not particularly strong.

For example, the following figure shows that province=Beijing is the root cause. The KPI corresponding to province=Beijing has dropped by 40%. The first sample (Province=Beijing, ISP = Mobile) has dropped by 60%, while the second sample (Province=Beijing, ISP = Unicom) does not change. The ripple effect does not hold here.

image

shaido987 commented 2 years ago

Actually, I would say that it does work; however, the true root cause is not Beijing. In the example, (beijing, unicom) is normal, so it does not make much sense to say that the whole (beijing, *) is abnormal. Instead, the root cause that best explains the anomaly should be (beijing, mobile). Note that both (shanghai, mobile) and (guangdong, mobile) are normal, so the root cause won't be (*, mobile).

So, even if (beijing, *) has dropped 40% and would by itself be considered abnormal, the location of the problem is actually the Mobile ISP in Beijing.
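To make this concrete, here is a small sketch that scores the three candidates on invented numbers shaped like the example: (beijing, mobile) drops 60%, (beijing, unicom) is unchanged, (beijing, *) drops 40% overall, and the other provinces are normal. The figure's actual values are not available, so every number below is hypothetical, as is the helper `potential_score`, which assumes the form `PS = max(1 - d(v,a)/d(v,f), 0)` with Euclidean distance.

```python
import math

def potential_score(v, f, in_s):
    """Hypothetical HotSpot-style score: how well does candidate S explain v?"""
    f_s = sum(fi for fi, m in zip(f, in_s) if m)
    v_s = sum(vi for vi, m in zip(v, in_s) if m)
    ratio = v_s / f_s  # aggregate actual/forecast ratio of S
    # Apply S's error proportionally to its leaves; others keep the forecast.
    a = [fi * ratio if m else fi for fi, m in zip(f, in_s)]
    d_vf = math.dist(v, f)
    if d_vf == 0:
        return 0.0
    return max(1.0 - math.dist(v, a) / d_vf, 0.0)

# Leaf elements as (province, isp) -> (forecast, actual); all values invented.
leaves = {
    ("beijing", "mobile"): (200.0, 80.0),     # 60% drop
    ("beijing", "unicom"): (100.0, 100.0),    # unchanged
    ("shanghai", "mobile"): (150.0, 150.0),
    ("shanghai", "unicom"): (120.0, 120.0),
    ("guangdong", "mobile"): (180.0, 180.0),
    ("guangdong", "unicom"): (90.0, 90.0),
}
keys = list(leaves)
f = [leaves[k][0] for k in keys]
v = [leaves[k][1] for k in keys]

# Candidate root causes; "*" matches any value of that attribute.
candidates = [("beijing", "*"), ("beijing", "mobile"), ("*", "mobile")]

def matches(leaf, cand):
    return all(c == "*" or c == x for x, c in zip(leaf, cand))

scores = {c: potential_score(v, f, [matches(k, c) for k in keys])
          for c in candidates}
for cand, ps in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(cand, round(ps, 3))
```

On these numbers (beijing, mobile) explains the deviation almost perfectly, (beijing, *) scores noticeably lower because it wrongly pushes an error onto the normal Unicom leaf, and (*, mobile) scores lowest because it spreads the error over the normal Shanghai and Guangdong leaves.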