microsoft / Kusto-Query-Language

Kusto Query Language is a simple and productive language for querying Big Data.
Apache License 2.0
562 stars 104 forks source link

Seeking Clarification: Calculation of outlier score in series_outlier method #136

Closed bhavnajindal closed 5 months ago

bhavnajindal commented 10 months ago

I'm integrating KQL into ClickHouse(https://github.com/ClickHouse/ClickHouse) and working on implementing the series_outlier method. The documentation mentions the use of "Tukey's Fences"/IQR to detect the outliers. While the documentation outlines the overall process, it lacks information on how the anomaly score is computed within this method.

Understanding the scoring algorithm is crucial for my implementation, and I believe it would greatly benefit the open-source community as well. Could you kindly provide more details regarding the calculation of scores within the series_outlier method.

adieldar commented 6 months ago

series_outliers() calculates the outliers score using a customized version of Tukey’s fence test, but instead of using the standard 25th and 75th percentiles for calculating the IQR (inter quantile range), we take 10th and 90th percentiles (or let the user specify both). The motivation for this is that in some scenarios we search for outliers in a metric where all values but the outliers are constant, so percentile(75) == percentile(25) resulting in zero IQR that cannot be used for calculating the outliers score (due to division by 0). Thus, we prefer to extend the low/high percentiles as much as possible but avoid taking outliers into the calculation of the IQR. Taking 10th and 90th percentiles still allows up to 10% anomalies on each side, which is wide enough for most scenarios as usually we expect a much lower likelihood of outliers.

Though we take different low/high percentiles we still want to get a score which is invariant to these borders, e.g. get the same K=1.5 for mild outliers for a metric whether we use 25th/75th or 10th/90th percentiles. For that reason, we need to normalize the score according to the IQR borders. To calculate the normalization factor, we have to model the relation between different percentiles; this can be done only if we assume some known distribution of the metric. In our case we assumed normal distribution which is the most common distribution for physical properties. Note it is not necessarily the most common distribution for Kusto metrics (that monitor cloud resources and IoT devices), but we could not think of any other preferred distribution.

We want to normalize such that ‘mild outlier’ whose score is k=1.5 using the standard 25th/75th percentiles will get the same score when using different percentiles, denoted by P-low and P-high. The score is 0 by definition if X is inside the IQR (i.e. X(P-low) <= X <= X(P-high)), so we should normalize values above X(P-high) or below X(P-low). Let’s start with high outliers.

The score calculation is:

  1. K = (X – X(P-high))/(X(P-high)-X(P-low)).

We have 2 equations:

  1. K = (X – X(P-high))/(X(P-high)-X(P-low)) * NormalizationFactor = (X – X(p75))/(X(P75)-X(P25))
  2. K = (X – X(P75))/(X(P75)-X(25)) = 1.5

When assuming normal distribution we can convert percentiles to Z-score which is based on standard deviations. From the Standard normal table - Wikipedia we can calculate the respective Z score for specific percentiles. For 25th and 75th percentiles the respective Z are (-0.676, 0.676), i.e. for normal distribution 50% [25th-75th] of the samples are in the range between (avg-0.676stdv, avg+0.676stdv). So we can transform the above equations to these:

  1. K = (X – Z(P-high))/(Z(P-high)-Z(P-low)) * NormalizationFactor = (X – Z(p75))/(Z(P75)-Z(P25))
  2. K = (X – Z(P75))/(Z(P75)-Z(25)) = 1.5

From (5) we can find X:

  1. 1.5 = (X – 0.676)/(0.676-(-0.676)) => X = 1.520.676+0.676 = 2.704

Solving (4):

  1. NormalizationFactor = (2.704 – 0.676) (Z(P-high)-Z(P-low))/((2.704-Z(P-high))(2*0.676))

If we take P-low=10th, P-high=90th then Z(P-low) = -1.2968, Z(P-high) = 1.2968, NormalizationFactor = (2.704-0.676)21.2968/((2.704-1.2968)20.676)=2.764639

We can do a similar calculation for low outliers.

bhavnajindal commented 6 months ago

Thank you @adieldar for this information.