linkedin / isolation-forest

A Spark/Scala implementation of the isolation forest unsupervised outlier detection algorithm with support for exporting in ONNX format.

Approx quantile bug fix #4

Closed jverbus closed 4 years ago

jverbus commented 4 years ago

Fixed the rare unexpected behavior reported in https://github.com/linkedin/isolation-forest/issues/3, which was due to an issue with Spark's approxQuantile method.

I made two changes to work around the unexpected approxQuantile behavior.

1) The isolation forest model now accepts a contaminationError parameter, which allows the user to set approxQuantile's relativeError parameter. The default value is now 0.0, which forces an exact calculation.

2) The isolation forest model now calculates the observed contamination after training a model and compares it to the expected contamination set via model parameters. If the deviation falls outside the expected bounds (contamination +/- contaminationError), a warning is issued.

I also added a new unit test to validate the new relativeError = 0.0 case.
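The sanity check in change 2) amounts to comparing the observed outlier fraction against contamination +/- contaminationError. A minimal sketch of that logic, in Python for illustration (the function name and details are my assumptions, not the library's actual internals):

```python
# Illustrative sketch of the post-training contamination sanity check
# (change 2 above). Names are assumptions, not the library's actual code.
def contamination_warning(predicted_labels, contamination, contamination_error):
    """Return a warning string if the observed outlier fraction falls outside
    contamination +/- contamination_error, else None."""
    observed = sum(predicted_labels) / len(predicted_labels)
    if abs(observed - contamination) > contamination_error:
        return (f"Observed contamination is {observed}, which is outside of "
                f"the expected bounds of {contamination} +/- {contamination_error}.")
    return None

# The issue #3 failure mode: roughly 1 of ~700 points flagged instead of ~10%.
print(contamination_warning([1] + [0] * 698, 0.1, 0.001))       # warning fires
print(contamination_warning([1] * 70 + [0] * 630, 0.1, 0.001))  # no warning
```

With contaminationError = 0.0, any observed deviation beyond float rounding would trigger the warning, which is why the exact-quantile default pairs naturally with this check.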

Testing done:

The build passes: https://travis-ci.org/linkedin/isolation-forest/builds/591787128

If a nonzero contaminationError value is used (matching the old, inflexible behavior), I verified that a warning message is now displayed for the case reported in issue #3.

val vectorIndexerModel = vectorIndexer.fit(dfCastImputedAssembled)
val dfCastImputedAssembledIndexed = vectorIndexerModel.transform(dfCastImputedAssembled)

val isolationForest05 = new IsolationForest()
isolationForest05.setNumEstimators(100)
isolationForest05.setContamination(0.05)
isolationForest05.setFeaturesCol("indexedFeatures")
isolationForest05.setContaminationError(0.05 * 0.01)

val isolationForestModel05 = isolationForest05.fit(dfCastImputedAssembledIndexed)
val scores05 = isolationForestModel05.transform(dfCastImputedAssembledIndexed)

val isolationForest10 = new IsolationForest()
isolationForest10.setNumEstimators(100)
isolationForest10.setContamination(0.1)
isolationForest10.setFeaturesCol("indexedFeatures")
isolationForest10.setContaminationError(0.1 * 0.01)

val isolationForestModel10 = isolationForest10.fit(dfCastImputedAssembledIndexed)
2019-09-30 18:29:19 WARN  IsolationForest:66 - Observed contamination is 0.001430615164520744, which is outside of the expected bounds of 0.1 +/- 0.001.
val scores10 = isolationForestModel10.transform(dfCastImputedAssembledIndexed)
scala> scores05.agg(sum("predictedLabel")).show()
+-------------------+
|sum(predictedLabel)|
+-------------------+
|               35.0|
+-------------------+

scala> scores10.agg(sum("predictedLabel")).show()
+-------------------+
|sum(predictedLabel)|
+-------------------+
|                1.0|
+-------------------+

The scores10 case is wrong and should be ~70.

I also verified that with the new default (contaminationError = 0.0), the model does not issue a warning and correctly calculates the threshold for the case reported in issue #3.

val isolationForest05 = new IsolationForest()
isolationForest05.setNumEstimators(100)
isolationForest05.setContamination(0.05)
isolationForest05.setFeaturesCol("indexedFeatures")
// isolationForest05.setContaminationError(0.05 * 0.01)

val isolationForestModel05 = isolationForest05.fit(dfCastImputedAssembledIndexed)
val scores05 = isolationForestModel05.transform(dfCastImputedAssembledIndexed)

val isolationForest10 = new IsolationForest()
isolationForest10.setNumEstimators(100)
isolationForest10.setContamination(0.1)
isolationForest10.setFeaturesCol("indexedFeatures")
// isolationForest10.setContaminationError(0.1 * 0.01)

val isolationForestModel10 = isolationForest10.fit(dfCastImputedAssembledIndexed)
val scores10 = isolationForestModel10.transform(dfCastImputedAssembledIndexed)

scala> scores05.agg(sum("predictedLabel")).show()
+-------------------+
|sum(predictedLabel)|
+-------------------+
|               35.0|
+-------------------+

scala> scores10.agg(sum("predictedLabel")).show()
+-------------------+
|sum(predictedLabel)|
+-------------------+
|               70.0|
+-------------------+
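With contaminationError = 0.0, the behavior above comes down to taking an exact quantile of the score distribution: the cutoff is chosen so that exactly the top `contamination` fraction of scores is flagged. A toy sketch of that thresholding, in Python for illustration (names are my own, not the library's implementation):

```python
# Toy sketch of exact-quantile thresholding (the contaminationError = 0.0 case).
# Names are my own; this is not the library's implementation.
def exact_threshold(scores, contamination):
    """Return the score cutoff that flags the top `contamination` fraction."""
    ranked = sorted(scores, reverse=True)
    k = round(contamination * len(scores))  # number of points to flag
    return ranked[k - 1]                    # k-th highest score

scores = [i / 700 for i in range(700)]      # 700 toy scores
threshold = exact_threshold(scores, 0.1)
print(sum(1 for s in scores if s >= threshold))  # → 70, i.e. 10% of 700
```

An exact calculation makes the flagged count deterministic up to ties, which is why both the 0.05 and 0.10 cases now come out as expected (35 and 70).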

jverbus commented 4 years ago

For reference, here is an example demonstrating the underlying odd behavior of approxQuantile(): the result varies significantly for modest changes in the specified relativeError parameter, and by far more than the magnitude of relativeError itself.

scala> val df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("./20191001_example_data_approx_quantile_bug")
df: org.apache.spark.sql.DataFrame = [value: double]

scala> df
res5: org.apache.spark.sql.DataFrame = [value: double]

scala> df.stat.approxQuantile("value", Array(0.9), 0)
res6: Array[Double] = Array(0.5929591082174609)

scala> df.stat.approxQuantile("value", Array(0.9), 0.001)
res7: Array[Double] = Array(0.67621027121925)

scala> df.stat.approxQuantile("value", Array(0.9), 0.002)
res8: Array[Double] = Array(0.5926195654486178)

scala> df.stat.approxQuantile("value", Array(0.9), 0.003)
res9: Array[Double] = Array(0.5924693999048418)

scala> df.stat.approxQuantile("value", Array(0.9), 0.004)
res10: Array[Double] = Array(0.67621027121925)

scala> df.stat.approxQuantile("value", Array(0.9), 0.005)
res11: Array[Double] = Array(0.5923925937051544)

Here is the data used for this example: 20191001_example_data_approx_quantile_bug.zip
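The documented contract of approxQuantile is a rank guarantee: for target quantile phi and relativeError err over n rows, the returned value x should satisfy floor((phi - err) * n) <= rank(x) <= ceil((phi + err) * n). A small checker for that guarantee, in Python for illustration (my own sketch, not Spark code), shows the kind of test the results above fail:

```python
import math

# Sketch of the rank guarantee approxQuantile is documented to provide.
# The checker is my own illustration, not Spark code.
def rank_within_bounds(sorted_values, result, phi, err):
    """True if `result` has rank within [floor((phi-err)*n), ceil((phi+err)*n)]."""
    n = len(sorted_values)
    rank = sum(1 for v in sorted_values if v <= result)
    return math.floor((phi - err) * n) <= rank <= math.ceil((phi + err) * n)

values = sorted(i / 999 for i in range(1000))   # toy data, uniform on [0, 1]
exact = values[int(0.9 * (len(values) - 1))]    # exact 90th percentile

print(rank_within_bounds(values, exact, 0.9, 0.001))  # exact result passes
print(rank_within_bounds(values, 0.95, 0.9, 0.001))   # far-off value fails
```

Under that contract, moving relativeError from 0.001 to 0.002 should shift the result's rank by at most a few per mille of n, so jumps like 0.676 vs 0.593 on the same data are inconsistent with the guarantee.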

jverbus commented 4 years ago

I tried on the latest Spark version (2.4.4) and got the same result.

Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.4
      /_/

Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_212)
Type in expressions to have them evaluated.
Type :help for more information.

scala> val df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("./20191001_example_data_approx_quantile_bug")
df: org.apache.spark.sql.DataFrame = [value: double]

scala> df.stat.approxQuantile("value", Array(0.9), 0)
res0: Array[Double] = Array(0.5929591082174609)

scala> df.stat.approxQuantile("value", Array(0.9), 0.001)
res1: Array[Double] = Array(0.67621027121925)

scala> df.stat.approxQuantile("value", Array(0.9), 0.002)
res2: Array[Double] = Array(0.5926195654486178)

scala> df.stat.approxQuantile("value", Array(0.9), 0.003)
res3: Array[Double] = Array(0.5924693999048418)

scala> df.stat.approxQuantile("value", Array(0.9), 0.004)
res4: Array[Double] = Array(0.67621027121925)

scala> df.stat.approxQuantile("value", Array(0.9), 0.005)
res5: Array[Double] = Array(0.5923925937051544)

jverbus commented 4 years ago

I reported the bug to the Spark project as well.

https://issues.apache.org/jira/browse/SPARK-29325