jverbus closed this 4 years ago
For reference, here is an example demonstrating the odd underlying behavior of approxQuantile(). The result varies significantly for modest changes to the specified relativeError parameter, far more than the magnitude of relativeError itself.
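For context, relativeError = ε is supposed to bound how far the returned element's rank can deviate from the target rank (by roughly ε·n), so results for nearby ε values should stay within a narrow band; the jumps in the transcript below are far larger than that. As a point of comparison, here is a minimal pure-Scala sketch of what an exact quantile lookup (the relativeError = 0 case) does, using one common rank-based definition and hypothetical data rather than the attached file:

```scala
// Exact quantile by rank: with relativeError = 0, the result should be the
// element whose rank in the sorted data matches the requested probability.
// This is an illustrative rank-based definition, not Spark's implementation.
object ExactQuantile {
  def quantile(data: Seq[Double], p: Double): Double = {
    val sorted = data.sorted
    // Smallest element whose rank is >= p * n (1-based ranks).
    val rank = math.ceil(p * sorted.length).toInt
    sorted(math.max(rank - 1, 0))
  }

  def main(args: Array[String]): Unit = {
    val xs = (1 to 100).map(_.toDouble) // hypothetical data: 1.0 .. 100.0
    println(quantile(xs, 0.9))          // the 0.9-quantile element
  }
}
```

With an exact definition like this, the answer is fully determined by the data and p; it cannot flip between two distant values as ε nudges from 0.001 to 0.002, which is why the transcript below looks like a bug.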
scala> val df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("./20191001_example_data_approx_quantile_bug")
df: org.apache.spark.sql.DataFrame = [value: double]
scala> df
res5: org.apache.spark.sql.DataFrame = [value: double]
scala> df.stat.approxQuantile("value", Array(0.9), 0)
res6: Array[Double] = Array(0.5929591082174609)
scala> df.stat.approxQuantile("value", Array(0.9), 0.001)
res7: Array[Double] = Array(0.67621027121925)
scala> df.stat.approxQuantile("value", Array(0.9), 0.002)
res8: Array[Double] = Array(0.5926195654486178)
scala> df.stat.approxQuantile("value", Array(0.9), 0.003)
res9: Array[Double] = Array(0.5924693999048418)
scala> df.stat.approxQuantile("value", Array(0.9), 0.004)
res10: Array[Double] = Array(0.67621027121925)
scala> df.stat.approxQuantile("value", Array(0.9), 0.005)
res11: Array[Double] = Array(0.5923925937051544)
Here is the data used for this example: 20191001_example_data_approx_quantile_bug.zip
I tried on the latest Spark version (2.4.4) and got the same result.
Spark session available as 'spark'.
Welcome to Spark version 2.4.4
Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_212)
scala> val df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("./20191001_example_data_approx_quantile_bug")
df: org.apache.spark.sql.DataFrame = [value: double]
scala> df.stat.approxQuantile("value", Array(0.9), 0)
res0: Array[Double] = Array(0.5929591082174609)
scala> df.stat.approxQuantile("value", Array(0.9), 0.001)
res1: Array[Double] = Array(0.67621027121925)
scala> df.stat.approxQuantile("value", Array(0.9), 0.002)
res2: Array[Double] = Array(0.5926195654486178)
scala> df.stat.approxQuantile("value", Array(0.9), 0.003)
res3: Array[Double] = Array(0.5924693999048418)
scala> df.stat.approxQuantile("value", Array(0.9), 0.004)
res4: Array[Double] = Array(0.67621027121925)
scala> df.stat.approxQuantile("value", Array(0.9), 0.005)
res5: Array[Double] = Array(0.5923925937051544)
I reported the bug to the Spark project as well.
Fixed the rare unexpected behavior reported in https://github.com/linkedin/isolation-forest/issues/3, which was due to an issue with Spark's approxQuantile method.
I made two changes to work around the unexpected approxQuantile behavior.
1) The isolation forest model now accepts a contaminationError parameter, which allows the user to set approxQuantile's relativeError parameter. The default value is now 0.0, which forces an exact calculation.
2) The isolation forest model now calculates the observed contamination after training a model and compares it to the expected contamination specified via the model parameters. If the deviation is unexpectedly large, a warning is issued.
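Together, the two changes amount to: choose the score threshold as an exact quantile of the anomaly scores (contaminationError = 0.0 forces relativeError = 0 in approxQuantile), then sanity-check the fraction of points actually flagged. Here is a pure-Scala sketch of that logic; it is illustrative only, with hypothetical scores and a made-up tolerance, whereas the real implementation operates on Spark DataFrames inside the isolation-forest library:

```scala
object ContaminationCheck {
  // Exact (1 - contamination) quantile of the anomaly scores, analogous to
  // calling approxQuantile with relativeError = 0 on the score column.
  def threshold(scores: Seq[Double], contamination: Double): Double = {
    val sorted = scores.sorted
    val rank = math.ceil((1.0 - contamination) * sorted.length).toInt
    sorted(math.max(rank - 1, 0))
  }

  // Observed contamination: fraction of points scoring above the threshold.
  // A large gap versus the expected contamination warrants a warning.
  def observedContamination(scores: Seq[Double], thr: Double): Double =
    scores.count(_ > thr).toDouble / scores.length

  def main(args: Array[String]): Unit = {
    val scores   = (1 to 1000).map(_ / 1000.0) // hypothetical anomaly scores
    val expected = 0.05
    val thr      = threshold(scores, expected)
    val observed = observedContamination(scores, thr)
    // Hypothetical tolerance; the library's actual criterion may differ.
    if (math.abs(observed - expected) > 0.5 * expected)
      println(f"Warning: observed contamination $observed%.3f deviates from expected $expected%.3f")
  }
}
```

With a well-behaved (exact) threshold, the observed and expected contamination agree closely; the buggy approxQuantile results in issue #3 are exactly the situation this warning is meant to catch.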
I also added a new unit test to validate the new relativeError = 0.0 case.
Testing done:
The build passes: https://travis-ci.org/linkedin/isolation-forest/builds/591787128
If the old (inflexible) contaminationError value is used, I verified that a warning message is now displayed in the case reported in issue #3. In that case, the scores10 result is wrong and should be ~70.
I also verified that if the new default is used (contaminationError = 0.0), the model does not throw a warning message and correctly calculates the threshold for the case reported in issue #3.
scala> scores05.agg(sum("predictedLabel")).show()
+-------------------+
|sum(predictedLabel)|
+-------------------+
|               35.0|
+-------------------+

scala> scores10.agg(sum("predictedLabel")).show()
+-------------------+
|sum(predictedLabel)|
+-------------------+
|               70.0|
+-------------------+