linkedin / isolation-forest

A distributed Spark/Scala implementation of the isolation forest algorithm for unsupervised outlier detection, featuring support for scalable training and ONNX export for seamless cross-platform inference.
Other
229 stars 47 forks source link

Wrong count of anomalies without respecting contamination #3

Closed pramodreddy2006 closed 5 years ago

pramodreddy2006 commented 5 years ago

I expected that anomaly count and OutlierScore will change according to contamination provided. But when I try to modify the contamination value, at times it gives the anomaly count as 1 as outlier score is calculated wrong(equivalent to top score).

jverbus commented 5 years ago

Hi @pramodreddy2006, thanks for reaching out!

Can you please share the exact parameter settings you were using when you encountered this potential issue? If you're able to share the data used for training and scoring that would be ideal.

Thanks!

pramodreddy2006 commented 5 years ago

Archive.zip

Attached the archive containing data file(csv) and my java class. Works fine if contamination is 0.05, but I get only 1 anomaly when set to 0.1

           ```
            IsolationForest isolationForest = new IsolationForest();
    isolationForest.setNumEstimators(100);
    // isolationForest.setContamination(0.05);
    isolationForest.setContamination(0.1);
    isolationForest.setFeaturesCol("indexedFeatures");
    IsolationForestModel ifModel = isolationForest.fit(dataset);
    dataset = ifModel.transform(dataset);
    Dataset<Row> countDF = dataset.groupBy("predictedLabel").count();
    List<Row> countRows = countDF.collectAsList();
    Long regular = 0l;
    Long anomalies = 0l;
    for (Row row : countRows) {
        Double label = row.getAs("predictedLabel");
        long count = row.getAs("count");
        if (label > 0) {
            anomalies = count;
        } else {
            regular = count;
        }
    }
    System.out.println("Regular :" + regular);
    System.out.println("Anomalies :" + anomalies);
    System.out.println("Outlier Score Threshold :" + ifModel.getOutlierScoreThreshold());
jverbus commented 5 years ago

Thanks for the additional information!

I translated some of your code into Scala and reproduced the issue.

import org.apache.spark.ml.feature.Imputer
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.feature.VectorIndexer

import com.linkedin.relevance.isolationforest._

val path = "/Users/jverbus//Desktop/Archive/breast_cancer_data.csv"
val df = spark.read.option("header", true).csv(path)

val featureColumnList = Array(
  "clump_thickness",
  "cell_size_uniformity",
  "cell_shape_uniformity",
  "marginal_adhesion",
  "single_ep_cell_size",
  "bare_nuclei",
  "bland_chromatin",
  "normal_nucleoli",
  "mitoses")

val dfCast = df
  .withColumn("clump_thickness", df("clump_thickness").cast("double"))
  .withColumn("cell_size_uniformity", df("cell_size_uniformity").cast("double"))
  .withColumn("cell_shape_uniformity", df("cell_shape_uniformity").cast("double"))
  .withColumn("marginal_adhesion", df("marginal_adhesion").cast("double"))  
  .withColumn("single_ep_cell_size", df("single_ep_cell_size").cast("double"))
  .withColumn("bare_nuclei", df("bare_nuclei").cast("double"))
  .withColumn("bland_chromatin", df("bland_chromatin").cast("double"))
  .withColumn("normal_nucleoli", df("normal_nucleoli").cast("double"))
  .withColumn("mitoses", df("mitoses").cast("double"))

val imputer = new Imputer()
imputer.setInputCols(featureColumnList)
imputer.setOutputCols(featureColumnList)

val imputerModel = imputer.fit(dfCast)
val dfCastImputed = imputerModel.transform(dfCast)

val assembler = new VectorAssembler()
  .setInputCols(featureColumnList)
  .setOutputCol("features")
val dfCastImputedAssembled = assembler.transform(dfCastImputed)

val vectorIndexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
    .setMaxCategories(4)

val vectorIndexerModel = vectorIndexer.fit(dfCastImputedAssembled)
val dfCastImputedAssembledIndexed = vectorIndexerModel.transform(dfCastImputedAssembled)

val isolationForest = new IsolationForest()
isolationForest.setNumEstimators(100)
isolationForest.setContamination(0.05)
isolationForest.setFeaturesCol("indexedFeatures")

val isolationForestModel05 = isolationForest.fit(dfCastImputedAssembledIndexed)
val scores05 = isolationForestModel05.transform(dfCastImputedAssembledIndexed)

val isolationForest = new IsolationForest()
isolationForest.setNumEstimators(100)
isolationForest.setContamination(0.1)
isolationForest.setFeaturesCol("indexedFeatures")

val isolationForestModel10 = isolationForest.fit(dfCastImputedAssembledIndexed)
val scores10 = isolationForestModel10.transform(dfCastImputedAssembledIndexed)

Which gives the results:

scala> scores05.agg(sum("predictedLabel")).show()
+-------------------+
|sum(predictedLabel)|
+-------------------+
|               35.0|
+-------------------+

scala> scores10.agg(sum("predictedLabel")).show()
+-------------------+
|sum(predictedLabel)|
+-------------------+
|                1.0|
+-------------------+
jverbus commented 5 years ago

As we expect, the scores are the same regardless of the contamination choice:

scala> scores10.select("outlierScore").collect().deep == scores05.select("outlierScore").collect().deep
res25: Boolean = true

The odd behavior for contamination = 0.10 case must be due to the choice of the outlier score threshold, which is calculated using Spark's approxQuantile method:

https://github.com/linkedin/isolation-forest/blob/master/isolation-forest/src/main/scala/com/linkedin/relevance/isolationforest/IsolationForest.scala#L141

We use this because it can be very costly to calculate the threshold exactly on very large datasets.

This file contains Spark's approxQuantile method: https://github.com/apache/spark/blob/eef3abbb903f95178f225ae0f6e3db2d9cf64175/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala#L61

jverbus commented 5 years ago

This is the threshold calculated by the model with a contamination of 0.05:

scala> isolationForestModel05.getOutlierScoreThreshold
res23: Double = 0.6185162204274052

We can reproduce this result using the approxQuantile method on our scores dataframe:

scala> scores05.stat.approxQuantile("outlierScore", Array(1 - 0.05), 0.05 * 0.01)
res26: Array[Double] = Array(0.6185162204274052)

If we set the relativeError to 0, which will force an exact calculation, we get the same result:

scala> scores05.stat.approxQuantile("outlierScore", Array(1 - 0.05), 0)
res27: Array[Double] = Array(0.6185162204274052)

However, the case for the 0.10 threshold is different...

This is the threshold calculated by the model with a contamination of 0.10:

scala> isolationForestModel10.getOutlierScoreThreshold
res24: Double = 0.67621027121925

We can reproduce this result using the approxQuantile method on our scores dataframe:

scala> scores10.stat.approxQuantile("outlierScore", Array(1 - 0.1),  0.10 * 0.01)
res29: Array[Double] = Array(0.67621027121925)

If we set the relativeError to 0, which will force an exact calculation, we get a different result:

scala> scores10.stat.approxQuantile("outlierScore", Array(1 - 0.1), 0)
res28: Array[Double] = Array(0.5929591082174609)
jverbus commented 5 years ago

If we use this exactly calculated threshold for the contamination = 0.10 case, we get reasonable results:

scala> val newScores10 = scores10.select("outlierScore").withColumn("newLabel", (col("outlierScore") >= 0.5929591082174609).cast("double"))
newScores10: org.apache.spark.sql.DataFrame = [outlierScore: double, newLabel: double]

scala> newScores10.agg(sum("newLabel")).show()
+-------------+
|sum(newLabel)|
+-------------+
|         70.0|
+-------------+

scala> newScores10.count()
res44: Long = 699

70 / 699 = 0.10

jverbus commented 5 years ago

It is important to note this is a rare case. The model was validated on 12 benchmark datasets with varying contamination values without seeing this issue.

We need to keep the option of an approximate threshold calculation, because an exact calculation can cause issues on very large datasets.

I will add a model parameter that allows the user to choose if they want to exactly calculate the threshold. I will also add a check that will give a warning if the threshold is calculated approximately and the results don't make sense.

In the interim, you can calculate your own exact threshold and corresponding labels based upon the scores dataframe as shown above.

pramodreddy2006 commented 5 years ago

@jverbus
Got it. Thanks a lot James.

jverbus commented 5 years ago

@pramodreddy2006: Happy to help! Thanks for raising this!

jverbus commented 5 years ago

@pramodreddy2006: I just pushed a fix to the issue you reported.

https://github.com/linkedin/isolation-forest/pull/4

The library now uses an exact calculation of the threshold by default (slow and not as scalable). There is a new parameter contaminationError that you can specify if you want an approximate, but fast and scalable threshold calculation.

The underlying issue with Spark's approxQuantile() method is still there, so the approximate threshold calculation may occasionally have the bug you observed. If there is a disagreement between the expected and observed number of outliers during training, a warning will be shown to the user.

I reported the approxQuantile() bug to the Spark team. https://issues.apache.org/jira/browse/SPARK-29325

Please try version 0.3.0 out and let me know if it works for you!

pramodreddy2006 commented 5 years ago

@jverbus Thanks. Works as explained.