titicaca / spark-iforest

Isolation Forest on Spark
Apache License 2.0

Threshold calculation doesn't consider training data (fitted); anomaly score doesn't work for new data #8

Closed: abdulnyctale closed this issue 5 years ago

abdulnyctale commented 5 years ago

Hi,

I would like to thank you first for implementing this library. Before integrating it into our Spark project, we tested it against scikit-learn using the same dataset, and it doesn't work for new data (anomalies): it labels them as normal. I guess it calculates the threshold with respect to this new data, so only a few of them are declared anomalies and the rest get "0", even though the scores are quite high (between 0.60 and 0.65).

If I build a new test set by appending the anomaly data to the training data and predicting on that, it correctly labels them as anomalies. So I think the problem is in the threshold calculation.

Here is the example:

import numpy as np
import pandas as pd
from pyspark.ml.linalg import Vectors
from pyspark_iforest.ml.iforest import IForest

# Assumes an active SparkSession bound to `spark` (e.g., the pyspark shell)
rng = np.random.RandomState(42)

# Generate train data
X = 0.3 * rng.randn(100, 2)
X_train = np.r_[X + 2, X - 2]
Xtrain = map(lambda x: Vectors.dense(x), X_train)
dfltrain = pd.DataFrame(list(Xtrain))
dfnbtrain = spark.createDataFrame(dfltrain, ["features"])

# Generate some regular novel observations
X = 0.3 * rng.randn(20, 2)
X_test = np.r_[X + 2, X - 2]
Xtest = map(lambda x: Vectors.dense(x), X_test)
dfltest = pd.DataFrame(list(Xtest))
dfnbtest = spark.createDataFrame(dfltest, ["features"])

# Generate some abnormal novel observations
X_outliers = rng.uniform(low=-4, high=4, size=(20, 2))
XOutliers = map(lambda x: Vectors.dense(x), X_outliers)
dflOutliers = pd.DataFrame(list(XOutliers))
dfnbOutliers = spark.createDataFrame(dflOutliers, ["features"])

# Init an IForest object
iforest = IForest(maxSamples=100)
iforest.setSeed(42)

# Fit on the training data frame
model = iforest.fit(dfnbtrain)
y_pred_train = model.transform(dfnbtrain)
y_pred_test = model.transform(dfnbtest)
y_pred_outliers = model.transform(dfnbOutliers)

# Fraction of regular test points predicted as normal (prediction == 0)
print("Accuracy:", y_pred_test.filter("prediction = 0.0").count() / y_pred_test.count())
# Accuracy: 0.9
# Fraction of outliers predicted as anomalies (prediction == 1)
print("Accuracy:", y_pred_outliers.filter("prediction = 1.0").count() / y_pred_outliers.count())
# Accuracy: 0.1

As you can see, for self-generated and obvious anomalies it makes wrong predictions even though the scores are quite high:

[Row(features=DenseVector([-0.4882, -3.3723]), anomalyScore=0.6670575241125163, prediction=0.0),
 Row(features=DenseVector([-3.7972, 3.7012]), anomalyScore=0.6854332724633916, prediction=1.0),
 Row(features=DenseVector([2.6878, 1.5678]), anomalyScore=0.6374480738168643, prediction=0.0),
 Row(features=DenseVector([-0.7284, -2.6136]), anomalyScore=0.6566451502774648, prediction=0.0),
 Row(features=DenseVector([-2.7485, -1.9981]), anomalyScore=0.5795116354078663, prediction=0.0),
 Row(features=DenseVector([0.3938, 1.7168]), anomalyScore=0.5973296265149223, prediction=0.0),
 Row(features=DenseVector([1.2816, -1.7605]), anomalyScore=0.6347619060759053, prediction=0.0),
 Row(features=DenseVector([3.6389, 1.9032]), anomalyScore=0.6140531002586469, prediction=0.0),
 Row(features=DenseVector([0.4348, 0.8938]), anomalyScore=0.6523854328525718, prediction=0.0),
 Row(features=DenseVector([-0.6432, -2.0182]), anomalyScore=0.6014666752642815, prediction=0.0),
 Row(features=DenseVector([-1.1522, 2.0628]), anomalyScore=0.6080298691384164, prediction=0.0),
 Row(features=DenseVector([-3.8849, -3.0714]), anomalyScore=0.6730886352051572, prediction=0.0),
 Row(features=DenseVector([-3.632, -3.6742]), anomalyScore=0.6730886352051572, prediction=0.0),
 Row(features=DenseVector([2.8437, 1.6293]), anomalyScore=0.6342865718344055, prediction=0.0),
 Row(features=DenseVector([-0.2066, -3.2173]), anomalyScore=0.6725254553370426, prediction=0.0),
 Row(features=DenseVector([-0.0671, -0.2122]), anomalyScore=0.6600221092919962, prediction=0.0),
 Row(features=DenseVector([-2.6144, -0.5292]), anomalyScore=0.6569516722831921, prediction=0.0),
 Row(features=DenseVector([-0.812, 0.9268]), anomalyScore=0.6534675412783653, prediction=0.0),
 Row(features=DenseVector([1.0807, -3.6376]), anomalyScore=0.6764378581357128, prediction=1.0),
 Row(features=DenseVector([-1.0031, 1.0069]), anomalyScore=0.6521343894913213, prediction=0.0)]

By appending them to the training data and predicting, it labels them correctly.

So I think you need to move the threshold calculation to the model-fitting step.

titicaca commented 5 years ago

Thanks for reporting your tests.

The predictions are based on the contamination rate: the threshold is calculated from the param contamination for each dataset being predicted.
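
Roughly, the logic is as follows (a simplified PySpark sketch of what the implementation does internally; the real code is Scala, and approxQuantile here only stands in for the internal quantile computation):

# Simplified sketch of the current per-dataset thresholding; the quantile
# call below is an illustration, not the library's actual code path.
contamination = 0.1  # assumed default param value
scored = model.transform(dfnbOutliers)
# The threshold is the (1 - contamination) quantile of the anomaly scores
# of the dataset currently being transformed, so roughly contamination * N
# points get flagged no matter how anomalous the dataset is overall.
threshold = scored.approxQuantile("anomalyScore", [1.0 - contamination], 0.0)[0]
flagged = scored.where(scored["anomalyScore"] > threshold)

This is why the same point can flip labels depending on which dataset it is scored together with.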

titicaca commented 5 years ago

I have just added a param threshold in IForestModel for your reported cases. The threshold is now remembered in the IForestModel after model fitting, and you can also set your own threshold.
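
For example (a usage sketch; the accessor names below are assumed to follow the usual Spark ML param conventions):

# Sketch: the threshold param lives on the fitted model (setThreshold /
# getThreshold are assumed per Spark ML param naming conventions).
model = iforest.fit(dfnbtrain)
print(model.getThreshold())   # threshold remembered from fitting
model.setThreshold(0.6)       # or override with your own cutoff
y_pred_outliers = model.transform(dfnbOutliers)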

abdulnyctale commented 5 years ago

Thank you for the fix, it works now.