Closed: abdulnyctale closed this issue 5 years ago
Thanks for reporting your tests.
Predictions are derived from the contamination rate: the threshold is calculated from the param contamination, separately for each dataset being predicted.
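To illustrate why a per-dataset threshold behaves this way, here is a minimal sketch in plain Python. It is not the library's code: the helper names and the score values are made up for illustration, and the cutoff rule (take the top contamination fraction of the scores in the batch) is an assumption about how such a threshold is typically derived.

```python
def threshold_from_contamination(scores, contamination):
    """Hypothetical helper: cutoff such that about `contamination` of the
    given scores lie at or above it."""
    ranked = sorted(scores, reverse=True)
    # Number of points to flag, at least one.
    k = max(1, int(round(contamination * len(ranked))))
    return ranked[k - 1]


def label(scores, threshold):
    """Label a score 1 (anomaly) if it reaches the threshold, else 0."""
    return [1 if s >= threshold else 0 for s in scores]


# Made-up batch that contains only anomalies, all with high scores.
new_scores = [0.60, 0.61, 0.62, 0.63, 0.64, 0.65, 0.60, 0.62, 0.64, 0.65]
contamination = 0.1

# The threshold is recomputed from the batch being predicted, so only the
# top fraction of this batch is flagged, however high the other scores are.
per_batch_threshold = threshold_from_contamination(new_scores, contamination)
per_batch_labels = label(new_scores, per_batch_threshold)

print(per_batch_threshold)
print(per_batch_labels)
```

With a threshold computed this way, a batch made up entirely of anomalies still gets most of its rows labeled 0, because the cutoff moves with the batch.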
I have just added a param threshold to IForestModel for the cases you reported. The threshold is stored in the IForestModel after model fitting, and you can now also set your own threshold.
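The idea of a threshold that is remembered after fitting can be sketched as follows. This is plain Python with a hypothetical class, not the IForestModel API; the score values are made up, and the scoring step itself is left out so that only the threshold handling is shown.

```python
class FixedThresholdLabeler:
    """Hypothetical illustration: learn the cutoff once from training
    scores, then reuse it for every later batch."""

    def __init__(self, contamination, threshold=None):
        self.contamination = contamination
        # A user-supplied threshold takes precedence over the fitted one.
        self.threshold = threshold

    def fit(self, training_scores):
        if self.threshold is None:
            ranked = sorted(training_scores, reverse=True)
            k = max(1, int(round(self.contamination * len(ranked))))
            self.threshold = ranked[k - 1]
        return self

    def predict(self, scores):
        if self.threshold is None:
            raise ValueError("call fit() or pass a threshold first")
        return [1 if s >= self.threshold else 0 for s in scores]


# Made-up training scores: mostly normal points, one mild outlier.
training_scores = [0.35, 0.38, 0.40, 0.41, 0.42, 0.39, 0.37, 0.44, 0.36, 0.58]
model = FixedThresholdLabeler(contamination=0.1).fit(training_scores)

# A new batch of only anomalies is now compared with the stored cutoff.
new_scores = [0.60, 0.61, 0.62, 0.63, 0.64, 0.65]
print(model.threshold)
print(model.predict(new_scores))
print(model.predict([0.40, 0.42]))
```

Because the cutoff no longer depends on the batch being predicted, a batch consisting only of anomalies is labeled consistently with the training data.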
Thank you for the fix, it works now.
Hi,
I would first like to thank you for implementing the library. Before integrating it into our Spark project, we wanted to test it. We used the same dataset with scikit-learn and with this library, and this library does not work for new data (anomalies): it labels them as normal data. I guess it calculates the threshold with respect to the new data, so only a few of the points are declared anomalies and the rest are labeled "0", although the scores are quite high (between 0.60-0.65). If I build a new test set by appending the anomalous data to the training data and predict on it, the anomalies are labeled correctly. So I think the problem is in the threshold calculation.
Here is the example
As you can see, for self-generated, obvious anomalies the predictions are wrong even though the scores are quite high.
After appending them to the training data and predicting, they are labeled correctly.
So I think you need to move the threshold calculation to the model fitting step.