titicaca / spark-iforest

Isolation Forest on Spark
Apache License 2.0

featureIdx shuffling results in wrong featureIndex in the tree #7

Closed. hazemsoliman115 closed this issue 5 years ago

hazemsoliman115 commented 5 years ago

I am testing with the following data (four rows of four features each):

[1.0, 2.0, 2.5, 0.2]
[2.3, 5.0, 0.75, 0.9]
[1.3, 2.4, 1.9, 0.45]
[10.3, 20.4, 10.9, 10.45]

and the following parameters:

IForest iForest = new IForest()
    .setNumTrees(5)
    .setMaxSamples(3)
    .setContamination(0.3)
    .setBootstrap(false)
    .setMaxDepth(2)
    .setSeed(123456L);

The trees are as follows:

tree[0] featureIndex: 1 featureValue: 9.417743965315601
tree[1] featureIndex: 0 featureValue: 4.977936221311794
tree[2] featureIndex: 0 featureValue: 4.866908154888555
tree[3] featureIndex: 1 featureValue: 9.391937564448492
tree[4] featureIndex: 0 featureValue: 20.26467549071234

The last tree, tree[4], has a featureValue (20.26) outside the range of values for featureIndex 0, which is 1.0 to 10.3.
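The observation above can be checked mechanically by comparing each reported split value against the min/max of the feature it claims to use. The following is a minimal sketch in plain Java, assuming the data is laid out as four rows of four features. The class and method names (`SplitRangeCheck`, `outOfRangeTrees`) are made up for illustration and are not part of spark-iforest.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitRangeCheck {
    // Data from the report, read as four rows of four features (assumed layout).
    static final double[][] DATA = {
        {1.0, 2.0, 2.5, 0.2},
        {2.3, 5.0, 0.75, 0.9},
        {1.3, 2.4, 1.9, 0.45},
        {10.3, 20.4, 10.9, 10.45}
    };

    // Root splits as printed in the report, one entry per tree.
    static final int[] FEATURE_INDEX = {1, 0, 0, 1, 0};
    static final double[] FEATURE_VALUE = {
        9.417743965315601, 4.977936221311794, 4.866908154888555,
        9.391937564448492, 20.26467549071234
    };

    /** Returns the trees whose split value lies outside [min, max] of the reported feature. */
    static List<Integer> outOfRangeTrees(double[][] data, int[] featureIndex, double[] featureValue) {
        List<Integer> bad = new ArrayList<>();
        for (int t = 0; t < featureIndex.length; t++) {
            int f = featureIndex[t];
            double min = Double.POSITIVE_INFINITY;
            double max = Double.NEGATIVE_INFINITY;
            for (double[] row : data) {
                min = Math.min(min, row[f]);
                max = Math.max(max, row[f]);
            }
            if (featureValue[t] < min || featureValue[t] > max) {
                bad.add(t);
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        // Prints [4]: only tree[4] splits outside the range of its reported feature.
        System.out.println(outOfRangeTrees(DATA, FEATURE_INDEX, FEATURE_VALUE));
    }
}
```

A split inside the full-data range can still carry the wrong index, so this check only catches the cases where the mismatch is visible from the ranges alone.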

I tracked the issue to line 553 in iForest.scala, where the feature indices are shuffled. This reordering seems to be lost afterwards, which later results in a wrong attrIndex: the attrIndex is based on the shuffled data, not the original data.
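To illustrate the kind of mismatch described above, here is a small self-contained Java sketch. It is not the code from iForest.scala. All names (`ShuffledFeatureIndex`, `sampleFeatures`, `project`, `originalFeatureIndex`) and the permutation used in `main` are hypothetical. It assumes that features are sampled by shuffling their indices, that rows are projected onto the sampled features, and that a split is then chosen by position in the projected row.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class ShuffledFeatureIndex {
    /** Shuffles 0..numFeatures-1 and keeps the first numSampled entries (original feature indices). */
    static int[] sampleFeatures(int numFeatures, int numSampled, long seed) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < numFeatures; i++) {
            indices.add(i);
        }
        Collections.shuffle(indices, new Random(seed));
        int[] sampled = new int[numSampled];
        for (int i = 0; i < numSampled; i++) {
            sampled[i] = indices.get(i);
        }
        return sampled;
    }

    /** Projects a row onto the sampled features: position p holds row[sampled[p]]. */
    static double[] project(double[] row, int[] sampled) {
        double[] projected = new double[sampled.length];
        for (int p = 0; p < sampled.length; p++) {
            projected[p] = row[sampled[p]];
        }
        return projected;
    }

    /** Maps a position in the projected row back to the original feature index. */
    static int originalFeatureIndex(int position, int[] sampled) {
        return sampled[position];
    }

    public static void main(String[] args) {
        int[] sampled = {1, 0, 3, 2};             // hypothetical shuffle outcome
        double[] row = {10.3, 20.4, 10.9, 10.45}; // last row of the test data
        double[] projected = project(row, sampled);

        int position = 0; // the split is chosen on the first projected column
        System.out.println("value at position 0: " + projected[position]);
        // Storing the position itself as the index would point at original feature 0,
        // although the value actually comes from original feature 1.
        System.out.println("index if the position is stored: " + position);
        System.out.println("index after mapping back: " + originalFeatureIndex(position, sampled));
    }
}
```

Under these assumptions, a split value drawn from original feature 1 would be reported with featureIndex 0 unless the position is mapped back through the sampled index array.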

hazemsoliman115 commented 5 years ago

Another example, where the features are chosen to have different value ranges. Data:

[1.0, 20.0, 200.5, 0.002]
[2.3, 50.0, 300.75, 0.009]
[1.3, 20.4, 100.9, 0.0045]
[10.3, 200.4, 1000.9, 10.45]

and the resulting trees are:

tree[0] featureIndex: 1 featureValue: 9.371498894961741
tree[1] featureIndex: 0 featureValue: 199.07323144154924
tree[2] featureIndex: 0 featureValue: 192.39674371396512
tree[3] featureIndex: 1 featureValue: 9.371498894961741
tree[4] featureIndex: 0 featureValue: 199.07323144154924

titicaca commented 5 years ago

Thanks for reporting the problem. It was caused by the feature sampling. I have just fixed it. You can check the latest code in the master branch.