sahandha / eif

Extended Isolation Forest for Anomaly Detection

Effect of feature scaling #31

Open felixcaz opened 3 years ago

felixcaz commented 3 years ago

Hi, thanks for the great package (and the example notebooks!). My issue is summarised in two points:

The following illustrates this further:

I have noticed that the extended forest gives odd results when applied to features with very different scales. For example, if I draw 2D points from two normal distributions with variances 1 and 1000 and plot contour maps comparing the regular iForest with the extended one, the contours of the extended forest become horizontal and its heat map is generally worse than that of the regular iForest. (image)

It seems as though the choice of hyperplane is biased towards horizontal lines. This is also noticeable in the examples given in the paper (Figure 9), where three plots of tree splits are shown: (image) In the first two examples (a and b), the x and y values of the data lie on the same scale and the splits look randomly oriented. In c), however, the x scale of the data is much larger than the y scale, and most splits look more vertical. As a result, we see areas of higher anomaly score above and below the point cloud in the resulting heat map: (image)
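A minimal sketch of how such a bias could be measured, not taken from the package itself. It assumes the split rule described in the paper, where each split has a normal vector with independent standard-normal components, and it assumes the variance-1000 feature is on the y axis. The function name `horizontal_fraction` and its parameters are hypothetical. It counts how many random splits look nearly horizontal once each axis is drawn in units of its own standard deviation, which is roughly what an auto-scaled plot does.

```python
import math
import random


def horizontal_fraction(sigma_x, sigma_y, n_splits=20000, max_angle_deg=10.0, seed=0):
    """Fraction of random splits that look near-horizontal when each axis
    is rescaled by its own standard deviation (hypothetical helper)."""
    rng = random.Random(seed)
    count = 0
    for _ in range(n_splits):
        # Assumed split rule: normal vector with i.i.d. N(0, 1) components.
        nx, ny = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        # After x -> x / sigma_x and y -> y / sigma_y, the same split
        # has this normal vector in the rescaled coordinates.
        ex, ey = nx * sigma_x, ny * sigma_y
        # Angle between the split line and the horizontal axis, in [0, 90].
        angle = math.degrees(math.atan2(abs(ex), abs(ey)))
        if angle < max_angle_deg:
            count += 1
    return count / n_splits


# Equal scales versus variances 1 (x) and 1000 (y).
equal = horizontal_fraction(1.0, 1.0)
skewed = horizontal_fraction(1.0, math.sqrt(1000.0))
print(f"near-horizontal splits, equal scales:  {equal:.2f}")
print(f"near-horizontal splits, skewed scales: {skewed:.2f}")
```

Swapping the two arguments puts the large scale on x instead, in which case the same count would favour near-vertical splits, as in panel c).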

This issue is easily fixed by scaling all features before using the forest. However, I was wondering: if the splits are made on hyperplanes of random orientation, why (or how) does feature scale influence the orientation of the splits in each tree?
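The scaling workaround can be sketched as follows. This is only an illustration using the Python standard library; the helper name `standardize` is hypothetical, the choice of zero-mean, unit-variance scaling is an assumption (the post does not say which scaling was used), and the forest call itself is left out.

```python
import random
import statistics


def standardize(rows):
    """Rescale every column to zero mean and unit variance
    (hypothetical helper; constant columns are mapped to 0.0)."""
    columns = list(zip(*rows))
    means = [statistics.mean(c) for c in columns]
    stds = [statistics.pstdev(c) for c in columns]
    return [
        tuple((v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, stds))
        for row in rows
    ]


# Synthetic 2D data with variances 1 and 1000, as in the example above.
rng = random.Random(0)
raw = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1000.0 ** 0.5)) for _ in range(500)]

scaled = standardize(raw)
scaled_stds = [statistics.pstdev(c) for c in zip(*scaled)]
print("column std after scaling:", [round(s, 3) for s in scaled_stds])
# The scaled rows, rather than the raw ones, would then be passed to the forest.
```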

Apologies if I am missing something obvious; any insight would be useful. Thanks!