
Implement Extended Isolation Forest #16517

Open groud opened 4 years ago

groud commented 4 years ago

Describe the workflow you want to enable

In the context of anomaly detection, the Isolation Forest algorithm has a bias that makes some data points' anomaly scores lower than they should be. The problem arises in regions of the space that are axis-aligned with a cluster: a point very far from a cluster may still receive a low anomaly score simply because it is axis-aligned with that cluster. This leads to false negatives in my application.

To overcome this bias, Hariri et al. proposed the Extended Isolation Forest algorithm. While the standard Isolation Forest randomly chooses a feature and a threshold value to split the points, the extended version splits them with a random hyperplane. Since these random hyperplanes need not be axis-aligned, they remove the bias of the standard algorithm. In fact, the standard algorithm becomes a special case of the extended one, restricted to axis-aligned hyperplanes.
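
For intuition, here is a minimal sketch (illustrative only, not scikit-learn code) of the two branching rules the paper compares. In the extended version, a point x goes left when (x - p) · n < 0, for a random normal vector n and a random intercept point p drawn from the data range:

```python
import numpy as np

rng = np.random.default_rng(0)

def axis_aligned_split(X):
    """Standard Isolation Forest branching: random feature, random threshold."""
    j = rng.integers(X.shape[1])
    t = rng.uniform(X[:, j].min(), X[:, j].max())
    return X[:, j] < t  # boolean mask: points going to the left child

def random_hyperplane_split(X):
    """Extended Isolation Forest branching: random slope and intercept."""
    n = rng.normal(size=X.shape[1])                # hyperplane normal (random slope)
    p = rng.uniform(X.min(axis=0), X.max(axis=0))  # intercept point in the data range
    return (X - p) @ n < 0                         # which side of the hyperplane

X = rng.normal(size=(100, 2))
left = random_hyperplane_split(X)  # mask an isolation tree would recurse on
```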

Please have a look at the original paper; it explains the problem very well.

Describe your proposed solution

I had a look at the Isolation Forest code, and in my humble opinion the simplest solution would be to add an argument to the IsolationForest class constructor that chooses how the samples are split in two, maybe something like an extended argument.

This would basically modify the splitter argument passed to the underlying base_estimator (an ExtraTreeRegressor instance). We could add a "random_hyperplane" splitting mode, which requires implementing a new Splitter class.

Overall, this would not require many changes apart from adding the new splitter class. I can work on an implementation if we agree this is a useful addition.
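
To make the proposal concrete, usage could look like this (purely hypothetical: neither the extended argument nor a "random_hyperplane" splitter exists in scikit-learn; this only sketches the suggested API):

```python
from sklearn.ensemble import IsolationForest

# Hypothetical sketch of the proposed API -- `extended` does not exist today.
# extended=True would pass splitter="random_hyperplane" to the underlying
# ExtraTreeRegressor instead of the default axis-aligned splitter.
clf = IsolationForest(n_estimators=100, extended=True, random_state=0)
clf.fit(X_train)                        # X_train: any 2-D feature array
scores = clf.decision_function(X_test)  # lower = more anomalous
```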

Describe alternatives you've considered, if relevant

Maybe the Extended Isolation Forest algorithm should be a distinct class, but I doubt it is worth it.

NicolasHug commented 4 years ago

(Hi Gilles!!)

The paper seems rather new and may not satisfy our inclusion criteria yet.

It seems to follow the same idea as https://arxiv.org/abs/1506.03410 though

Implem-wise, I believe @MrAE has been working on it before and it turned out to be fairly tricky?

groud commented 4 years ago

Hi Nicolas, nice seeing you there! :)

The paper seems rather new and may not satisfy our inclusion criteria yet.

Yeah, I was not sure about that. I wanted to propose it nevertheless since it looked to me like a rather "small" addition, and because it fixes a problematic bias in the algorithm (more than it adds a new feature). But I agree the paper may be too recent to be accepted; I guess it depends on your policy regarding such changes.

It seems to follow the same idea as https://arxiv.org/abs/1506.03410 though

Yes, it looks like so.

funnell commented 4 years ago

Considering the problematic bias, maybe it would be better to have the "extended" algorithm replace the original, unless you can conceive of a situation where you'd still want the original. The extended algorithm looks consistently better in the paper.

zachmayer commented 3 years ago

It's worth noting there's already a python package for extended isolation forests: https://github.com/sahandha/eif

zachmayer commented 3 years ago

I think this idea could also be applied to ExtraTreesClassifier and ExtraTreesRegressor, for an extra degree of randomness.

psmgeelen commented 2 years ago

Hi there, what is the latest on this? I am trying to use the library at https://github.com/sahandha/eif, but there are a couple of issues with it, particularly in pickling the object. A python/numba implementation has already been delivered via a pull request, but it seems to be unanswered: https://github.com/sahandha/eif/pull/24. What more do we need? I want to add that this algorithm is probably one of the best there is for anomaly detection on correlated machine data. It really beats the non-extended Isolation Forest by a lot, and I have had huge success with it in industry. So... in my opinion, very worthwhile adding :).

valenad1 commented 2 years ago

@psmgeelen If you would like to have another python implementation, there is also Extended Isolation Forest by H2O-3 open-source: https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/eif.html

Any feedback will be appreciated.
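
For reference, usage looks roughly like this (a minimal sketch based on the linked docs; parameter names such as ntrees, sample_size, and extension_level are as documented there, but double-check against the current H2O-3 API):

```python
import h2o
from h2o.estimators import H2OExtendedIsolationForestEstimator

h2o.init()
frame = h2o.import_file("my_data.csv")  # any tabular dataset as an H2OFrame

# extension_level = n_features - 1 gives fully oblique splits;
# extension_level = 0 reduces to the standard Isolation Forest.
eif = H2OExtendedIsolationForestEstimator(
    ntrees=100,
    sample_size=256,
    extension_level=frame.ncol - 1,
)
eif.train(training_frame=frame)
scores = eif.predict(frame)  # per-row anomaly scores
```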

psmgeelen commented 2 years ago

Hi @valenad1, I am using eif_new.py from the pull request, which is extremely fast (built with numba). Would this be interesting to add to scikit-learn?

thomasjpfan commented 2 years ago

Currently, it would be difficult to introduce numba as a dependency to scikit-learn.

As for the extended isolation forest, as mentioned in https://github.com/scikit-learn/scikit-learn/issues/16517#issuecomment-589829247, the paper does not meet scikit-learn's inclusion criteria.

Recently there has been some interest in using non-axis-aligned hyperplanes for trees. Taken together, maybe there is some merit for inclusion.

REF: https://github.com/scikit-learn/scikit-learn/issues/20819

lpryszcz commented 2 years ago

I'm behind eif_new.py. I'm glad there is an interest and I'd be more than happy to contribute the code to scikit-learn.

As for numba, the code can run without it, but it would be much slower (I can dig up examples if needed).
There are only a couple of very simple functions that use numba; those could be implemented in Cython. Unfortunately, I'm not fluent enough with Cython to write those functions from scratch. Maybe someone could help with that?

zachmayer commented 2 years ago

@thomasjpfan just to be clear, you'd be open to a random hyperplane splitter, if it could be used for isolation forests, extra trees, and random forests?

Only splitting along an axis is a major limitation of tree-based algorithms, especially for anomaly detection. It'd be really nice to have this functionality in sklearn.

chanansh commented 1 year ago

Couldn't it be done with just a preprocessing step of random hyperplanes? In other words, isn't EIF just a pipeline of (random_projections, isolation_forest)? E.g. https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.FeatureHasher.html
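
A minimal sketch of that idea using scikit-learn's existing GaussianRandomProjection (the projection choice and parameters here are illustrative, not a validated equivalent of EIF):

```python
from sklearn.datasets import make_blobs
from sklearn.ensemble import IsolationForest
from sklearn.pipeline import make_pipeline
from sklearn.random_projection import GaussianRandomProjection

X, _ = make_blobs(n_samples=1000, n_features=10, centers=1, random_state=0)

# Project onto random (non-axis-aligned) directions first, so the
# downstream axis-aligned splits act on oblique directions of the
# original space; then fit a standard IsolationForest.
clf = make_pipeline(
    GaussianRandomProjection(n_components=10, random_state=0),
    IsolationForest(random_state=0),
)
clf.fit(X)
scores = clf.score_samples(X)  # higher = more normal
```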

zachmayer commented 1 year ago

I hadn’t thought of that but it seems possible!


adam2392 commented 1 year ago

Couldn't it be done with just a preprocessing step of random hyperplanes? In other words, isn't EIF just a pipeline of (random_projections, isolation_forest)? E.g. https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.FeatureHasher.html

This would require expanding the feature dimensionality up front, though, whereas performing the actual oblique splits during tree construction would not require computing/storing these extra projections a priori. It is analogous to one-hot-encoding to allow categorical splits in RFs, where handling the categorical data natively within the construction of the RF itself is better (e.g. see XGBoost, HistGradientBoosting in sklearn, LightGBM, CatBoost, etc.).

adam2392 commented 1 year ago

Only splitting along an axis is a major limitation of tree-based algorithms, especially for anomaly detection. It'd be really nice to have this functionality in sklearn.

@zachmayer and for those interested and following this thread: we implemented EIF in scikit-tree in this PR; see the PR for a quick summary of features.

Here is an example that reproduces various figures from the original paper and shows how to use the actual API.

ciurlaro commented 4 days ago

(Hi Gilles!!)

The paper seems rather new and may not satisfy our inclusion criteria yet.

It seems to follow the same idea as https://arxiv.org/abs/1506.03410 though

Implem-wise, I believe @MrAE has been working on it before and it turned out to be fairly tricky?

I would say that this is basically not true anymore (ref):

We only consider well-established algorithms for inclusion. A rule of thumb is at least 3 years since publication, 200+ citations, and wide use and usefulness. A technique that provides a clear-cut improvement (e.g. an enhanced data structure or a more efficient approximation technique) on a widely-used method will also be considered for inclusion.

What is the status today?

Wouldn't it be worth reconsidering this for implementation?

oulenz commented 3 days ago

Regarding performance, I'd just like to note that when we evaluated EIF in a semi-supervised setting (i.e. the training set contains only normal records, while the test set contains both normal records and anomalies), it didn't outperform the normal IF:

https://arxiv.org/abs/2101.11037 (Table 5) (Published in Pattern Recognition)

My suspicion is that the examples illustrated by the heat maps in the paper, while very suggestive, are also somewhat artificial and maybe not so representative of real-life datasets, in particular because real-life datasets generally have more than two dimensions, and records will typically not lie on one of the axes.

That said, the situation may be different for unsupervised anomaly detection (where there is no train/test set division), and I personally do think the proposal is conceptually interesting enough to include in scikit-learn.

adam2392 commented 3 days ago

I believe the underlying issue is that the decision tree code currently assumes axis-aligned splits, which is inherent in the design. EIF would require supporting oblique splits, which is not a trivial task.

Cross-ref: #20819