Open groud opened 4 years ago
(Hi Gilles!!)
The paper seems rather new and may not satisfy our inclusion criteria yet.
It seems to follow the same idea as https://arxiv.org/abs/1506.03410 though
Implem-wise, I believe @MrAE has been working on it before and it turned out to be fairly tricky?
Hi Nicolas, nice seeing you there! :)
The paper seems rather new and may not satisfy our inclusion criteria yet.
Yeah, I was not sure about that. I wanted to propose it nevertheless since it looked to me like a rather "small" addition, and because it solves a problematic bias in the algorithm (more than it brings a new feature). But I agree the paper may be too recent to be accepted; I guess it depends on your policy regarding such changes.
It seems to follow the same idea as https://arxiv.org/abs/1506.03410 though
Yes, it seems so.
Considering the problematic bias, maybe it would be better to have the "extended" algorithm replace the original, unless you can conceive a situation where you'd still want the original. The extended algorithm looks consistently better in the paper.
It's worth noting there's already a python package for extended isolation forests: https://github.com/sahandha/eif
I think this idea could also be applied to the ExtraTreesClassifier and ExtraTreesRegressor, for an extra degree of randomness.
Hi there, what is the latest on this? I am trying to use the library at https://github.com/sahandha/eif, but there are a couple of issues with it, particularly in pickling the object. A python/numba implementation has already been contributed via a pull request but seems to be unanswered: https://github.com/sahandha/eif/pull/24. What more do we need? I want to add that this algorithm is probably one of the best there is for anomaly detection on correlated machine data. It really beats the non-extended isolation forest by a lot, and I have had huge success with it in industry. So... in my opinion very worthwhile adding :).
@psmgeelen If you would like to have another python implementation, there is also Extended Isolation Forest by H2O-3 open-source: https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/eif.html
Any feedback will be appreciated.
Hi @valenad1, I am using the eif_new.py from the pull request, which is extremely fast (built with numba). Would this be interesting to add to scikit-learn?
Currently, it would be difficult to introduce numba as a dependency to scikit-learn.
As for extended isolation forest, as mentioned in https://github.com/scikit-learn/scikit-learn/issues/16517#issuecomment-589829247 the paper does not meet scikit-learn's inclusion criterion.
Recently there has been some interest in using non-axis-aligned hyperplanes for trees. Taken together, maybe there is some merit for inclusion.
REF: https://github.com/scikit-learn/scikit-learn/issues/20819
I'm behind eif_new.py. I'm glad there is an interest and I'd be more than happy to contribute the code to scikit-learn.
As for numba, the code can run without it, but it would be much slower (I can dig up examples if needed). There are only a couple of very simple functions that use numba. Those could be implemented in Cython. Unfortunately, I'm not fluent enough with Cython to write those functions from scratch. Maybe someone could help with that?
@thomasjpfan just to be clear, you'd be open to a random hyperplane splitter, if it could be used for isolation forests, extra trees, and random forests?
Only splitting along an axis is a major limitation of tree-based algorithms, especially for anomaly detection. It'd be really nice to have this functionality in sklearn.
Can't it be done with just a preprocessing step of hyperplanes? That is, isn't EIF just a pipeline of (random_projections, isolation_forest)? E.g. https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.FeatureHasher.html
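For illustration, a rough sketch of what that pipeline could look like (GaussianRandomProjection as the projection step is just one possible choice, not something the EIF paper prescribes):

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.pipeline import make_pipeline
from sklearn.random_projection import GaussianRandomProjection

rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 5))  # toy data standing in for real records

# Project the features onto random, non-axis-aligned directions, then run a
# plain (axis-aligned) isolation forest on the projected features.
model = make_pipeline(
    GaussianRandomProjection(n_components=5, random_state=0),
    IsolationForest(random_state=0),
)
model.fit(X)
scores = model.score_samples(X)  # higher means more normal, lower means more anomalous
```

One caveat: this draws a single global projection shared by every tree and every node, whereas the extended algorithm draws a fresh random hyperplane at each split, so the two are not equivalent.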
I hadn’t thought of that but it seems possible!
Can't it be done with just a preprocessing step of hyperplanes? That is, isn't EIF just a pipeline of (random_projections, isolation_forest)? E.g. https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.FeatureHasher.html
This would, however, require expanding the feature dimensionality, whereas performing the actual oblique splits during tree construction would not require computing/storing these extra projections a priori. It is analogous to using one-hot encoding to allow categorical splits in RFs, when handling the categorical data natively within the construction of the RF itself would be better (e.g. see XGBoost, HistGradientBoosting in sklearn, LightGBM, CatBoost, etc.).
Only splitting along an axis is a major limitation of tree-based algorithms, especially for anomaly detection. It'd be really nice to have this functionality in sklearn.
@zachmayer and for those interested and following this thread: we implemented EIF in scikit-tree in this PR. Just a quick summary of features:
feature_combinations parameter
Here is an example that reproduces various figures from the original paper and shows how to use the actual API.
(Hi Gilles!!)
The paper seems rather new and may not satisfy our inclusion criteria yet.
It seems to follow the same idea as https://arxiv.org/abs/1506.03410 though
Implem-wise, I believe @MrAE has been working on it before and it turned out to be fairly tricky?
I would say that this is basically not true anymore (ref):
We only consider well-established algorithms for inclusion. A rule of thumb is at least 3 years since publication, 200+ citations, and wide use and usefulness. A technique that provides a clear-cut improvement (e.g. an enhanced data structure or a more efficient approximation technique) on a widely-used method will also be considered for inclusion.
What is the status as of today?
Wouldn't it be worth reconsidering its implementation?
Regarding performance, I'd just like to note that when we evaluated EIF in a semi-supervised setting (i.e. training set contains only normal records, test set contains both normal records and anomalies), it didn't outperform the normal IF:
https://arxiv.org/abs/2101.11037 (Table 5) (Published in Pattern Recognition)
My suspicion is that the examples illustrated by the heat maps in the paper, while very suggestive, are also somewhat artificial and maybe not so representative of real-life datasets, in particular because real-life datasets generally have more than two dimensions, and records will typically not lie on one of the axes.
That said, the situation may be different for unsupervised anomaly detection (where there is no train/test set division), and I personally do think the proposal is conceptually interesting enough to include in scikit-learn.
I believe the underlying issue is that the decision tree code currently assumes axis-aligned splits, which is inherent in the design. EIF would require supporting oblique splits, which is not a trivial task.
Cross-ref: #20819
Describe the workflow you want to enable
In the context of anomaly detection, the Isolation Forest algorithm has a bias that makes some data points' anomaly scores lower than they actually should be. This problem arises for areas of the space that are axis-aligned with a cluster. Imagine a point very far from a cluster: the basic Isolation Forest algorithm may assign it a lower anomaly score only because the point is axis-aligned with the cluster. This leads to false negatives in my application.
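Roughly, the situation can be probed like this (a toy sketch; the exact scores depend on the data and seed, but per the paper the axis-aligned probe tends to look less anomalous than the off-axis one):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X = rng.normal(scale=0.5, size=(1000, 2))  # a single 2-D cluster at the origin

clf = IsolationForest(random_state=42).fit(X)

# Two probe points at the same distance from the cluster: one on an axis
# through the cluster, one on the diagonal.
probes = np.array([
    [6.0, 0.0],                            # axis-aligned with the cluster
    [6.0 / np.sqrt(2), 6.0 / np.sqrt(2)],  # same distance, off-axis
])
print(clf.score_samples(probes))  # higher score = judged more normal
```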
To overcome this bias, Hariri et al. proposed the Extended Isolation Forest algorithm. While the standard Isolation Forest algorithm randomly chooses a feature and a threshold value to split the points, the extended version uses a random hyperplane to do so. As those random hyperplanes need not be axis-aligned, they remove the bias of the standard algorithm. In the end, the standard algorithm is a special case of the extended one, restricted to axis-aligned hyperplanes.
Please have a look at the original paper; it explains the problem very well.
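To make the split itself concrete, here is a minimal NumPy sketch of what the extended algorithm does at each node, following the paper's description (the function and variable names are mine):

```python
import numpy as np

def random_hyperplane_split(X, rng):
    # Draw a random slope (one coefficient per feature) and a random
    # intercept point inside the bounding box of the data, as described
    # in the paper, then separate the samples by which side of the
    # resulting hyperplane they fall on.
    normal = rng.normal(size=X.shape[1])
    intercept = rng.uniform(X.min(axis=0), X.max(axis=0))
    left = (X - intercept) @ normal <= 0
    return X[left], X[~left]

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 3))
X_left, X_right = random_hyperplane_split(X, rng)
```

Zeroing out all but one coefficient of the slope recovers the usual axis-aligned split, which is the sense in which the standard algorithm is a special case of the extended one.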
Describe your proposed solution
I had a look at the Isolation Forest code. In my humble opinion, the simplest solution might be to add an argument to the IsolationForest class constructor to choose how the samples should be split in two, maybe something like an extended argument. This would basically modify the splitter argument passed to the underlying base_estimator (an ExtraTreeRegressor instance). We could add a "random_hyperplane" splitting mode, which requires implementing a new Splitter class.
Globally, this is not a lot of change beyond adding the new splitter class. I can work on an implementation if we agree this is a useful addition.
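As a sketch of how that could look from the user's side (the extended argument and the "random_hyperplane" splitter are the hypothetical additions proposed here, not an existing scikit-learn API):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X = rng.normal(size=(256, 2))  # toy data

# Today: IsolationForest only performs axis-aligned random splits.
clf = IsolationForest(n_estimators=100, random_state=0).fit(X)

# Proposed (hypothetical, not an existing argument): `extended=True` would
# make the underlying ExtraTreeRegressor use a new "random_hyperplane"
# splitter instead of the axis-aligned "random" one.
# clf = IsolationForest(n_estimators=100, extended=True, random_state=0).fit(X)
```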
Describe alternatives you've considered, if relevant
Maybe the Extended Isolation Forest algorithm should be a distinct class, but I doubt it is worth it.