rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0

[FEA] Isolation Forest Training support in cuML #6096

Open singhmanas1 opened 1 month ago

singhmanas1 commented 1 month ago

Is your feature request related to a problem? Please describe. Isolation Forest (IF) is a popular unsupervised anomaly detection method that is often used to identify fraud. For example, banks and retail companies use IF to detect zero-day threats, i.e., new threat patterns that supervised algorithms such as XGBoost and GNNs fail to detect because of class imbalance or other issues.

While cuML supports inference on scikit-learn's IF models via the Forest Inference Library (an experimental feature; see Issue #3838), it would be great to have IF model training implemented in cuML as well, similar to the Isolation Forest implementation in scikit-learn.

Describe the solution you'd like Something like the following:

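# Proposed API: cuml.ensemble.IsolationForest does not exist yet;
# it is modeled on sklearn.ensemble.IsolationForest.
from cuml.ensemble import IsolationForest
X = [[-1.1], [0.3], [0.5], [100]]
clf = IsolationForest(random_state=0).fit(X)
clf.predict([[0.1], [0], [90]])  # as in scikit-learn: 1 = inlier, -1 = outlier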

Implementation Details The following needs to be implemented and tested in cuML to enable IF:

  1. Random splitting of decision tree nodes while building the trees, via NodeSplitKernel.
  2. Path-length calculation for detecting anomalies, similar to the scikit-learn implementation.
  3. Testing whether quantizing the data into bins affects the performance of IsolationForest.
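
To make items 1 and 2 concrete, here is a minimal CPU-only sketch in plain Python of random node splitting and path-length scoring. It is not cuML code and does not model NodeSplitKernel or binning. All function names (`average_path_length`, `build_tree`, `path_length`, `fit_forest`, `anomaly_score`) are hypothetical, and the defaults of 100 trees and a subsample of 256 are assumptions taken from the usual Isolation Forest setup. The normalizer c(n) uses the same convention as scikit-learn (0 for n <= 1, 1 for n = 2).

```python
import math
import random

EULER_GAMMA = 0.5772156649015329


def average_path_length(n):
    """c(n): expected path length of an unsuccessful BST search over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n


def build_tree(rows, rng, max_depth, depth=0):
    """Grow one isolation tree using a random feature and a random threshold."""
    if depth >= max_depth or len(rows) <= 1:
        return {"size": len(rows)}  # leaf: remember how many points ended up here
    feature = rng.randrange(len(rows[0]))
    lo = min(r[feature] for r in rows)
    hi = max(r[feature] for r in rows)
    if lo == hi:  # chosen feature is constant here, so stop splitting
        return {"size": len(rows)}
    threshold = rng.uniform(lo, hi)
    left = [r for r in rows if r[feature] < threshold]
    right = [r for r in rows if r[feature] >= threshold]
    return {
        "feature": feature,
        "threshold": threshold,
        "left": build_tree(left, rng, max_depth, depth + 1),
        "right": build_tree(right, rng, max_depth, depth + 1),
    }


def path_length(tree, x, depth=0):
    """Depth at which x lands, plus c(leaf size) for the unbuilt subtree."""
    if "size" in tree:
        return depth + average_path_length(tree["size"])
    go_left = x[tree["feature"]] < tree["threshold"]
    child = tree["left"] if go_left else tree["right"]
    return path_length(child, x, depth + 1)


def fit_forest(rows, n_trees=100, sample_size=256, seed=0):
    """Build n_trees isolation trees, each on a random subsample of rows."""
    if len(rows) < 2:
        raise ValueError("need at least two rows")
    rng = random.Random(seed)
    sample_size = min(sample_size, len(rows))
    max_depth = math.ceil(math.log2(sample_size))
    trees = [
        build_tree(rng.sample(rows, sample_size), rng, max_depth)
        for _ in range(n_trees)
    ]
    return trees, sample_size


def anomaly_score(forest, x):
    """Score in (0, 1]; values closer to 1 mean x is isolated in fewer splits."""
    trees, sample_size = forest
    mean_depth = sum(path_length(t, x) for t in trees) / len(trees)
    return 2.0 ** (-mean_depth / average_path_length(sample_size))


X = [[-1.1], [0.3], [0.5], [100]]
forest = fit_forest(X)
for point in ([0.1], [0], [90]):
    print(point, round(anomaly_score(forest, point), 3))
```

On the toy data from the request, the point near 100 should receive a noticeably higher score than the points near 0, since a single random split usually isolates it. A GPU implementation would presumably batch the split selection and the traversal, which this sketch makes no attempt to do.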

@vinaydes @dantegd @beckernick @hcho3