rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0

[FEA] Isolation Forest implementation with FIL inference capability #3838

Open tzemicheal opened 3 years ago

tzemicheal commented 3 years ago

Is your feature request related to a problem? Please describe. An implementation of IsolationForest (unsupervised, tree-based anomaly detection). scikit-learn has the following implementation: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html. An IsolationForest implementation could be built on the existing random forest/decision tree algorithms in RAPIDS and could take advantage of the fast inference.

Describe the solution you'd like This is also related to an earlier feature request for extraTreeRegression: https://github.com/rapidsai/cuml/issues/3063

Additional context It is one of the most widely used unsupervised anomaly detection algorithms in practice.

github-actions[bot] commented 2 years ago

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

hcho3 commented 2 years ago

With https://github.com/dmlc/treelite/pull/322, it might be possible to support isolation forests in FIL.

hcho3 commented 2 years ago

The remaining piece is to support an additional transformation mode in FIL: apply f(x) = exp2(-x / c) to the predicted scores, where c is provided by the Treelite model.
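To make the proposed transformation concrete, here is a minimal sketch in plain Python. It is not FIL code: the function name `exp2_transform` is hypothetical, and it assumes that x is the averaged output of the trees (the mean path length in an isolation forest) and that c is a positive constant read from the model.

```python
def exp2_transform(x: float, c: float) -> float:
    """Apply f(x) = 2 ** (-x / c) to a raw predicted score.

    x: raw score before transformation (assumed here to be the mean
       path length over all trees).
    c: positive normalizing constant (assumed to come from the model).
    """
    if c <= 0:
        raise ValueError("c must be positive")
    return 2.0 ** (-x / c)


# A shorter path (easier to isolate) yields a score closer to 1.
for raw in (0.0, 4.0, 8.0):
    print(raw, exp2_transform(raw, c=4.0))
```

With this mapping, a raw score of 0 gives 1, a raw score equal to c gives 0.5, and larger raw scores decay toward 0.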

tzemicheal commented 2 years ago

Hi @hcho3, I tested FIL using ForestInference.load_from_sklearn with a random forest, planning to then test an iForest trained in sklearn through the same load function. It looks like FIL produces an error for a random forest model trained with sklearn. Could this fix help iForest inference in FIL? Here are the error details:

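import sklearn.datasets
import sklearn.ensemble
import treelite
import treelite.sklearn
from cuml import ForestInference

# Train a small random forest regressor with scikit-learn
X, y = sklearn.datasets.load_boston(return_X_y=True)
clf = sklearn.ensemble.RandomForestRegressor(n_estimators=10)
clf.fit(X, y)

# Convert to Treelite and export the model as a shared library
model = treelite.sklearn.import_model(clf)
model.export_lib(toolchain="gcc", libpath='rf_forest.so', verbose=True)

# skl_model is given the library path (a str) here, not the fitted estimator
fil_model = ForestInference.load_from_sklearn(
    skl_model="rf_forest.so",
    algo='BATCH_TREE_REORG',
    output_class=False,
    threshold=0.50
)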

TreeliteError                             Traceback (most recent call last)
<ipython-input-56-9fd5f432c56d> in <module>
      3     algo='BATCH_TREE_REORG',
      4     output_class=False,
----> 5     threshold=0.50
      6 )

cuml/fil/fil.pyx in cuml.fil.fil.ForestInference.load_from_sklearn()

/opt/conda/envs/rapids/lib/python3.7/site-packages/treelite/sklearn/importer.py in import_model(sklearn_model)
    126         leaf_value_expected_shape = lambda node_count: (node_count, 1, sklearn_model.n_classes_)
    127     else:
--> 128         raise TreeliteError(f'Not supported model type: {sklearn_model.__class__.__name__}')
    129 
    130     if isinstance(sklearn_model,

TreeliteError: Not supported model type: str

hcho3 commented 2 years ago

We haven't gotten around to adding support for isolation forests in FIL yet, so the error is expected.

hcho3 commented 2 weeks ago

Update: the experimental version of FIL is now compatible with IsolationForest.

import numpy as np
import treelite
from sklearn.ensemble import IsolationForest
from cuml.experimental import ForestInference

n_samples, n_outliers = 120, 40
rng = np.random.RandomState(0)
covariance = np.array([[0.5, -0.1], [0.7, 0.4]])
cluster_1 = 0.4 * rng.randn(n_samples, 2) @ covariance + np.array([2, 2])  # general
cluster_2 = 0.3 * rng.randn(n_samples, 2) + np.array([-2, -2])  # spherical
outliers = rng.uniform(low=-4, high=4, size=(n_outliers, 2))

X = np.concatenate([cluster_1, cluster_2, outliers]).astype("float32")
y = np.concatenate(
    [np.ones((2 * n_samples), dtype=int), -np.ones((n_outliers), dtype=int)]
)

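clf = IsolationForest(max_samples=100, random_state=0)
clf.fit(X)

# score_samples returns the negated anomaly score, so flip the sign for comparison
expected_pred = -clf.score_samples(X).reshape((-1, 1))

# Load the fitted scikit-learn model directly into experimental FIL
fm = ForestInference.load_from_sklearn(clf, output_class=False)
out_pred = fm.predict(X)
np.testing.assert_almost_equal(out_pred, expected_pred, decimal=3)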

Note that currently FIL matches the output of score_samples, not decision_function.
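If decision_function-style values are needed, they can be derived from the FIL output. In scikit-learn, decision_function(X) equals score_samples(X) - offset_, and the example above compares the FIL output against -score_samples(X). A minimal sketch of that conversion follows; the helper name is hypothetical, and the default offset of -0.5 is what scikit-learn uses with contamination="auto" (otherwise pass clf.offset_ from the fitted model).

```python
def fil_score_to_decision_function(fil_score, offset=-0.5):
    """Convert a FIL isolation-forest score to a decision_function value.

    fil_score: positive anomaly score as returned by FIL in the example
               above (assumed equal to -score_samples); may be a float
               or a NumPy array.
    offset:    the fitted model's offset_ attribute; -0.5 is the
               scikit-learn value when contamination="auto".
    """
    # score_samples = -fil_score, and decision_function = score_samples - offset_
    return -fil_score - offset


# Negative results indicate outliers, as with sklearn's decision_function.
print(fil_score_to_decision_function(0.4))  # inlier side (positive)
print(fil_score_to_decision_function(0.6))  # outlier side (negative)
```

Because the function uses only negation and subtraction, it can be applied directly to the array returned by fm.predict(X).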