rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0

[FEA] Support using cuML estimators inside sklearn meta-estimators #2401

Closed beckernick closed 4 years ago

beckernick commented 4 years ago

Meta-estimators allow users to combine multiple models into a single model for potentially enhanced predictive power. A minimal example of this concept is a voting classifier, in which the predictions of one or more models are collected and a vote is taken over the predicted labels to decide which label to assign. Another powerful approach is stacking, in which the predictions of one or more models are fed into another model that makes the final prediction. Scikit-learn provides an API for this as well.

While we will likely want to implement analogous APIs in cuML to support end-to-end GPU inputs, the recent work standardizing estimators on an input/output type contract makes it almost possible to use scikit-learn meta-estimators with cuML on CPU arrays. Essentially, this amounts to drop-in replacing sklearn estimators with cuML estimators in the meta-estimator constructor. @dantegd and I have explored this a bit locally, and we are seeing large training speedups.
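The drop-in replacement works because sklearn meta-estimators rely only on the estimator interface (`get_params`/`fit`/`predict`), not on the library the estimator comes from. A minimal sketch of that mechanism, using a hypothetical `MajorityClassifier` as a CPU stand-in for a cuML estimator (which would require a GPU to demonstrate directly):

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

class MajorityClassifier(BaseEstimator, ClassifierMixin):
    """Toy stand-in for a cuML estimator: always predicts the majority class.

    Any object exposing fit/predict (plus get_params/set_params from
    BaseEstimator) and a fitted classes_ attribute can be dropped into a
    sklearn meta-estimator.
    """
    def fit(self, X, y):
        self.classes_, counts = np.unique(y, return_counts=True)
        self.majority_ = self.classes_[np.argmax(counts)]
        return self

    def predict(self, X):
        return np.full(len(X), self.majority_)

X = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1]])
y = np.array([1, 1, 2, 2])

eclf = VotingClassifier(
    estimators=[('lr', LogisticRegression()),
                ('gnb', GaussianNB()),
                ('maj', MajorityClassifier())],
    voting='hard',
).fit(X, y)
print(eclf.predict(X))  # [1 1 2 2]
```

Replacing `MajorityClassifier` with, e.g., `cuml.linear_model.LogisticRegression` is the same exercise, provided the cuML estimator honors this interface contract.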

The following issues cover the necessary changes that would allow using cuML models with VotingClassifier and StackingClassifier, with others potentially possible as well.

Scikit-learn examples:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier, VotingClassifier

clf1 = LogisticRegression(multi_class='multinomial', random_state=1)
clf2 = RandomForestClassifier(n_estimators=50, random_state=1)
clf3 = GaussianNB()

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
y = np.array([1, 1, 1, 2, 2, 2])

eclf1 = VotingClassifier(estimators=[
    ('lr', clf1), ('rf', clf2), ('gnb', clf3)], voting='hard')
eclf1 = eclf1.fit(X, y)
print(eclf1.predict(X))
# [1 1 1 2 2 2]
```
```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

X, y = load_iris(return_X_y=True)
estimators = [
    ('rf', RandomForestClassifier(n_estimators=10, random_state=42)),
    ('svr', make_pipeline(StandardScaler(),
                          LinearSVC(random_state=42)))
]
clf = StackingClassifier(
    estimators=estimators, final_estimator=LogisticRegression()
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)
print(clf.fit(X_train, y_train).score(X_test, y_test))
# 0.9473684210526315
```
beckernick commented 4 years ago

Small oversight in my initial scoping: the original list of issues will only unlock some of the meta-estimators. We'll need to add a `self.classes_` attribute to unlock the remaining classification meta-estimators (it's not needed for regression meta-estimators). However, AdaBoostClassifier will only be unlocked estimator by estimator, as each cuML base estimator must support sample weights.

Filing an issue for the `classes_` attribute.
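For context, `classes_` is the fitted attribute that sklearn's classification meta-estimators read from each base estimator (e.g., soft voting and stacking use it to align per-class probability columns). Illustrated here with a plain sklearn estimator, since that is the behavior cuML estimators would need to match:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[-1.0], [-2.0], [1.0], [2.0]])
y = np.array([0, 0, 1, 1])

clf = GaussianNB().fit(X, y)
# Meta-estimators introspect this after fit; cuML estimators need the same.
print(clf.classes_)  # [0 1]
```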

beckernick commented 4 years ago

With the merging of #2487, we now essentially support these meta-estimators. AdaBoostClassifier support will come estimator by estimator as they gain sample-weight functionality.

Closing.