rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0

[FEA] Feature Importance/Selection LOFO + RAPIDS #2530

Open aerdem4 opened 4 years ago

aerdem4 commented 4 years ago

Is your feature request related to a problem? Please describe.

I currently maintain the LOFO Importance package, a model-agnostic, validation-scheme-dependent feature importance tool. It can serve as an initial step toward automated feature selection. It currently uses sklearn and pandas. I would like it to support cuML, but two important sklearn functions are missing:

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.make_scorer.html
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html

In addition, while most of the important metrics are available in cuML, some are still missing compared to sklearn.

Describe the solution you'd like

I would like make_scorer and cross_validate to be available in cuML so that I can start supporting RAPIDS. Since this is a costly feature importance algorithm, running it on RAPIDS would make it much more feasible. Later, it could even become part of RAPIDS if you like.
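To illustrate why these two functions matter here, the following is a minimal CPU-only sketch of the leave-one-feature-out idea built on sklearn's make_scorer and cross_validate. It is not the LOFO Importance package's actual implementation: the helper name lofo_importance, the synthetic data, and the choice of LinearRegression are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer, mean_absolute_error
from sklearn.model_selection import KFold, cross_validate


def lofo_importance(model, X, y, scorer, cv):
    """Hypothetical helper: baseline CV score minus the CV score
    obtained after dropping each column in turn."""
    base = cross_validate(model, X, y, scoring=scorer, cv=cv)["test_score"].mean()
    importances = []
    for j in range(X.shape[1]):
        X_without_j = np.delete(X, j, axis=1)  # leave feature j out
        scores = cross_validate(model, X_without_j, y, scoring=scorer, cv=cv)
        importances.append(base - scores["test_score"].mean())
    return np.array(importances)


# Synthetic data (an assumption for this sketch): only column 0 drives the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=500)

# greater_is_better=False makes the scorer return the negated error,
# so a larger score is always better.
scorer = make_scorer(mean_absolute_error, greater_is_better=False)
cv = KFold(n_splits=4, shuffle=True, random_state=0)

importances = lofo_importance(LinearRegression(), X, y, scorer, cv)
print(importances)
```

Each feature costs one full cross-validation run, which is why the algorithm is expensive and why GPU-backed estimators would help.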

teju85 commented 4 years ago

I think we could support make_scorer with a UDF approach similar to the one currently supported in cuDF. But this probably will not happen until the 0.16 or 0.17 timeframe (tagging @dantegd and @JohnZed, as per our offline discussion on this).

beckernick commented 4 years ago

@aerdem4, given the recent cuML API enhancements in 0.15, it looks like at least the basic cross_validate might now "just work" for cuML estimators with either CPU or GPU data.

EDIT: It does.

Note that the following is a "hot" run to avoid the JIT overhead in the %time calls.

import cuml
import cupy as cp
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_validate

NFEATURES = 100

X, y = make_regression(
    n_samples=200000,
    n_features=NFEATURES,
    n_informative=NFEATURES,
    random_state=12,
    noise=200,
)

# Copy the data to the GPU and wait for the transfer to finish before timing.
gX, gy = cp.asarray(X), cp.asarray(y)
cp.cuda.Device().synchronize()

lasso_cpu = Lasso()
lasso_gpu = cuml.linear_model.Lasso()

# IPython magics: the first line times the CPU run, the second the GPU run.
%time cv_results = cross_validate(lasso_cpu, X, y, cv=3)
%time cv_results = cross_validate(lasso_gpu, gX, gy, cv=3)

CPU times: user 21.8 s, sys: 559 ms, total: 22.4 s
Wall time: 704 ms
CPU times: user 4.53 s, sys: 135 ms, total: 4.66 s
Wall time: 97.2 ms
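Outside IPython, where %time is not available, the same "hot" measurement can be approximated with the standard library. The sketch below is only an illustration: time_hot and its sync and warmup parameters are hypothetical names, and the commented-out usage assumes the lasso_gpu, gX, and gy objects from the example above.

```python
import time


def time_hot(fn, sync=None, warmup=1):
    """Call fn untimed `warmup` times, then return (result, seconds) for one timed call.

    sync is an optional zero-argument callable that blocks until pending
    asynchronous work (for example GPU kernels) has finished.
    """
    for _ in range(warmup):
        fn()  # warm-up run absorbs JIT compilation and other one-time costs
    if sync is not None:
        sync()  # make sure warm-up work is not counted in the timed call
    start = time.perf_counter()
    result = fn()
    if sync is not None:
        sync()  # wait for asynchronous work before stopping the clock
    return result, time.perf_counter() - start


# Hypothetical GPU usage with the objects from the example above:
# cv_results, seconds = time_hot(
#     lambda: cross_validate(lasso_gpu, gX, gy, cv=3),
#     sync=lambda: cp.cuda.Device().synchronize(),
# )

# Small CPU-only demonstration.
result, seconds = time_hot(lambda: sum(range(1000)))
print(result, f"{seconds:.6f} s")
```

Note that this reports wall time only, whereas %time also reports user and system CPU time.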

EDIT: Adding another example:

import cuml
import cupy as cp
from sklearn.datasets import make_regression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_validate

NFEATURES = 20

X, y = make_regression(
    n_samples=100000,
    n_features=NFEATURES,
    n_informative=NFEATURES,
    random_state=12,
    noise=200,
)

X = X.astype("float32")
y = y.astype("float32")

# Copy the data to the GPU and wait for the transfer to finish before timing.
gX, gy = cp.asarray(X), cp.asarray(y)
cp.cuda.Device().synchronize()

clf_cpu = KNeighborsRegressor(n_jobs=-1)
clf_gpu = cuml.neighbors.KNeighborsRegressor()

# IPython magics: the first line times the CPU run, the second the GPU run.
%time cv_results = cross_validate(clf_cpu, X, y, cv=3)
%time cv_results = cross_validate(clf_gpu, gX, gy, cv=3)

CPU times: user 8.16 s, sys: 345 ms, total: 8.5 s
Wall time: 7.55 s
CPU times: user 97.1 ms, sys: 420 ms, total: 517 ms
Wall time: 517 ms

aerdem4 commented 4 years ago

@beckernick Nice! It now works out of the box with pandas DataFrames. I am still running into issues when I try to run it on cuDF DataFrames; the scorer throws an error: ValueError: object __array__ method not producing an array, because __array__ produces cupy.core.core.ndarray.
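One possible direction, offered only as an untested sketch and not as a confirmed fix for the error above, is to wrap the metric so that both arrays are copied to host memory before sklearn sees them. The names to_host and host_metric are hypothetical. The sketch assumes that device arrays expose get() (as CuPy ndarrays do) and that DataFrame or Series objects expose to_numpy() (as pandas and recent cuDF versions do). FakeDeviceArray and mae exist only so the example runs without a GPU.

```python
import functools

import numpy as np


def to_host(a):
    """Best-effort copy of an array-like to a NumPy array (hypothetical helper)."""
    if isinstance(a, np.ndarray):
        return a
    if hasattr(a, "to_numpy"):  # pandas objects; assumed to cover recent cuDF objects
        return a.to_numpy()
    if hasattr(a, "get"):  # CuPy ndarray: explicit device-to-host copy
        return a.get()
    return np.asarray(a)


def host_metric(metric):
    """Wrap metric(y_true, y_pred, **kwargs) so both inputs are moved to the host first."""

    @functools.wraps(metric)
    def wrapped(y_true, y_pred, **kwargs):
        return metric(to_host(y_true), to_host(y_pred), **kwargs)

    return wrapped


class FakeDeviceArray:
    """Stand-in for a device array, used only to exercise the wrapper without a GPU."""

    def __init__(self, data):
        self._data = np.asarray(data, dtype=float)

    def get(self):
        return self._data


def mae(y_true, y_pred):
    """Plain NumPy mean absolute error, used here in place of an sklearn metric."""
    return float(np.mean(np.abs(y_true - y_pred)))


score = host_metric(mae)(FakeDeviceArray([1.0, 2.0, 3.0]), np.array([1.0, 2.0, 5.0]))
print(score)
```

With sklearn, the wrapped metric would be passed on as make_scorer(host_metric(mean_absolute_error), greater_is_better=False). Whether that avoids the ValueError depends on where in the scoring path the array conversion is attempted, which has not been verified against cuDF here.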

So this works smoothly with pandas DataFrames:

import cuml
from lofo import Dataset, LOFOImportance
from sklearn.metrics import make_scorer, mean_absolute_error
from sklearn.model_selection import KFold

# df is a pandas DataFrame with feature columns A, B, C, D and a "target" column.
scorer = make_scorer(mean_absolute_error, greater_is_better=False)
cv = KFold(n_splits=4, shuffle=True, random_state=0)

dataset = Dataset(df=df, target="target", features=["A", "B", "C", "D"])
fi = LOFOImportance(dataset, scoring=scorer, model=cuml.LinearRegression(), cv=cv)

importances = fi.get_importance()
importances

github-actions[bot] commented 3 years ago

This issue has been marked stale due to no recent activity in the past 30d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be marked rotten if there is no activity in the next 60d.

github-actions[bot] commented 3 years ago

This issue has been marked rotten due to no recent activity in the past 90d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.