[FEA] scikit-learn based meta estimators

tfeher commented 4 years ago

What are the plans / guidelines of using scikit-learn meta estimators in combination with cuML algorithms?

Using input/output type configurability provides a great way to combine scikit-learn meta estimators with cuML algorithms. One just needs to set the input and output type to numpy, and one can already use existing algorithms from scikit-learn.

Concrete examples:

Model selection methods (like GridSearchCV), works with numpy input & output.
Ensemble methods (AdaBoostRegressor, VotingClassifier) as advertised in the blog.

Meta estimators within cuML Some ML algorithms requires us to use meta estimators under the hood of cuML:

In SVC predic_proba we use sklearn's CalibratedClassifierCV (with numpy arrays).
To implement multi class SVC, we could use sklearn.multiclass meta estimators.

Pros of using sklearn as it is:

Works out of the box if we provide numpy input and configure the output type to be the same.
For SVC, it is very fast to implement missing features this way.
The performance is also fine for SVC (memory copies are insignificant compared to the O(n_rows^2)-O(n_rows^3) training cost).

Cons:

Would be nicer to stay with the data on device.
For models other than SVC this might matter for performance.
Conflicts with the following issue: Remove direct imports from sklearn #2467.

Questions:

Short term: is it ok to go forward with multiclass SVC by using sklearn.multiclass (numpy input), or is there a strong objection adding more direct imports from sklearn?
On the medium/long run, how do we plan to support device arrays with these meta estimators? One can think of a solution analogous to the sklearn-based preprocessing PR #2645.

JohnZed commented 4 years ago

The blog is a good example that we can get good performance with this approach. (@beckernick probably has additional thoughts on challenges he saw along the way)

For the SVC example, I think that starting with the SVC meta estimator seems like a good approach to me, as long as it's getting a strong speedup - I think it should be an empirical question. Agreed that it seems likely to be a minor consideration at most.

My only concern is that users should be able to pass in cudf and gpuarray/cuPy-style data seamlessly. Can those arrays be passed in now or will they generate an error with the meta estimator? If they are generating an error currently, then we may need to add a wrapper to allow these datatypes to be used here too.

tfeher commented 4 years ago

The sklearn meta estimators require that our models have numpy output arrays. Many of them also need the input as numpy array. (At least those that I have tested). For SVC I needed to wrap the meta estimator into type conversion statements to ensure that the input/output array types work as expected.

beckernick commented 4 years ago

Nowadays, we're in a pretty good place compatibility wise and most of the challenges have been resolved 😄 .

I agree with both of you and think that in the short term it's worth relying on input/output type configurability and paying the transfer costs.

Inputs that go through the scikit-learn code path for validating data will usually (but not always) hit a call to np.asarray, forcing CPU inputs (and thus outputs for chained calls). Adoption of NEP-35 and/or related NEPs would likely help resolve that problem, but I'm not aware of how much work that would entail in the scikit-learn codebase or how much benefit it would ultimately provide out of the box.

The performance is also fine for SVC (memory copies are insignificant compared to the O(n_rows^2)-O(n_rows^3) training cost

This is generally consistent with what I've seen for other estimators as well. For the complex models, its easily negated by the time taken by the estimator fit/predict calls. And for the simpler models, this still only adds a small amount of absolute time.

On the medium/long run, how do we plan to support device arrays with these meta estimators? One can think of a solution analogous to the sklearn-based preprocessing PR #2645.

In the medium/longer term, I think we should consider building our own with concurrent streams in mind. I think perhaps the larger value-add of creating our own functions for meta-estimators and cross-validators is not necessarily from eliminating D/H transfers but from enabling overlapping various fit/predict kernels across streams to maximize utilization. Today, when we go through a scikit-learn meta-estimator or cross-validator each call to fit or predict is blocking. Since many meta-estimators and much of HPO is embarrassingly parallel, if estimators don't fully utilize the GPU we can pay a penalty compared to distributing across 20-30 CPU cores if the dataset is not large enough.

I suspect with concurrent streams these would still be blocking in the scikit-learn cross-validator/meta-estimator world, but potentially non-blocking in a future cuML version. Keeping the GPU at peak utilization would be immensely valuable.

github-actions[bot] commented 3 years ago

This issue has been marked rotten due to no recent activity in the past 90d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

rapidsai / cuml

[FEA] scikit-learn based meta estimators #2876