rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0

[BUG] LinearSVC predict returns float rather than integer data (breaking scikit-learn's VotingClassifier) #5946

Open cbilot opened 4 months ago

cbilot commented 4 months ago

Similar to what is described in #5637, LinearSVC.predict returns a float64 array rather than an integer one, which breaks compatibility with third-party tools that assume classifier predictions are integers and operate on them accordingly.

import cuml
from sklearn.datasets import make_classification 

X, y = make_classification()

clf = cuml.ensemble.RandomForestClassifier().fit(X, y)
print(clf.predict(X[:5]).dtype)

clf = cuml.linear_model.LogisticRegression().fit(X, y)
print(clf.predict(X[:5]).dtype)

clf = cuml.svm.SVC().fit(X, y)
print(clf.predict(X[:5]).dtype)

clf = cuml.neighbors.KNeighborsClassifier().fit(X, y)
print(clf.predict(X[:5]).dtype)

clf = cuml.svm.LinearSVC().fit(X, y)
print(clf.predict(X[:5]).dtype)

Output:

int64
int64
int64
int64
float64
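
A small check along these lines could flag the inconsistency in a test suite. This is only a sketch: `has_integer_labels` is a hypothetical helper, not part of cuML or scikit-learn, and the arrays stand in for real `predict` output.

```python
import numpy as np


def has_integer_labels(preds) -> bool:
    """Return True if the predictions use an integer dtype."""
    return bool(np.issubdtype(np.asarray(preds).dtype, np.integer))


# Stand-ins for predict() output (assumed values, not real cuML results).
int_preds = np.array([0, 1, 1, 0, 1])
float_preds = np.array([0.0, 1.0, 1.0, 0.0, 1.0])

print(has_integer_labels(int_preds))
print(has_integer_labels(float_preds))
```

Running the same check over each classifier's `predict` output would single out LinearSVC in the listing above.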

For example, with scikit-learn's VotingClassifier, we get the following error:

from sklearn.ensemble import VotingClassifier
v_clf = VotingClassifier(
    estimators=[
        ("svc", cuml.svm.SVC()),
        ("linear_svc", cuml.svm.LinearSVC()),
    ],
)
v_clf.fit(X, y)
v_clf.predict(X)

>>> v_clf.predict(X)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/corey/.virtualenvs/wine-cultivator/lib/python3.11/site-packages/sklearn/ensemble/_voting.py", line 444, in predict
    maj = np.apply_along_axis(
          ^^^^^^^^^^^^^^^^^^^^
  File "/home/corey/.virtualenvs/wine-cultivator/lib/python3.11/site-packages/numpy/lib/shape_base.py", line 379, in apply_along_axis
    res = asanyarray(func1d(inarr_view[ind0], *args, **kwargs))
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/corey/.virtualenvs/wine-cultivator/lib/python3.11/site-packages/sklearn/ensemble/_voting.py", line 445, in <lambda>
    lambda x: np.argmax(np.bincount(x, weights=self._weights_not_none)),
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: Cannot cast array data from dtype('float64') to dtype('int64') according to the rule 'safe'
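
The failure can be reproduced with NumPy alone, which may help when narrowing it down. The sketch below assumes hard voting stacks the per-estimator predictions into one array and calls `np.bincount` on each row, as the traceback suggests. The prediction values are made up, and `majority_vote` and `as_integer_labels` are hypothetical helpers rather than scikit-learn or cuML API.

```python
import numpy as np

# Made-up per-estimator predictions for three samples (not real cuML output).
svc_preds = np.array([0, 1, 1], dtype=np.int64)
linear_svc_preds = np.array([0.0, 1.0, 0.0], dtype=np.float64)

# Stacking mixed dtypes upcasts the whole array to float64, so a single
# float-returning estimator is enough to affect every column.
stacked = np.asarray([svc_preds, linear_svc_preds]).T
print(stacked.dtype)


def majority_vote(row):
    """Mimic the per-sample lambda from the traceback (without weights)."""
    return int(np.argmax(np.bincount(row)))


# np.bincount refuses float input under the 'safe' casting rule.
try:
    majority_vote(stacked[0])
    raised = False
except TypeError as exc:
    raised = True
    print(f"TypeError: {exc}")


def as_integer_labels(preds):
    """Cast whole-number float labels to int64; reject anything else."""
    preds = np.asarray(preds)
    if np.issubdtype(preds.dtype, np.integer):
        return preds
    rounded = np.rint(preds)
    if not np.array_equal(rounded, preds):
        raise ValueError("predictions are not whole-number labels")
    return rounded.astype(np.int64)


# After an explicit cast the vote goes through.
votes = [majority_vote(row) for row in as_integer_labels(stacked)]
print(votes)
```

Until a fix lands, casting the LinearSVC output this way before handing it to other tools is one possible workaround. It assumes the labels are whole numbers to begin with.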

Environment Details:

dantegd commented 4 months ago

Thanks for the issue, @cbilot. This is a very good catch, and we will work on a fix for the upcoming 24.08 release.