VotingClassifier estimators_ are based on transformed labels

ageron commented 5 years ago

Description

The VotingClassifier transforms the labels before training the sub-estimators, so if you try to use them directly for predictions or scoring, you get unexpected results: they predict the encoded integer labels rather than the original ones. IMHO, either this should be fixed (though I'm guessing it's a performance optimization) or the documentation should at least warn about it.

Steps/Code to Reproduce

from __future__ import print_function
import numpy as np
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X_train = np.array([[1., 2.], [3., 4.], [4., 3.], [2., 1.]])
y_train = np.array(['a', 'b', 'b', 'a'])

log_reg = LogisticRegression(multi_class="ovr", solver="liblinear")
log_reg.fit(X_train, y_train)
print("LogisticRegression score:", log_reg.score(X_train, y_train))
print("LogisticRegression predict:", log_reg.predict(X_train))

svc = SVC(gamma="auto")
svc.fit(X_train, y_train)
print("SVC score:", svc.score(X_train, y_train))
print("SVC predict:", svc.predict(X_train))

voting_clf = VotingClassifier([
    ("log_reg", LogisticRegression(multi_class="ovr", solver="liblinear")),
    ("svc", SVC(gamma="auto")),
])
voting_clf.fit(X_train, y_train)
print("VotingClassifier score:", voting_clf.score(X_train, y_train))
print("VotingClassifier predict:", voting_clf.predict(X_train))

# So far so good, now this is unexpected:
for index, estimator in enumerate(voting_clf.estimators_):
    print("Estimator", index, "score:", estimator.score(X_train, y_train))
    print("Estimator", index, "predict:", estimator.predict(X_train))

Expected Results

I would expect the sub-estimators to produce the same score and predictions as an equivalent classifier trained outside of the VotingClassifier.

LogisticRegression score: 0.5
LogisticRegression predict: ['b' 'b' 'b' 'b']
SVC score: 1.0
SVC predict: ['a' 'b' 'b' 'a']
VotingClassifier score: 1.0
VotingClassifier predict: ['a' 'b' 'b' 'a']
Estimator 0 score: 0.5
Estimator 0 predict: ['b' 'b' 'b' 'b']
Estimator 1 score: 1.0
Estimator 1 predict: ['a' 'b' 'b' 'a']

Actual Results

LogisticRegression score: 0.5
LogisticRegression predict: ['b' 'b' 'b' 'b']
SVC score: 1.0
SVC predict: ['a' 'b' 'b' 'a']
VotingClassifier score: 1.0
VotingClassifier predict: ['a' 'b' 'b' 'a']
Estimator 0 score: 0.0
Estimator 0 predict: [1 1 1 1]
Estimator 1 score: 0.0
Estimator 1 predict: [0 1 1 0]
/Users/ageron/.virtualenvs/ml/lib/python3.6/site-packages/sklearn/metrics/classification.py:182: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  score = y_true == y_pred
/Users/ageron/.virtualenvs/ml/lib/python3.6/site-packages/sklearn/metrics/classification.py:182: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  score = y_true == y_pred

Versions

Darwin-17.7.0-x86_64-i386-64bit
Python 3.6.6 (default, Jun 28 2018, 05:43:53)
[GCC 4.2.1 Compatible Apple LLVM 9.1.0 (clang-902.0.39.2)]
NumPy 1.15.2
SciPy 1.1.0
Scikit-Learn 0.20.0

samwaterbury commented 5 years ago

I think there's a problem in your "Steps/Code to Reproduce" section: in the fourth code block, you are printing the log_reg predictions a second time instead of the SVC results. After correcting it I got:

SVC score: 1.0
SVC predict: ['a' 'b' 'b' 'a']

which is the same as what the voting classifier's SVC is predicting. Does this explain the issue?

If not, I did find this, which could potentially be what's happening: https://github.com/scikit-learn/scikit-learn/issues/11263

ageron commented 5 years ago

Hi @samwaterbury, good catch, thanks. I fixed the code above. However, the issue remains: the sub-estimators expect labels as integers, not strings. It does not seem related to #11263.

jnothman commented 5 years ago

This is done to facilitate alignment in predict_proba, etc., not for efficiency (though there are conceivable benefits there).

I don't see it as a big problem, though:

- it could be better documented
- it would be a troublesome issue if we allowed for pre-fitted estimators

Changing it would be hard to do without breaking backwards compatibility.
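
To illustrate the alignment point made in this comment, here is a minimal sketch (not taken from the thread) using soft voting. It assumes a GaussianNB in place of the SVC, since SVC has no predict_proba unless probability=True, and the names soft_clf and manual_proba are made up for the example. Every sub-estimator ends up with the same integer classes_, so the columns of their predict_proba outputs line up and can be averaged directly.

```python
import numpy as np
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

X_train = np.array([[1., 2.], [3., 4.], [4., 3.], [2., 1.]])
y_train = np.array(['a', 'b', 'b', 'a'])

# Soft voting averages the class probabilities of the sub-estimators.
# GaussianNB is an assumption of this sketch, swapped in for the SVC.
soft_clf = VotingClassifier(
    [("log_reg", LogisticRegression()), ("nb", GaussianNB())],
    voting="soft",
)
soft_clf.fit(X_train, y_train)

# The ensemble exposes the original string labels...
print("VotingClassifier classes_:", soft_clf.classes_)

# ...while each fitted sub-estimator only knows the encoded integers.
for est in soft_clf.estimators_:
    print(type(est).__name__, "classes_:", est.classes_)

# Because the integer classes are shared, the probability columns align
# and an unweighted average reproduces the ensemble's predict_proba.
manual_proba = np.mean(
    [est.predict_proba(X_train) for est in soft_clf.estimators_], axis=0
)
print("Matches ensemble:", np.allclose(manual_proba, soft_clf.predict_proba(X_train)))
```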

samwaterbury commented 5 years ago

It's worth pointing out that VotingClassifier gives access to its internal label encoder via the attribute le_; however, this is not documented.
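
As a rough sketch of how le_ can be used as a workaround (variable names such as y_encoded and decoded are illustrative, and the estimators use default parameters rather than those in the report): encode the true labels before scoring a sub-estimator, and decode its predictions to get the original strings back.

```python
import numpy as np
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X_train = np.array([[1., 2.], [3., 4.], [4., 3.], [2., 1.]])
y_train = np.array(['a', 'b', 'b', 'a'])

voting_clf = VotingClassifier([
    ("log_reg", LogisticRegression()),
    ("svc", SVC()),
])
voting_clf.fit(X_train, y_train)

# The sub-estimators were trained on integer-encoded labels, so the
# true labels must be encoded the same way before scoring.
y_encoded = voting_clf.le_.transform(y_train)

results = {}
for name, estimator in voting_clf.named_estimators_.items():
    # Score against the encoded labels.
    score = estimator.score(X_train, y_encoded)
    # Map the integer predictions back to the original string labels.
    decoded = voting_clf.le_.inverse_transform(estimator.predict(X_train))
    results[name] = (score, decoded)
    print(name, "score:", score)
    print(name, "predict:", decoded)
```

With this, each sub-estimator's score is computed against labels in the same encoding it was trained on, which avoids the failed elementwise comparison seen in the warnings above.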

ageron commented 5 years ago

Thanks @jnothman and @samwaterbury. I agree, it's not a big issue, probably just a sentence or two to add in the documentation, including the le_ tip. I won't be available in the next ~3 weeks, but I can take care of this if it's not done by then.
