TypeError: list indices must be integers, not tuples; when using score() or predict() on a fitted EnsembleVoteClassifier

rasbt / mlxtend

A library of extension and helper modules for Python's data analysis and machine learning libraries.

https://rasbt.github.io/mlxtend/


TypeError: list indices must be integers, not tuples; when using score() or predict() on a fitted EnsembleVoteClassifier #226

Closed erinversfeldcodes closed 7 years ago

erinversfeldcodes commented 7 years ago

I have two trained classifiers from which I am constructing an EnsembleVoteClassifier. I want to gauge the accuracy of this ensemble, so I expect to be able to call score() using the test split of my data. However, calls to score() and predict() throw the TypeError given in the title of this issue. The code is given below. Any ideas as to how I can resolve this?

    pipe1 = make_pipeline(ColumnSelector(cols=(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14)), classifier1)
    pipe2 = make_pipeline(ColumnSelector(cols=(15, 16, 17, 18, 19, 20, 21, 22)), classifier2)

    data = read_data(step=40)
    X = data[0]
    Y = data[1]
    x_train, x_test, y_train, y_test = train_test_split(X, Y, random_state=0)

    ensemble = EnsembleVoteClassifier(clfs=[pipe1, pipe2], voting='soft', weights=[1, 1], verbose=2, refit=False)
    ensemble.fit(list(x_train), list(y_train))  # this was giving the same error until I parsed x_train and y_train as lists
    ensemble.predict(list(x_test))  # produces the error
    ensemble.score(list(x_test), list(y_test))  # produces the error

rasbt commented 7 years ago

Hm, I suspect that your data may not be in the expected [n_samples, n_features] format. In scikit-learn and also here, X should be an [n_samples, n_features] array, and y should be an [n_samples] array.

After executing x_train, x_test, y_train, y_test = train_test_split(X, Y, random_state=0), can you please provide the output of the following:

for d in (x_train, x_test, y_train, y_test):
    print(d.shape)

This would help a lot with better understanding what's going on.

erinversfeldcodes commented 7 years ago

I get AttributeError: 'list' object has no attribute 'shape'

erinversfeldcodes commented 7 years ago

But after casting everything to a numpy array like so:

for d in (x_train, x_test, y_train, y_test):
    print(np.array(d).shape)

I get the following:

(93L, 858L)
(32L, 858L)
(93L,)
(32L,)

And if I then cast everything to a numpy array instead of a list, I get:

ValueError: shapes (32, 14) and (1386, 103) not aligned: 14 (dim 1) != 1386 (dim 0)

Which is at least a different error, if a little annoying...

rasbt commented 7 years ago

Thanks for the extra info. I have a suspicion of what might be the problem, but could you share more details about the error message? E.g., the full error stack you are/were getting so that I could trace down the line of code that throws this error?

PS: I think this problem is not unique to the EnsembleVoteClassifier. For example, if you replace

    ensemble = EnsembleVoteClassifier(clfs=[pipe1, pipe2], voting='soft', weights=[1, 1], verbose=2, refit=False)
    ensemble.fit(list(x_train), list(y_train))  # this was giving the same error until I parsed x_train and y_train as lists
    ensemble.predict(list(x_test))  # produces the error
    ensemble.score(list(x_test), list(y_test))  # produces the error

with

    from sklearn.linear_model import LogisticRegression

    pipe2 = make_pipeline(ColumnSelector(cols=(15, 16, 17, 18, 19, 20, 21, 22)), LogisticRegression())
    pipe2.fit(list(x_train), list(y_train))  # this was giving the same error until I parsed x_train and y_train as lists
    pipe2.predict(list(x_test))  # produces the error
    pipe2.score(list(x_test), list(y_test))  # produces the error

does the same error occur?

erinversfeldcodes commented 7 years ago

Traceback (most recent call last):
  File "[path]/HonoursProject/Myo/__init__.py", line 259, in <module>
    set_up()
  File "[path]/HonoursProject/Myo/__init__.py", line 76, in set_up
    ensemble = voting_ensemble_classifier(spatial_classifier, gestural_classifier)
  File "[path]\HonoursProject\Myo\ensemble_classifiers\voting.py", line 34, in voting_ensemble_classifier
    ensemble.score(np.array(x_test), np.array(y_test))
  File "[path]\lib\site-packages\sklearn\base.py", line 349, in score
    return accuracy_score(y, self.predict(X), sample_weight=sample_weight)
  File "[path]\lib\site-packages\mlxtend\classifier\ensemble_vote.py", line 188, in predict
    maj = np.argmax(self.predict_proba(X), axis=1)
  File "[path]\lib\site-packages\mlxtend\classifier\ensemble_vote.py", line 221, in predict_proba
    avg = np.average(self._predict_probas(X), axis=0, weights=self.weights)
  File "[path]\lib\site-packages\mlxtend\classifier\ensemble_vote.py", line 263, in _predict_probas
    return np.asarray([clf.predict_proba(X) for clf in self.clfs_])
  File "[path]\lib\site-packages\sklearn\utils\metaestimators.py", line 54, in <lambda>
    out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
  File "[path]\lib\site-packages\sklearn\pipeline.py", line 377, in predict_proba
    return self.steps[-1][-1].predict_proba(Xt)
  File "[path]\lib\site-packages\sklearn\neural_network\multilayer_perceptron.py", line 1016, in predict_proba
    y_pred = self._predict(X)
  File "[path]\lib\site-packages\sklearn\neural_network\multilayer_perceptron.py", line 676, in _predict
    self._forward_pass(activations)
  File "[path]\lib\site-packages\sklearn\neural_network\multilayer_perceptron.py", line 104, in _forward_pass
    self.coefs_[i])
  File "[path]\lib\site-packages\sklearn\utils\extmath.py", line 189, in safe_sparse_dot
    return fast_dot(a, b)
ValueError: shapes (32,14) and (1386,103) not aligned: 14 (dim 1) != 1386 (dim 0)
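
For readers following the traceback: the last frames multiply the input batch by the first weight matrix of the MLP. The sketch below reproduces the same kind of failure with plain NumPy. The array names are made up for illustration, and the shapes are copied from the error message. Assuming classifier1 is an already fitted MLPClassifier (as the multilayer_perceptron frames suggest), the message would be consistent with a model trained on 1386 input features receiving only the 14 columns chosen by ColumnSelector.

```python
import numpy as np

# Hypothetical stand-ins: a batch of 32 samples with 14 selected columns,
# and a first-layer weight matrix that expects 1386 input features.
activations = np.zeros((32, 14))
first_layer_weights = np.zeros((1386, 103))

try:
    np.dot(activations, first_layer_weights)
    message = None
except ValueError as exc:
    # The inner dimensions (14 and 1386) do not match, so the product fails.
    message = str(exc)

print(message)
```

If this reading is right, the number of columns reaching each pre-trained classifier has to equal the number of features it was originally fitted on.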

rasbt commented 7 years ago

Hm, thanks, but this is not as helpful as I thought :P. For debugging purposes, can you try the LogisticRegression example suggested in my previous comment? And if that works, can you try using LogisticRegression instead of the MLPClassifier in your pipe1 and pipe2? Maybe it's a bug in the MLP.

erinversfeldcodes commented 7 years ago

And yeah, I get the same error if I try to score the pipe using a logistic regression.

Traceback (most recent call last):
  File "[path]/HonoursProject/Myo/__init__.py", line 257, in <module>
    set_up()
  File "[path]/HonoursProject/Myo/__init__.py", line 76, in set_up
    ensemble_accuracy = voting_ensemble_classifier(spatial_classifier, gestural_classifier)
  File "[path]\HonoursProject\Myo\ensemble_classifiers\voting.py", line 29, in voting_ensemble_classifier
    test_pipe.fit(list(x_train), list(y_train))
  File "[path]\lib\site-packages\sklearn\pipeline.py", line 268, in fit
    Xt, fit_params = self._fit(X, y, **fit_params)
  File "[path]\lib\site-packages\sklearn\pipeline.py", line 234, in _fit
    Xt = transform.fit_transform(Xt, y, **fit_params_steps[name])
  File "[path]\lib\site-packages\mlxtend\feature_selection\column_selector.py", line 44, in fit_transform
    return self.transform(X=X, y=y)
  File "[path]\lib\site-packages\mlxtend\feature_selection\column_selector.py", line 62, in transform
    t = X[:, self.cols]
TypeError: list indices must be integers, not tuple

It's entirely possible that my use case is the problem, rather than the framework. I'm trying to combine two classifiers, each trained on a different data set. Both data sets describe the same thing, but using different measurements. The one data set is smaller than the other. I'm trying to see if I can get a more accurate model by combining the two using the EnsembleVoteClassifier.
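
The TypeError at the bottom of the traceback above comes from the line t = X[:, self.cols], which uses NumPy-style two-dimensional indexing. A minimal sketch of the difference, using made-up data (the names rows and cols are illustrative, not taken from the project): a plain Python list rejects this kind of index, while the same data wrapped in a NumPy array accepts it.

```python
import numpy as np

rows = [[1, 2, 3, 4], [5, 6, 7, 8]]  # plain list of lists, like list(x_train)
cols = (1, 3)                        # column indices, like ColumnSelector(cols=...)

try:
    rows[:, cols]                    # lists do not support (slice, tuple) indices
    list_error = None
except TypeError as exc:
    list_error = exc

X = np.asarray(rows)                 # a 2-D array supports column selection
selected = X[:, cols]

print(type(list_error).__name__)
print(selected.shape)
```

This suggests passing NumPy arrays rather than lists to fit(), predict(), and score() when a ColumnSelector is part of the pipeline.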

rasbt commented 7 years ago

Hm, both

1. fitting classifiers on different feature subsets
2. fitting classifiers on datasets with different numbers of samples

should work. E.g., see the examples below:

1

import numpy as np
from sklearn.datasets import load_iris
from mlxtend.classifier import EnsembleVoteClassifier
from mlxtend.feature_selection import ColumnSelector
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

iris = load_iris()
X = iris.data
y = iris.target
idx = np.arange(X.shape[0])
np.random.shuffle(idx)
X, y = X[idx], y[idx]

pipe1 = make_pipeline(ColumnSelector(cols=[0, 1]), LogisticRegression())
pipe2 = make_pipeline(ColumnSelector(cols=[2, 3]), LogisticRegression())

ens = EnsembleVoteClassifier(clfs=[pipe1, pipe2])
ens.fit(X, y)
ens.score(X, y)

2

pipe1 = make_pipeline(ColumnSelector(cols=[0, 1]), LogisticRegression())
pipe1.fit(X[:100], y[:100])

pipe2 = make_pipeline(ColumnSelector(cols=[2, 3]), LogisticRegression())
pipe2.fit(X[:50], y[:50])

ens = EnsembleVoteClassifier(clfs=[pipe1, pipe2], refit=False)
ens.fit(X, y)
ens.score(X, y)

I am currently not seeing what the issue might be in your case, but maybe it's related to the format of your dataset. You can try running the examples above on your dataset after setting X = x_train and y = y_train to see whether they run okay; that should give a better picture of what's going on.

erinversfeldcodes commented 7 years ago

It looks like this error was caused by the format of the dataset, as you suggested. When combining the data into a single CSV, Python was adding blank lines everywhere 🤦‍♀️
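
The thread does not say how the combined file was written, so the following is only a sketch under an assumption: that the CSV was produced with Python's csv module. In Python 3, opening the output file without newline="" can add an extra carriage return per row on Windows, which shows up as blank lines. The file name and rows below are hypothetical.

```python
import csv
import os
import tempfile

rows = [[1.0, 2.0, "a"], [3.0, 4.0, "b"]]  # made-up data for illustration
path = os.path.join(tempfile.mkdtemp(), "combined.csv")

# newline="" lets the csv module control line endings itself.
with open(path, "w", newline="") as f:
    csv.writer(f).writerows(rows)

# When reading, empty rows can be skipped defensively.
with open(path, newline="") as f:
    loaded = [row for row in csv.reader(f) if row]

print(len(loaded))
```

Checking the number of loaded rows against the expected sample count is a quick way to catch this kind of formatting problem before fitting.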

rasbt commented 7 years ago

Glad to hear that it's fixed now!