j-adamczyk / Towards_Data_Science

Links to my Towards Data Science articles

ValueError: Cannot index with multidimensional key #1

Closed: koolmoo closed this issue 4 years ago

koolmoo commented 4 years ago

Hi Jakub,

Thanks for providing your KNN implementation using faiss. I'm working with a large dataset (566602 rows × 20 columns) and KNeighborsClassifier took way too long, so I was hoping your implementation would help.

The problem is, I'm applying one-hot encoding to my categorical features and this seems to be causing the following error in the predict method of the classifier: ValueError: Cannot index with multidimensional key

To demonstrate this, here's a code sample that results in the same error on a different dataset:

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split

X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)

numeric_features = ['age', 'fare']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_features = ['embarked', 'sex', 'pclass']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', FaissKNeighbors())])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

clf.fit(X_train, y_train)
clf.predict(X_test)

Here is the full stack trace:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-35-373ecdcebcf0> in <module>()
     29 
     30 clf.fit(X_train, y_train)
---> 31 clf.predict(X_test)

6 frames
/usr/local/lib/python3.6/dist-packages/sklearn/utils/metaestimators.py in <lambda>(*args, **kwargs)
    114 
    115         # lambda, but not partial, allows help() to work with update_wrapper
--> 116         out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
    117         # update the docstring of the returned function
    118         update_wrapper(out, self.fn)

/usr/local/lib/python3.6/dist-packages/sklearn/pipeline.py in predict(self, X, **predict_params)
    418         for _, name, transform in self._iter(with_final=False):
    419             Xt = transform.transform(Xt)
--> 420         return self.steps[-1][-1].predict(Xt, **predict_params)
    421 
    422     @if_delegate_has_method(delegate='_final_estimator')

<ipython-input-24-c1e7d3808324> in predict(self, X)
     15     def predict(self, X):
     16         distances, indices = self.index.search(X.astype(np.float32), k=self.k)
---> 17         votes = self.y[indices]
     18         predictions = np.array([np.argmax(np.bincount(x)) for x in votes])
     19         return predictions

/usr/local/lib/python3.6/dist-packages/pandas/core/series.py in __getitem__(self, key)
    908             key = check_bool_indexer(self.index, key)
    909 
--> 910         return self._get_with(key)
    911 
    912     def _get_with(self, key):

/usr/local/lib/python3.6/dist-packages/pandas/core/series.py in _get_with(self, key)
    941         if key_type == "integer":
    942             if self.index.is_integer() or self.index.is_floating():
--> 943                 return self.loc[key]
    944             else:
    945                 return self._get_values(key)

/usr/local/lib/python3.6/dist-packages/pandas/core/indexing.py in __getitem__(self, key)
   1766 
   1767             maybe_callable = com.apply_if_callable(key, self.obj)
-> 1768             return self._getitem_axis(maybe_callable, axis=axis)
   1769 
   1770     def _is_scalar_access(self, key: Tuple):

/usr/local/lib/python3.6/dist-packages/pandas/core/indexing.py in _getitem_axis(self, key, axis)
   1950 
   1951                 if hasattr(key, "ndim") and key.ndim > 1:
-> 1952                     raise ValueError("Cannot index with multidimensional key")
   1953 
   1954                 return self._getitem_iterable(key, axis=axis)

ValueError: Cannot index with multidimensional key

As you can see, the error is happening on line 17 at votes = self.y[indices]. I have tested the implementation on the Iris dataset without any preprocessing and it works fine, so I believe it's related to the one-hot encoding. Please let me know if you have a fix or if one-hot encoding is not necessary for this implementation. Thanks again!

j-adamczyk commented 4 years ago

Hi! Judging from the stack trace, the error you're getting is an indexing error in Pandas. One of the arrays probably has too many dimensions. Can you print the shape of self.y, the first few lines of self.y, the first few lines of indices, and the types of self.y and indices?
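
A minimal sketch of those diagnostics, using small stand-in data: the values of y and indices below are assumptions for illustration, not the reporter's data. In the real classifier they would be self.y and the second array returned by self.index.search().

```python
import numpy as np
import pandas as pd

# Stand-in data (assumed): labels as a Pandas Series and neighbor
# indices of shape (n_queries, k), as a k-NN search would return them.
y = pd.Series([0, 1, 1, 0, 1])
indices = np.array([[0, 1, 2], [3, 0, 4]])

# Types and shapes of both objects
print(type(y), y.shape)
print(type(indices), indices.shape)

# First few rows of each
print(y.head())
print(indices[:5])

# A 2-D integer key is fine for a NumPy array: each row of `votes`
# holds the labels of the k neighbors of one query point.
votes = y.to_numpy()[indices]
print(votes.shape)
```

If type(y) turns out to be a Series while indices is a 2-D array, that combination is a likely source of the indexing error.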

On a side note, you have quite a lot of data. If IndexFlatL2 takes too long, try some other approaches, e.g. IndexLSH.

j-adamczyk commented 4 years ago

I've found the error. The problem is that self.y is a Pandas Series object, which does not support NumPy-style indexing with a 2-D array such as self.y[indices]. The easiest way to fix this is to change the .fit() method like this:

        self.y = np.array(self.y, dtype=int)

This way we get y as a NumPy array and the error goes away. The additional data that the Series object carries (such as its index) is lost, though. If you care about it, you can instead convert y every time in the .predict() method:

        tmp_y = np.array(self.y, dtype=int)
        votes = tmp_y[indices]

Beware that this is inefficient, though, since the conversion is repeated on every call. The best option is a space-time tradeoff, where we keep two copies of y: one as a NumPy array for internal use and one as the original Pandas Series. The full code for it:

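import faiss
import numpy as np


class FaissKNeighbors:
    def __init__(self, k=5):
        self.index = None
        self.y = None      # original labels, e.g. a Pandas Series
        self._y_np = None  # NumPy copy of the labels, used for indexing
        self.k = k

    def fit(self, X, y):
        # faiss expects float32 input
        self.index = faiss.IndexFlatL2(X.shape[1])
        self.index.add(X.astype(np.float32))
        self.y = y
        self._y_np = np.array(y, dtype=int)

    def predict(self, X):
        # indices has shape (n_samples, k), so it must index a NumPy array
        distances, indices = self.index.search(X.astype(np.float32), k=self.k)
        votes = self._y_np[indices]
        # majority vote among the k nearest neighbors
        predictions = np.array([np.argmax(np.bincount(x)) for x in votes])
        return predictions

koolmoo commented 4 years ago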

Thanks Jakub! That indeed fixed the issue and I was able to run the code sample posted above successfully.

Unfortunately, this didn't translate well to my actual dataset and I got an error in fit: AssertionError: assert x.flags.contiguous. I first tried to solve this with np.ascontiguousarray as suggested here: https://github.com/facebookresearch/faiss/issues/459. That didn't work, so I printed X and realized that the one-hot encoder was outputting a sparse matrix of type csr_matrix rather than an ndarray. After setting OneHotEncoder(sparse=False), I was able to fit and predict on my dataset without issue.
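
For anyone who would rather keep the encoder output sparse and convert just before the classifier, here is a rough sketch of that conversion. The helper name to_faiss_input and the small stand-in matrix are assumptions for illustration; they are not part of faiss or scikit-learn.

```python
import numpy as np
from scipy import sparse

# Stand-in for one-hot encoder output (assumed): a small CSR matrix.
X_sparse = sparse.csr_matrix(
    np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=np.float64)
)


def to_faiss_input(X):
    """Hypothetical helper: return X as a dense, C-contiguous float32 array."""
    if sparse.issparse(X):
        # Densify first; a sparse matrix is not an ndarray.
        X = X.toarray()
    return np.ascontiguousarray(X, dtype=np.float32)


X_dense = to_faiss_input(X_sparse)
print(type(X_dense), X_dense.dtype, X_dense.flags["C_CONTIGUOUS"])
```

Note that densifying a large one-hot matrix can use a lot of memory. In newer scikit-learn releases the encoder parameter is named sparse_output instead of sparse.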

The one thing I'm still unable to figure out is why it doesn't work with cross_validate, but that's most likely due to incompatible data types between the y_test and y_pred arrays. I'll let you know if I find any other issues or possible enhancements. Thanks!
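
If the mismatch between y_test and y_pred really is the cause (this is a guess, not something confirmed in this thread), one possible approach is to encode the labels as integers for voting and map the predictions back to the original labels afterwards. A minimal sketch with stand-in labels and neighbor indices, both assumed for illustration:

```python
import numpy as np

# Stand-in training labels (assumed): strings rather than integers.
y_train = np.array(["0", "1", "1", "0", "1"])

# Encode labels as integers 0..n_classes-1; `classes` keeps the originals.
classes, y_encoded = np.unique(y_train, return_inverse=True)

# Stand-in neighbor indices of shape (n_queries, k), as a k-NN search
# would return them.
indices = np.array([[0, 1, 2], [3, 0, 4]])

# Majority vote over the integer codes of each query's neighbors
votes = y_encoded[indices]
majority = np.array([np.argmax(np.bincount(row)) for row in votes])

# Map the winning codes back to the original label values and dtype.
y_pred = classes[majority]
print(y_pred, y_pred.dtype)
```

With this, the predictions have the same label type as the training labels, which should make them directly comparable with y_test.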