Closed koolmoo closed 4 years ago
Hi!
Judging from the stack trace, the error you're getting is an indexing error in Pandas. One of the arrays probably has too many dimensions. Can you try printing the shape of `self.y`, the first few lines of `self.y`, the first few lines of `indices`, and the types of `self.y` and `indices`?
On a side note, you have quite a lot of data. If `IndexFlatL2` takes too long, try some other approaches, e.g. `IndexLSH`.
I've found the error. The problem is that `self.y` is a Pandas Series object and does not support Numpy-style indexing with `self.y[indices]`.
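The difference is easy to demonstrate in isolation; a small sketch, where the `indices` array mimics the 2-D neighbor matrix that faiss's `search` returns:

```python
import numpy as np
import pandas as pd

y = pd.Series([0, 1, 2, 1])
indices = np.array([[0, 1], [2, 3]])  # 2-D neighbor indices, like faiss output

# A NumPy array accepts a multidimensional fancy index:
votes = np.array(y)[indices]
print(votes.shape)  # (2, 2)

# The same indexing on the Series raises the reported error:
try:
    y[indices]
except ValueError as e:
    print(e)  # e.g. "Cannot index with multidimensional key"
```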
The easiest way to fix this is to change the `.fit()` method like this:

```python
self.y = np.array(self.y, dtype=np.int64)
```
This way we get `y` as a Numpy array and the error is gone. However, the additional data the Series object carries (its index, name, etc.) is lost. If you care about it, you can convert it every time in the `.predict()` method instead:

```python
tmp_y = np.array(self.y, dtype=np.int64)
votes = tmp_y[indices]
```

Beware that this is inefficient though, since the conversion happens on every call. The best option is a space-time tradeoff, where we keep 2 copies of `y` — one as a Numpy array for internal use and one as the original Pandas Series. The full code for it:
```python
class FaissKNeighbors:
    def __init__(self, k=5):
        self.index = None
        self.y = None
        self._y_np = None
        self.k = k

    def fit(self, X, y):
        self.index = faiss.IndexFlatL2(X.shape[1])
        self.index.add(X.astype(np.float32))
        self.y = y                                # original Series, kept for the caller
        self._y_np = np.array(y, dtype=np.int64)  # Numpy copy for internal indexing

    def predict(self, X):
        distances, indices = self.index.search(X.astype(np.float32), k=self.k)
        votes = self._y_np[indices]
        predictions = np.array([np.argmax(np.bincount(x)) for x in votes])
        return predictions
```
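The voting step in `predict` can be checked on its own, independent of faiss; a small sketch with hand-made neighbor labels:

```python
import numpy as np

# Each row holds the labels of one query point's k nearest neighbors.
votes = np.array([[0, 1, 1, 1, 2],
                  [2, 2, 0, 2, 2]])

# np.bincount counts label occurrences per row; np.argmax picks the most frequent.
predictions = np.array([np.argmax(np.bincount(row)) for row in votes])
print(predictions)  # [1 2]
```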
Thanks Jakub! That indeed fixed the issue and I was able to run the code sample posted above successfully.
Unfortunately, this didn't translate well to my actual dataset and I got an issue in `fit`: `AssertionError: assert x.flags.contiguous`. I first tried to solve this with `np.ascontiguousarray` as suggested here: https://github.com/facebookresearch/faiss/issues/459. That didn't work, so I printed `X` and realized that the one-hot encoder was outputting a sparse matrix of type `csr_matrix` rather than an `ndarray`. After setting `OneHotEncoder(sparse=False)`, I was able to fit and predict on my dataset without issue.
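If changing the encoder isn't an option, the sparse output can also be densified just before indexing; a sketch using a scipy CSR matrix as a stand-in for the encoder's default output:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Stand-in for OneHotEncoder's default output: a scipy CSR sparse matrix.
X_sparse = csr_matrix(np.eye(4, dtype=np.float32))

# faiss needs a contiguous float32 ndarray: densify with .toarray(), then
# np.ascontiguousarray guards against any non-contiguous memory layout.
X_dense = np.ascontiguousarray(X_sparse.toarray().astype(np.float32))
print(type(X_dense).__name__, X_dense.flags["C_CONTIGUOUS"])  # ndarray True
```

Note that in recent scikit-learn versions the encoder parameter is spelled `sparse_output=False` rather than `sparse=False`.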
The one thing I'm still unable to figure out is why it doesn't work with `cross_validate`, but that's most likely due to incompatible data types between the `y_test` and `y_pred` arrays. I'll let you know if I find any other issues or possible enhancements. Thanks!
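Another likely culprit: `cross_validate` clones the estimator, which requires scikit-learn's estimator API (`get_params`/`set_params`, usually via `BaseEstimator`, plus `fit` returning `self`). A minimal sketch of that pattern — the `MajorityClassifier` here is a hypothetical stub standing in for the faiss wrapper, so the example stays runnable without faiss:

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.model_selection import cross_validate

class MajorityClassifier(BaseEstimator, ClassifierMixin):
    """Stub with the same fit/predict contract as the faiss KNN wrapper."""
    def fit(self, X, y):
        # Always predict the most frequent training label.
        self.majority_ = int(np.argmax(np.bincount(np.asarray(y, dtype=np.int64))))
        return self  # cross_validate expects fit to return self
    def predict(self, X):
        return np.full(len(X), self.majority_, dtype=np.int64)

X = np.random.rand(20, 3)
y = np.array([0] * 12 + [1] * 8)
scores = cross_validate(MajorityClassifier(), X, y, cv=4)
print(scores["test_score"])  # four per-fold accuracy scores
```

Subclassing `BaseEstimator` and `ClassifierMixin` in the faiss wrapper the same way (keeping `__init__` parameters as plain attributes) should make it cloneable and scorable.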
Hi Jakub,
Thanks for providing your KNN implementation using faiss. I'm working with a large dataset (566602 rows × 20 columns) and `KNeighborsClassifier` took way too long, so I was hoping your implementation would help.
The problem is, I'm applying one-hot encoding to my categorical features, and this seems to be causing the following error in the `predict` method of the classifier: `ValueError: Cannot index with multidimensional key`
To demonstrate this, here's a code sample that results in the same error on a different dataset:
Here is the full stack trace:
As you can see, the error is happening on line 17 at `votes = self.y[indices]`. I have tested the implementation on the Iris dataset without any preprocessing and it works fine, so I believe it's related to the one-hot encoding. Please let me know if you have a fix, or if one-hot encoding is not necessary for this implementation. Thanks again!