rasbt / python-machine-learning-book

The "Python Machine Learning (1st edition)" book code repository and info resource

NumPy FutureWarning when using plot_decision_regions function #22

Closed · seth814 closed this issue 8 years ago

seth814 commented 8 years ago

Sebastian,

I've been collecting my own data and have applied the plot_decision_regions function to it several times, but I'm running into a problem with this new dataset. The problem occurs here:

# plot class samples
for idx, cl in enumerate(np.unique(y)):
    plt.scatter(x=X[y == cl, 0], y=X[y == cl, 1],
                alpha=0.8, c=cmap(idx),
                marker=markers[idx], label=cl)

My enumerated object is [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)], so five classes encoded as the integers 0 through 4.

From what I understand, this loop passes over my X_train_pca data five times and uses the boolean comparison y == cl to plot my data points in five different colors as it steps through the markers and the colormap.

Upon running, I get the warning:

FutureWarning: in the future, boolean array-likes will be handled as a boolean array index
  plt.scatter(x=X[y == cl, 0], y=X[y == cl, 1],

The really weird part is the values in the array X[y == cl, 0]. They now look like

[-0.4277726 -0.4277726 -0.44362509 ..., -0.4277726 -0.4277726 -0.4277726]

with shape (9784,), which is the original length of my X_train_pca data. (I believe it should be closer to a fifth of that, since my classes are roughly equal in size; I checked np.shape after the loop ran.)
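To make sure I understand the warning, here's a minimal sketch (with made-up numbers) of the two interpretations it seems to describe: a boolean array-like such as a plain list is currently treated as integer indices, which would explain why my result keeps the original length:

import numpy as np

X = np.array([[10., 1.], [20., 2.], [30., 3.], [40., 4.]])
mask = [True, False, True, False]  # a boolean array-like (plain list, not an ndarray)

# Interpreted as the integers [1, 0, 1, 0], fancy indexing returns one row per
# mask element -- the same length as the original array:
print(X[np.asarray(mask, dtype=int), 0])   # [20. 10. 20. 10.]

# A genuine boolean ndarray selects only the masked rows:
print(X[np.asarray(mask, dtype=bool), 0])  # [10. 30.]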

To give a visual, my data currently looks like this:

[image: the resulting plot of my data]

It should instead be separated into colors, with a spread that looks like this:

[image: the expected plot, with the five classes in distinct colors]

I can't really think through the problem any further, probably due to a misunderstanding of what this FutureWarning is trying to tell me. I'm wondering if you have any ideas about what might cause this behavior.

rasbt commented 8 years ago

Hi seth814, hope we can figure out what's going on in your case. I think the easiest way would be for you to upload your script (if you are okay with sharing it) so that I can take a look and inspect what inside the plot_decision_regions function may be causing this behavior on your dataset.

One thing I can think of is unsupported input shapes of the NumPy arrays X and y in plot_decision_regions(X, y, classifier, resolution=0.02). This decision-region plotting function expects X and y in the shapes that scikit-learn works with: the y array has to be a 1D integer array, and X has to be a 2D float (or integer) array.

It would be nice if you could check your input data and let me know what the output of the print calls below looks like; that would be very helpful.

Input:

import numpy as np

# y: class labels as a 1D integer array, shape (n_samples,)
y = np.array([1, 2, 0, 0, 2])

# X: features as a 2D float array, shape (n_samples, n_features)
X = np.array([[1., 2.],
              [3., 4.],
              [5., 6.],
              [8., 9.],
              [7., 8.]])

print('y:', y.shape, y.dtype)
print('X:', X.shape, X.dtype)

Output:

y: (5,) int64
X: (5, 2) float64

The above is an example of what the expected shapes look like.

PS: I have a slightly more sophisticated function implemented here: http://rasbt.github.io/mlxtend/user_guide/evaluate/plot_decision_regions/
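For reference, here is a minimal usage sketch of that function on toy data (in recent mlxtend versions the import path is mlxtend.plotting; older versions had it under mlxtend.evaluate, and the toy data below just stands in for your own X and y):

from mlxtend.plotting import plot_decision_regions
from sklearn.datasets import make_blobs
from sklearn.svm import SVC
import matplotlib.pyplot as plt

# toy 2D data with integer labels, standing in for the thread's dataset
X, y = make_blobs(n_samples=150, centers=3, n_features=2, random_state=0)

clf = SVC(kernel='rbf').fit(X, y)      # any fitted classifier works here
plot_decision_regions(X, y, clf=clf)   # plots the 2D decision regions
plt.show()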

I am currently a bit busy (at SciPy 2016), but several people have asked me about 3D decision regions recently, which I am going to add soon!

seth814 commented 8 years ago

No worries. I'm not in a huge hurry, but I am curious as to what is going on in the function.

I uploaded the data and the file under Vertical Abduction in my repo. I tried to upload a zip, but GitHub said the format wasn't supported. The shapes and data types are both correct, so it's probably something else.

rasbt commented 8 years ago

About the FutureWarning: I think that's not an issue here; it comes from the fact that y_train is a pandas DataFrame, not a NumPy array. I'd just recommend putting y_train = y_train.values into your code.
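For illustration, here's a minimal sketch of that conversion (assuming y_train is a single-column DataFrame, which is a guess about your setup):

import pandas as pd

# stand-in for the labels you load from file
y_train = pd.DataFrame({'label': [0, 1, 2, 0, 1]})

# .values returns the underlying NumPy array; ravel() flattens (n, 1) to (n,)
y_train = y_train.values.ravel()
print(y_train.shape, y_train.dtype)  # (5,) int64 -- the 1D integer array expected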

Hm, about the plot itself, I don't think this is a bug. This is how the decision region of the SVM looks in this case; you may want to do some hyperparameter tuning here. E.g., when I plot the first 2 dimensions of the input data, it kind of looks like this:

plt.scatter(X_train.values[:, 0], X_train.values[:, 1])

[image: scatter plot of the first two dimensions of X_train]
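As an aside, here's a minimal sketch of the hyperparameter tuning I mean, on toy data (the grid values are just coarse starting points; swap in your own X_train and y_train as NumPy arrays):

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# toy stand-in for the thread's training data
X_train, y_train = make_classification(n_samples=200, n_features=2,
                                       n_informative=2, n_redundant=0,
                                       random_state=0)

# coarse grid over the two main RBF-SVM knobs; refine around the best values
param_grid = {'C': [0.1, 1.0, 10.0, 100.0],
              'gamma': [0.001, 0.01, 0.1, 1.0]}

gs = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5)
gs.fit(X_train, y_train)
print(gs.best_params_, gs.best_score_)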

So, for a more visually pleasing analysis, you could maybe try a non-linear dimensionality reduction technique (e.g., kernel PCA, or other algorithms for manifold learning that are implemented in scikit-learn: http://scikit-learn.org/stable/modules/manifold.html)
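For example, a minimal kernel PCA sketch on toy data (the gamma value is a placeholder that needs tuning; swap in your own arrays for the toy ones):

from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA
import matplotlib.pyplot as plt

# toy non-linearly separable data, standing in for the thread's dataset
X_train, y_train = make_circles(n_samples=200, factor=0.3, noise=0.05,
                                random_state=0)

kpca = KernelPCA(n_components=2, kernel='rbf', gamma=10.0)  # tune gamma
X_kpca = kpca.fit_transform(X_train)

# the two kernel principal components, colored by class label
plt.scatter(X_kpca[:, 0], X_kpca[:, 1], c=y_train)
plt.show()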