@git-halvorson
Hey Dave,
I've been running through your HW2 submission and wanted to give you some feedback.
Firstly - the "official" solutions are posted now. You should be able to do a git pull and find those for review. They're also located here: https://github.com/ga-students/DAT_SF_12/tree/gh-pages/Solutions
So... some comments, in chronological order, running through your homework.
I'm not sure what happened with loading the data via CSV - good workaround, but ideally you should be able to load it directly into pandas. Let me know if that's a problem you keep running into - I can try to help debug it.
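For reference, loading it directly usually looks something like this (the filename here is just a placeholder - swap in whatever file the assignment uses):

import pandas as pd

# Placeholder path - point this at the actual wine CSV from the assignment
wine_df = pd.read_csv('wine.csv')
wine_df.head()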
Your application of KNN and cross-validation is great. You arrived at the solution; however, I wanted to point out a few subtle things. It looks like you used the code from class (which is awesome) - but be sure to think about what the inputs are. For example, you looped through the odd integers from 1 up to 51:
n_neighbors = range(1, 51, 2)
Why 51? That was the number of observations we had in the demo example. In this case 51 isn't a bad choice, but I wanted to make sure it was a conscious one. In this data set we have 178 observations, so technically you could go up to 177, although I think 51 is a better choice... Anyway, small point, but I wanted to make sure you understood why we used 51.
When you chose the number of neighbors (27), I see that you based this off of the graph, which shows the score maxing out around 27. Remember that this graph was built off of only one slice of the data. If you were to re-run the same code with a different random seed (say 1 - see the sketch just below), you'd find a different value for the "optimal" K.
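Roughly something like this - just a sketch, and I'm assuming the same variable names and split settings as the class code, so adjust to match your notebook:

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Same kind of split as before, just with a different random seed
X_train_1, X_test_1, y_train_1, y_test_1 = train_test_split(X, y, test_size=0.3, random_state=1)

n_neighbors = range(1, 51, 2)
scores = []
for k in n_neighbors:
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(X_train_1, y_train_1)
    scores.append(clf.score(X_test_1, y_test_1))

# The best-scoring K on this slice will most likely not be 27
print(n_neighbors[scores.index(max(scores))])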
The point is, be careful picking K values based off of a random slice of the data; sometimes you end up overfitting the model to that single slice. Another way to do this would be to fit the model and score it using cross-validation before choosing your K value.
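For example, something along these lines (again just a sketch, assuming the same X_train / y_train names) picks K off of the mean cross-validated score instead of a single split:

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Mean 5-fold cross-validation accuracy for each candidate K
candidate_ks = list(range(1, 51, 2))
cv_scores = []
for k in candidate_ks:
    clf = KNeighborsClassifier(n_neighbors=k)
    cv_scores.append(cross_val_score(clf, X_train, y_train, cv=5).mean())

best_k = candidate_ks[cv_scores.index(max(cv_scores))]
print(best_k, max(cv_scores))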
In this case, 27 isn't BAD - I just wanted to highlight that your choice of K was based off of only one portion of the data.
This is kind of bonus material, but did you notice that proline is of a bigger magnitude than all the other variables? It ranges from 0-168 while some of the other variables are much, much smaller. This scale problem over-amplifies the effect of proline. If you scale the data, you can get an accuracy of 96%.
from sklearn.preprocessing import StandardScaler
from sklearn import neighbors

# Scale the features so big-magnitude columns like proline don't dominate the distances
features_scaler = StandardScaler()
X_train_scaled = features_scaler.fit_transform(X_train)
X_test_scaled = features_scaler.transform(X_test)  # same scaling applied to the (assumed) X_test/y_test split

clf_scaled = neighbors.KNeighborsClassifier(3, weights='uniform')
clf_scaled.fit(X_train_scaled, y_train)
clf_scaled.score(X_test_scaled, y_test)  # this is where the ~96% comes from
For part two (clustering), take a look at the solution set and let me know if you have questions. I sort of see the direction you were going by choosing the top two features, but you can actually do the clustering with the full data set - it's just harder to visualize.
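If you want to give it a shot, here's a rough sketch of clustering on all of the features at once (assuming the full feature matrix is in X - and scaling first, for the same proline reason as above):

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Put every feature on the same scale so no single column dominates the distances
X_scaled = StandardScaler().fit_transform(X)

# Cluster on every feature, not just the top two; three clusters to match the three wine classes
km = KMeans(n_clusters=3, random_state=0)
cluster_labels = km.fit_predict(X_scaled)

If you want a picture afterwards, one common trick is to project X_scaled down to two dimensions with PCA and color the points by cluster_labels.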
Let me know if you have questions!
Thanks