mattlichti / gen_assembly

Repository for data science 10 course

HW2 peer review #2

Open maortizreyes opened 10 years ago

maortizreyes commented 10 years ago

Matt,

```python
In [24]: import pandas as pd
         import numpy as np

In [25]: Iris_data = pd.read_csv('iris.csv', header=None)
```

Up to this point you've imported the data into a pandas data frame. It's helpful to see that you used header=None to capture the first line of the data, which would otherwise have been read as column headers.
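To double-check my own understanding of header=None, here's a quick sketch (mine, not from your notebook; iris.csv has no header row, so its first line is a data row):

```python
import pandas as pd

# With header=None every row is data and the columns get integer
# names 0..4; without it, pandas would promote the first data row
# of iris.csv to column headers and the frame would be one row short.
with_arg = pd.read_csv('iris.csv', header=None)
without_arg = pd.read_csv('iris.csv')
print with_arg.shape     # (150, 5)
print without_arg.shape  # (149, 5) -- first row consumed as headers
```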

```python
In [26]: # separating the 4 feature columns X from the classification column y
         X = Iris_data.iloc[0:150, 0:4]
         y = Iris_data.iloc[0:150, 4]
         print X
         print y
```

This seems like a simple way to define X and y: pinpointing their locations in the data frame.
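Just as a side note (my sketch, not a correction): since the frame has exactly 150 rows, the row range could also be left implicit:

```python
# Equivalent slicing with the row range left implicit:
X = Iris_data.iloc[:, 0:4]  # all rows, first 4 columns (features)
y = Iris_data.iloc[:, 4]    # all rows, 5th column (species label)
```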

```python
In [27]: # Problem 1: Implement KNN classification, using the sklearn package
         from sklearn.cross_validation import train_test_split
         from sklearn.neighbors import KNeighborsClassifier
```

Looks good. Grouping the from ... import statements makes a lot of sense.

```python
In [28]: # implementing KNN classification function
         def knn(X_train, X_test, y_train, y_test, k):
             myknn = KNeighborsClassifier(k).fit(X_train, y_train)
             return myknn.score(X_test, y_test)
```

This is interesting. If I follow, k is the number of neighbors handed to KNeighborsClassifier, myknn is that classifier fit on the training data, and the return statement sends the test-set accuracy back to whoever called knn.
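To make sure I'm reading it right, here is the same function with my own comments added (the annotations are mine, not Matt's):

```python
def knn(X_train, X_test, y_train, y_test, k):
    # k becomes n_neighbors, the number of nearest neighbors consulted
    myknn = KNeighborsClassifier(k).fit(X_train, y_train)
    # score() is the mean accuracy on the held-out test set, so the
    # caller gets back a single number between 0 and 1
    return myknn.score(X_test, y_test)
```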

```python
In [29]: # testing knn function
         X_train, X_test, y_train, y_test = train_test_split(
             X, y, test_size=0.20, random_state=0)
         print knn(X_train, X_test, y_train, y_test, 3)
0.966666666667
```

Looks good. So you never need to touch myknn at the call site; because of the return statement, printing the function call is enough.

```python
In [30]: # Problem 2: Implement cross-validation for your KNN classifier.
         from sklearn.cross_validation import KFold
```

Importing what the next step needs.

```python
In [31]: kf = KFold(len(y), n_folds=5, shuffle=True)
         for train, test in kf:
             # calling function knn from prob1 for each of 5 folds
             print knn(X.iloc[train], X.iloc[test], y.iloc[train], y.iloc[test], 3)
0.933333333333
0.966666666667
1.0
0.966666666667
0.966666666667
```

Here you define your K-fold splitter as kf, using 5 folds for cross-validation. Ah! Calling the knn function from Problem 1 inside the loop is awesome, and it ties back to your style of defining X and y from the data frame with .iloc.
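As a side note, I believe sklearn can also collapse this loop with cross_val_score; a minimal sketch with the same X, y, and k=3 (mine, untested against your notebook):

```python
from sklearn.cross_validation import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# cv=5 builds the folds internally and returns one accuracy per fold.
# Note: for classifiers cv=5 uses stratified, unshuffled folds, so
# the numbers can differ slightly from the shuffled KFold above.
scores = cross_val_score(KNeighborsClassifier(3), X, y, cv=5)
print scores
print scores.mean()
```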

```python
In [70]: # Problem 3: Use your KNN classifier and cross-validation code from (1) and (2) above to
         # determine the optimal value of K (number of nearest neighbors to consult)
```

Sorry, I need more time with this one. It looks good; I want to practice these shortcuts myself. Below is my sketch of how I understand the pieces fit together.
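This is not your code, just my guess at how (1) and (2) combine to pick the optimal K (variable names are mine):

```python
import numpy as np
from sklearn.cross_validation import KFold

# Score each candidate K with 5-fold cross-validation using the
# knn() function from Problem 1, then keep the best average.
k_candidates = range(1, 31)
mean_scores = []
for k in k_candidates:
    kf = KFold(len(y), n_folds=5, shuffle=True, random_state=0)
    fold_scores = [knn(X.iloc[train], X.iloc[test],
                       y.iloc[train], y.iloc[test], k)
                   for train, test in kf]
    mean_scores.append(np.mean(fold_scores))

best_k = k_candidates[np.argmax(mean_scores)]
print best_k
```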

```python
In [68]: # Problem 4: Using matplotlib, plot classifier accuracy versus the hyperparameter K
         # for a range of K that you consider interesting
```

Makes sense. Good job!
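For what it's worth, plotting the scores from my Problem 3 sketch above might look something like this (again mine, reusing k_candidates and mean_scores):

```python
import matplotlib.pyplot as plt

# Cross-validated accuracy as a function of K
plt.plot(k_candidates, mean_scores, marker='o')
plt.xlabel('K (number of neighbors)')
plt.ylabel('cross-validated accuracy')
plt.title('KNN accuracy vs. K on the iris data')
plt.show()
```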

@ghego, @craigsakuma, @kebaler