Matt,
In [24]: import pandas as pd
         import numpy as np

In [25]: Iris_data = pd.read_csv('iris.csv', header=None)
Up to this point you have imported the data into a pandas data frame. It is useful for me to see that you used header=None, so the first line of the data stays as data instead of being read in as column names.
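Just to spell out for myself what that flag does (a tiny sketch with my own variable names, nothing copied from your notebook):

    # header=None keeps the first row as data and labels the columns 0..4
    iris = pd.read_csv('iris.csv', header=None)
    # without it, pandas would promote the first flower's measurements to column names
    # iris_bad = pd.read_csv('iris.csv')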
In [26]: #separating the 4 feature columns X from the classification column y
         X = Iris_data.iloc[0:150, 0:4]
         y = Iris_data.iloc[0:150, 4]
         print X
         print y
This seems like a simple way to define X and y, by pinpointing their locations in the data frame.
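If it helps, I believe the same slices can be written with open-ended row selection, which reads a little more generally (my own variant, not from your notebook):

    # all rows: first four columns as features, last column as the label
    X = Iris_data.iloc[:, 0:4]
    y = Iris_data.iloc[:, 4]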
In [27]: #Problem 1: Implement KNN classification, using the sklearn package
         from sklearn.cross_validation import train_test_split
         from sklearn.neighbors import KNeighborsClassifier
Looks good. Grouping the from ... import statements makes a lot of sense.
In [28]: #implementing KNN classification function
         def knn(X_train, X_test, y_train, y_test, k):
             myknn = KNeighborsClassifier(k).fit(X_train, y_train)
             return myknn.score(X_test, y_test)
This is interesting. If I read it correctly, k is passed straight through to KNeighborsClassifier as the number of neighbors, myknn holds the classifier after it is fit on the training split, and the return statement hands the test-set accuracy back to whatever code calls knn.
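A quick sketch of what I think that one-liner expands to, just to spell it out for myself (the variable names here are mine):

    # the positional argument is n_neighbors, so this matches your call
    myknn = KNeighborsClassifier(n_neighbors=k)
    myknn.fit(X_train, y_train)              # learn from the training split
    accuracy = myknn.score(X_test, y_test)   # mean accuracy on the held-out split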
In [29]:
testing knn function
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.20,random_state=0) print knn(X_train, X_test, y_train, y_test,3) 0.966666666667
Looks good. So you do not need myknn as the last statement of the cell, because the return statement already hands the score back and print displays it.
In [30]: #Problem 2: Implement crossvalidation for your KNN classifier.
         from sklearn.cross_validation import KFold
Importing what you need for the next step.
In [31]: kf = KFold(len(y), n_folds=5, shuffle=True)
         for train, test in kf:
             #calling function knn from prob1 for each of 5 folds
0.933333333333
0.966666666667
1.0
0.966666666667
0.966666666667
Here you define your K-fold splitter as kf and use 5 folds for your cross-validation. Ah! Calling the knn function from Problem 1 inside the loop is awesome, and it ties back to your style of defining X and y from the data frame.
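For what it's worth, this is the shape I imagine the loop body has; the .iloc indexing and k=3 are my guesses, not copied from your notebook:

    kf = KFold(len(y), n_folds=5, shuffle=True)
    for train, test in kf:
        # KFold yields integer positions, so .iloc pulls out each fold
        print knn(X.iloc[train], X.iloc[test], y.iloc[train], y.iloc[test], 3)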
In [70]: #Problem 3: Use your KNN classifier and crossvalidation code from (1) and (2) above to
         #determine the optimal value of K (number of nearest neighbors to consult)
Sorry, I need more time on this one. It looks good; I want to practice these shortcuts myself.
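To convince myself I follow the idea, here is the sketch I would try for Problem 3, entirely my own and reusing your knn function and kf splitter from above: loop over candidate K values, average the fold scores, and keep the best.

    best_k, best_score = None, 0.0
    for k in range(1, 31):
        # average the 5 fold accuracies for this K
        scores = [knn(X.iloc[train], X.iloc[test], y.iloc[train], y.iloc[test], k)
                  for train, test in kf]
        if np.mean(scores) > best_score:
            best_k, best_score = k, np.mean(scores)
    print best_k, best_score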
In [68]: #Problem 4: Using matplotlib, plot classifier accuracy versus the hyperparameter K for a range
         #of K that you consider interesting
Makes sense. Good job!
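And a rough guess at what the Problem 4 plot could look like in code, again my own sketch building on the pieces above, with K on the x-axis and mean cross-validated accuracy on the y-axis:

    import matplotlib.pyplot as plt
    ks = range(1, 31)
    accs = [np.mean([knn(X.iloc[train], X.iloc[test], y.iloc[train], y.iloc[test], k)
                     for train, test in kf])
            for k in ks]
    plt.plot(ks, accs, marker='o')
    plt.xlabel('K (number of neighbors)')
    plt.ylabel('cross-validated accuracy')
    plt.show()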
@ghego, @craigsakuma, @kebaler