michaelschulerjr / DAT_SF_10

Repository for data science 10 course

HW 2 Review by Otto S #2

Open ostegm opened 9 years ago

ostegm commented 9 years ago

Hey Michael,

For the review of your HW, I'm going to reference snippets of code by the labels on the "In" and "Out" cells. I'm using the numbers here:

http://nbviewer.ipython.org/github/michaelschulerjr/DAT_SF_10/blob/master/Homework/HW2/HW_2.ipynb

Importing and Cleaning:

  1. Reference: "In [97]:" I noticed you imported the whole file (including the NaN row) and then removed it later. That's great; another way to do it is to import only the 150 data rows using code like this:
df = pd.read_csv('iris.csv', names = column_labels, nrows =150)

Both ways work - just figured I'd give you one more tool in the toolbox!

  2. Great job on converting the classifiers to integers. That took me a while, but it looks like you got that part easily. There are many ways to do this step as well. I did it slightly differently - converting the string labels into integers within a column of the dataframe. Not sure which is better, but both work!
#including an extra empty column to hold integer values for each class
column_labels = ['Sepal Length','Sepal Width', 'Petal Length', 'Petal Width', 'Class', 'Class_label']
df = pd.read_csv('iris.csv', names = column_labels, nrows =150)
#Set up the class_label column to hold integer values for each class
df['Class_label'] = df['Class']
df.Class_label.replace(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'],[1,2,3],inplace=True)
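As a quick sanity check, the replace trick above can be tried on a tiny toy frame (the values below are made up, just to show the conversion):

```python
import pandas as pd

# Toy stand-in for the iris data - hypothetical values, one row per class
df = pd.DataFrame({
    'Sepal Length': [5.1, 7.0, 6.3],
    'Class': ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'],
})

# Copy the string labels, then map each class name to an integer
df['Class_label'] = df['Class']
df['Class_label'] = df['Class_label'].replace(
    ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], [1, 2, 3])

print(df['Class_label'].tolist())  # [1, 2, 3]
```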

Step 1 - Implementing KNN Scoring

Nothing to say here - looks like you did it well.

Step 2 Implement Cross Validation Using KFolds

Same as above, no real comments here

Step 3. Determine Ideal Value for K

I think your code got hung up in the list comprehension. You had the right idea, but I think it has to be split onto multiple lines so you can calculate the score for each item and then append it to the list. Here's a way you could make it work:

knn_scores = []
for i in range(1, 151, 2):
    score = crossValidate(X, y, KNeighborsClassifier(i).fit, 5)
    knn_scores.append([i,score])

#turn the resulting list into a dataframe for viewing    
cross_val = pd.DataFrame(knn_scores, columns=["neighbors", "scores"])    

#find the max in the dataframe
max_scores = cross_val[cross_val['scores'] == cross_val['scores'].max()]
max_scores
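For reference, the same search can also be sketched with scikit-learn's built-in cross_val_score instead of the homework's own crossValidate helper - this is just a sketch of the idea, not the assignment's required approach:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Try odd values of K and record the mean 5-fold accuracy for each
scores = []
for k in range(1, 31, 2):
    acc = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    scores.append((k, acc))

# Pick the K with the highest mean accuracy
best_k, best_acc = max(scores, key=lambda pair: pair[1])
print(best_k, round(best_acc, 3))
```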

Step 4 Plot Results of Increasing Values of K:

I think you were close - but because your code in step 3 didn't work, the plot was not correct. Below is the code I used to plot the results of step 3. I found this plotting portion easy because I had already made the data into a dataframe with two columns (neighbors and scores):

%matplotlib inline
#Plotting the results of testing K values between 1 and 149
plt.plot(cross_val.neighbors, cross_val.scores)
plt.title('Plot of Accuracy with increasing values of K')

Bonus: Is there an optimal number of folds

I think this would be pretty easy for you to figure out now. Essentially, it's a repeat of steps 3 and 4 above, but instead of varying the number of neighbors, keep neighbors static and run a for loop testing the number of folds.
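The fold-count loop could look something like this (a minimal sketch using scikit-learn's cross_val_score rather than the homework's crossValidate, and the choice of 5 neighbors is just an example):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Hold K fixed and vary the number of folds instead
knn = KNeighborsClassifier(n_neighbors=5)
fold_scores = [(folds, cross_val_score(knn, X, y, cv=folds).mean())
               for folds in range(2, 11)]

for folds, acc in fold_scores:
    print(folds, round(acc, 3))
```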

Happy to post the code I used if you want it. Just let me know: o.stegmaier@gmail.com

Email me with any questions!

Thanks, Otto

@michaelschulerjr, @ghego, @craigsakuma, @kebaler

michaelschulerjr commented 9 years ago

Hey Otto,

Thanks for all the feedback! I'm going to try to rerun my code with some of these suggestions.

See you tomorrow