jschulberg / Dog-Returns

A data science analysis to classify whether or not an adopted dog will be returned.

Classification: Logistic Regression #26

Closed jschulberg closed 2 years ago

rkelley05 commented 2 years ago

Confusion Matrix completed, working on code for plotting classification boundaries now.

jschulberg commented 2 years ago

Logistic Regression appeared to be an enticing classifier. If it proved accurate, we could leverage not just the predicted labels, but also the associated probabilities of return (calculated from the log-odds). While testing the Logistic Regression classifier, we tried various solvers and values of C (the inverse of the regularization strength) to find the best-fitting combination.

Even with our optimal values, we only managed an accuracy score of ~70%, and many of the misclassified dogs were the ones we most want to avoid: False Negatives (dogs that were returned but predicted not to be).
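The solver/C search described above can be sketched with scikit-learn's `GridSearchCV`. This is a minimal illustration on synthetic data, not the repo's actual code; the real features (num_colors, etc.) live in the project's dataset:

```python
# Sketch of tuning solver and C for LogisticRegression via cross-validated
# grid search. Synthetic data stands in for the real dog-adoption features.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Candidate solvers and inverse regularization strengths (C)
param_grid = {
    "solver": ["lbfgs", "liblinear", "newton-cg"],
    "C": [0.01, 0.1, 1, 10, 100],
}
grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
grid.fit(X_train, y_train)

accuracy = grid.score(X_test, y_test)
print(grid.best_params_, round(accuracy, 3))
```

`grid.best_estimator_` then exposes both `.predict()` for labels and `.predict_proba()` for the return probabilities mentioned above.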

rkelley05 commented 2 years ago

PCA classifier boundary plotting is complete, but it doesn't give great insight into the classifier.
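For reference, one common way to draw such a boundary is to project the features onto 2 PCA components, fit the classifier in that plane, and shade a prediction grid. This is an assumed sketch of the general technique, not the repo's plotting code:

```python
# Sketch: plot a logistic regression decision boundary in 2-D PCA space.
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs in scripts
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Reduce to 2 components so the boundary can be drawn in a plane
X_2d = PCA(n_components=2).fit_transform(X)
clf = LogisticRegression().fit(X_2d, y)

# Predict over a grid covering the projected points
xx, yy = np.meshgrid(
    np.linspace(X_2d[:, 0].min() - 1, X_2d[:, 0].max() + 1, 200),
    np.linspace(X_2d[:, 1].min() - 1, X_2d[:, 1].max() + 1, 200),
)
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.3)
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, s=10)
plt.savefig("pca_boundary.png")
```

The limited insight noted above is expected: the boundary lives in the projected 2-D space, and with 30+ original features the first two components capture only part of the variance.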

jschulberg commented 2 years ago

Here's the equation for the logistic regression, where y = 1 / (1 + e^(-z)) and z is the linear combination:

z = 2.0427 +
-0.4804 * num_colors +
0.4013 * contains_black +
-0.0909 * contains_white +
-0.0566 * contains_yellow +
-0.5011 * contains_dark +
-0.304 * WEIGHT2 +
6.3962 * Age at Adoption (days) +
-0.1703 * is_retriever +
-0.263 * is_shepherd +
0.2814 * is_terrier +
0.3284 * is_husky +
0.0813 * is_other_breed +
0.5302 * num_behav_issues +
-0.4347 * puppy_screen +
0.6957 * needs_play +
-0.3203 * no_apartments +
0.7458 * energetic +
-0.6044 * shyness +
2.363 * needs_training +
-0.5881 * BULLY_SCREEN +
0.0347 * BULLY_WARNING +
0.2789 * OTHER_WARNING +
-0.2688 * CATS_LIVED_WITH +
-0.8674 * CATS_TEST +
0.0102 * KIDS_FIXED +
-2.1697 * DOGS_IN_HOME +
0.3511 * DOGS_REQ +
-0.0449 * has_med_issues +
0.1902 * demodex +
-0.0488 * leg_issues +
1.4557 * dental_issues +
0.1397 * weight_issues +
0.0274 * HW_FIXED
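Plugging the linear combination z into the sigmoid gives the return probability. A minimal sketch using a subset of the coefficients above (the feature values here are made up for illustration):

```python
# Turn z into a return probability via the sigmoid y = 1 / (1 + e^(-z)).
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

intercept = 2.0427
# Subset of the fitted coefficients from the equation above
coefs = {"num_colors": -0.4804, "contains_black": 0.4013, "needs_training": 2.363}
# Hypothetical dog: two colors, contains black, no training needs
features = {"num_colors": 2, "contains_black": 1, "needs_training": 0}

z = intercept + sum(coefs[k] * features[k] for k in coefs)
prob_return = sigmoid(z)
print(round(prob_return, 4))
```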

jschulberg commented 2 years ago

Exponentiated the coefficients in the above equation to get the odds ratio (multiplicative weight) for each feature. Here's the plot of the top 10:
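The exponentiation step can be sketched directly from the coefficients listed in the equation above (e^coef gives the odds ratio per unit increase in the feature):

```python
# Exponentiate each fitted coefficient to get its odds-ratio weight,
# then pull the top 10. Coefficients copied from the equation above.
import math

coefs = {
    "num_colors": -0.4804, "contains_black": 0.4013, "contains_white": -0.0909,
    "contains_yellow": -0.0566, "contains_dark": -0.5011, "WEIGHT2": -0.304,
    "Age at Adoption (days)": 6.3962, "is_retriever": -0.1703,
    "is_shepherd": -0.263, "is_terrier": 0.2814, "is_husky": 0.3284,
    "is_other_breed": 0.0813, "num_behav_issues": 0.5302,
    "puppy_screen": -0.4347, "needs_play": 0.6957, "no_apartments": -0.3203,
    "energetic": 0.7458, "shyness": -0.6044, "needs_training": 2.363,
    "BULLY_SCREEN": -0.5881, "BULLY_WARNING": 0.0347, "OTHER_WARNING": 0.2789,
    "CATS_LIVED_WITH": -0.2688, "CATS_TEST": -0.8674, "KIDS_FIXED": 0.0102,
    "DOGS_IN_HOME": -2.1697, "DOGS_REQ": 0.3511, "has_med_issues": -0.0449,
    "demodex": 0.1902, "leg_issues": -0.0488, "dental_issues": 1.4557,
    "weight_issues": 0.1397, "HW_FIXED": 0.0274,
}

odds = {name: math.exp(c) for name, c in coefs.items()}
top10 = sorted(odds.items(), key=lambda kv: kv[1], reverse=True)[:10]
for name, ratio in top10:
    print(f"{name}: {ratio:.3f}")
```

Sorting by the exponentiated value preserves the coefficient ordering, so the largest positive coefficients (e.g. Age at Adoption, needs_training) dominate the top 10.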