Closed by jschulberg 2 years ago
KNN to impute missing values:
https://machinelearningmastery.com/knn-imputation-for-missing-values-in-machine-learning/
Imputed missing values with KNN. Still need to decide whether we should scale the data prior to imputation.
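To make the scaling question concrete, here is a minimal plain-Python sketch of KNN imputation with an optional min-max scaling step. It is not the code used for this issue: the function names, the toy rows, and the choice of min-max scaling are all assumptions for illustration. In practice a library implementation such as scikit-learn's `KNNImputer` would likely be the better choice. Distances are computed in the (optionally) scaled space, but the imputed value is averaged from the original, unscaled values.

```python
import math

def min_max_scale(rows):
    """Scale every column to [0, 1], ignoring missing (None) entries."""
    bounds = []
    for j in range(len(rows[0])):
        vals = [r[j] for r in rows if r[j] is not None]
        bounds.append((min(vals), max(vals)))
    return [
        [None if v is None else ((v - lo) / (hi - lo) if hi > lo else 0.0)
         for v, (lo, hi) in zip(r, bounds)]
        for r in rows
    ]

def distance(a, b):
    """Root-mean-square difference over the columns both rows have observed."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return math.inf
    return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

def knn_impute(rows, k=2, scale=True):
    """Fill each None with the column mean over the k nearest rows that have a value."""
    space = min_max_scale(rows) if scale else rows
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, v in enumerate(row):
            if v is not None:
                continue
            donors = sorted(
                ((distance(space[i], space[d]), rows[d][j])
                 for d in range(len(rows)) if d != i and rows[d][j] is not None),
                key=lambda t: t[0],
            )
            nearest = [val for _, val in donors[:k]]
            if nearest:  # leave the gap if no row can donate a value
                filled[i][j] = sum(nearest) / len(nearest)
    return filled

# Hypothetical toy rows: [weight, age in days]; the last dog has no age.
rows = [[10.0, 100], [12.0, 110], [50.0, 1000], [55.0, 1100], [11.0, None]]
imputed = knn_impute(rows, k=2, scale=True)
```

Without scaling, a column with a large range (age in days) would dominate the distance over a small-range column (weight), which is the main argument for scaling first.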
@rkelley05 Here's the convo with Julie about where to impute NAs:
SEX_Male -- 6th value (M for 'male' or F for 'female')
SEX_Female -- 6th value (M for 'male' or F for 'female')
multi_color -- Not a standard way of denoting the colors (kids just come up with their own values)
num_colors -- Not a standard way of denoting the colors (kids just come up with their own values)
MIX_BOOL -- Don't use this because every dog is a mix. Sometimes they put 'Mix' just to get them adopted faster.
contains_black -- Not a standard way of denoting the colors (kids just come up with their own values)
contains_white -- Not a standard way of denoting the colors (kids just come up with their own values)
contains_yellow -- Not a standard way of denoting the colors (kids just come up with their own values)
WEIGHT2 -- Impute with KNN
Age at Adoption (days) -- Impute with KNN
is_retriever
is_shepherd
is_other_breed
num_behav_issues
puppy_screen -- If it doesn't say puppy screen, check the age. If it's less than 6 months old, it's a puppy.
new_this_week -- Delete this
needs_play
no_apartments -- Use imputation for this
energetic -- 0 if not specified in BEHAVIORAL NOTES
shyness -- 0 if not specified in BEHAVIORAL NOTES
needs_training -- Use imputation for this
BULLY_SCREEN -- 0 if not specified
BULLY_WARNING -- 0 if not specified
OTHER_WARNING -- 0 if not specified
CATS_LIVED_WITH -- 1 if not specified, but could try imputation
CATS_TEST -- 1 if not specified (good with cats), but could try imputation
KIDS_FIXED -- Impute for missing values. Also unsure about how caution should be treated, so consider imputing those values as well
DOGS_IN_HOME -- 0 if not specified (if they don't know, they assume they're good with dogs)
DOGS_REQ -- 0 if not specified (if they don't know, they assume they're good with dogs)
has_med_issues
diarrhea -- REMOVE THIS (They all get diarrhea)
ehrlichia
uri -- REMOVE THIS
ear_infection
tapeworm -- REMOVE THIS
general_infection -- REMOVE THIS
demodex (skin condition)
car_sick -- 0 if not specified
dog_park -- REMOVE THIS (not consistent)
leg_issues
anaplasmosis
treated_vaccinated -- REMOVE THIS
HW_FIXED
FT_FIXED -- REMOVE THIS
spay_neutered -- REMOVE THIS (all dogs are spayed/neutered)
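The rules in the list above could be applied roughly as follows. This is only a sketch under several assumptions: records are plain dicts with `None` for a missing value, "6 months" is approximated as 183 days, `MIX_BOOL` is dropped because the note says not to use it, and the constant and function names are made up for illustration. Columns marked for KNN imputation are left untouched here.

```python
# Columns the notes say to remove (MIX_BOOL included by assumption: "Don't use this").
DROP = {
    "MIX_BOOL", "new_this_week", "diarrhea", "uri", "tapeworm",
    "general_infection", "dog_park", "treated_vaccinated",
    "FT_FIXED", "spay_neutered",
}
# Columns that default to 0 when not specified.
ZERO_IF_MISSING = [
    "energetic", "shyness", "BULLY_SCREEN", "BULLY_WARNING",
    "OTHER_WARNING", "DOGS_IN_HOME", "DOGS_REQ", "car_sick",
]
# Columns that default to 1 when not specified (imputation is the alternative).
ONE_IF_MISSING = ["CATS_LIVED_WITH", "CATS_TEST"]

AGE_COL = "Age at Adoption (days)"
PUPPY_MAX_AGE_DAYS = 183  # assumption: "6 months" approximated in days

def apply_rules(record):
    """Return a cleaned copy of one dog record; the input dict is not modified."""
    out = {k: v for k, v in record.items() if k not in DROP}
    for col in ZERO_IF_MISSING:
        if col in out and out[col] is None:
            out[col] = 0
    for col in ONE_IF_MISSING:
        if col in out and out[col] is None:
            out[col] = 1
    # puppy_screen: fall back to the age rule when the flag is missing.
    if "puppy_screen" in out and out["puppy_screen"] is None:
        age = out.get(AGE_COL)
        if age is not None:
            out["puppy_screen"] = 1 if age < PUPPY_MAX_AGE_DAYS else 0
    return out

# Hypothetical record for illustration only.
example = {
    "diarrhea": 1, "energetic": None, "CATS_TEST": None,
    "puppy_screen": None, AGE_COL: 90, "WEIGHT2": None,
}
cleaned = apply_rules(example)
```

If the age itself is missing, `puppy_screen` stays empty, so the age would need to be imputed before this rule can resolve every row.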
So there are a few columns that we're trying to use for predictive purposes that have NAs in them. Maybe we can use a K-Nearest Neighbor algorithm to predict what their value should be?
Total Records = 10489
Number of NAs by column:
multi_color = 568
num_colors = 568
contains_black = 568
contains_white = 568
contains_yellow = 568
MIX_BOOL = 131
WEIGHT2 = 538
Age at Adoption (days) = 4687
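For reference, counts like the ones above can be reproduced with a small helper. This is a sketch only: it assumes records are dicts with `None` marking a missing value, and the function name and toy data are hypothetical.

```python
def na_summary(records):
    """Return {column: (na_count, na_share)} for columns with at least one missing value."""
    total = len(records)
    counts = {}
    for rec in records:
        for col, val in rec.items():
            if val is None:
                counts[col] = counts.get(col, 0) + 1
    return {col: (n, n / total) for col, n in counts.items()}

# Hypothetical toy data, not the real records.
toy = [
    {"WEIGHT2": 42.0, "MIX_BOOL": 1},
    {"WEIGHT2": None, "MIX_BOOL": 1},
    {"WEIGHT2": None, "MIX_BOOL": None},
    {"WEIGHT2": 17.5, "MIX_BOOL": 0},
]
summary = na_summary(toy)
```

On the real figures, Age at Adoption (days) is missing in 4687 of 10489 records, roughly 45%, while the other columns are missing in about 1% to 5% of records. That is worth keeping in mind when judging how much to trust KNN-imputed ages.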