janhurst / unisa-tbi

Decision Support Tool for suspected Traumatic Brain Injuries
https://unisa-tbi.azurewebsites.net

Remove all NaNs from dataset #12

Closed janhurst closed 4 years ago

janhurst commented 4 years ago

Most machine learning techniques require all variables to have a value (i.e. no NaNs)

It is helpful to manually remove the NaNs, but we could also use something like scikit-learn's SimpleImputer and fill each variable with its most frequent value.

Imputation will work OK when there are only a few NaNs, but some of our columns have large numbers of NaNs, or are ones where the most frequent value might not make sense.
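
As a rough illustration of the SimpleImputer idea, here is a minimal sketch. The column names are borrowed from this thread, but the values are made up and the real dataset will need its own handling.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame: column names come from the thread, values are invented.
df = pd.DataFrame({
    "Dizzy": [0.0, 1.0, np.nan, 0.0, 0.0],
    "ActNorm": [1.0, np.nan, 1.0, 0.0, 1.0],
})

# Replace each NaN with the most frequent value of its column.
imputer = SimpleImputer(strategy="most_frequent")
imputed = pd.DataFrame(
    imputer.fit_transform(df),
    columns=df.columns,
    index=df.index,
)

print(imputed)
```

Note that this fills every NaN with the column mode regardless of how many values are missing, which is exactly why it is questionable for the sparser columns.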

karthikkunala commented 4 years ago

Below are the missing-record statistics for each variable. Can we remove variables based on the % of records missing, or based on their importance?

| Variable | Number of missing records | % of records missing |
| --- | --- | --- |
| Observed | 2376 | 5.474654378 |
| LocLen | 2556 | 5.889400922 |
| Race | 3208 | 7.391705069 |
| ActNorm | 3335 | 7.684331797 |
| Ethnicity | 15966 | 36.78801843 |
| Dizzy | 15972 | 36.80184332 |

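
Statistics like these can be computed with pandas, and a percentage cut-off can then be applied. The sketch below uses a toy frame with made-up values. The 30% threshold is only a hypothetical example, not something agreed in this thread.

```python
import numpy as np
import pandas as pd

# Toy frame: column names come from the thread, values are invented.
df = pd.DataFrame({
    "Observed": [1, 0, np.nan, 1],
    "Dizzy": [np.nan, np.nan, 0, 1],
    "Race": [1, 2, 3, 1],
})

# Count and percentage of missing records per variable.
missing = pd.DataFrame({
    "n_missing": df.isna().sum(),
    "pct_missing": df.isna().mean() * 100,
}).sort_values("pct_missing")

# Hypothetical cut-off: drop variables with more than 30% missing.
THRESHOLD_PCT = 30.0
to_drop = missing.index[missing["pct_missing"] > THRESHOLD_PCT].tolist()
reduced = df.drop(columns=to_drop)

print(missing)
print("Dropped:", to_drop)
```

Dropping by percentage alone ignores how important a variable is, so the cut-off would need to be checked against domain knowledge.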
janhurst commented 4 years ago

@karthikkunala just edited your comment so it is easier to read, hope that is ok

janhurst commented 4 years ago

Something @chauhan-bhavya said got me thinking, and I think she has a really good idea :)

Seeing as almost all of our data is categorical, if we set NaNs as a new category (I think she set it to 95? or 99?), then after the variable is one-hot encoded we will end up with an "Unknown" binary column. We could then simply drop these Unknown columns.

This will only really work for non-binary variables, as one-hot encoding a binary variable doesn't really help (you should really drop one of the two columns, and you just end up with the same data).
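
A minimal sketch of the idea, using pandas `get_dummies` as a stand-in for the one-hot encoding step. The sentinel value 99 is an assumption (the actual code used is not confirmed above), and the data is made up.

```python
import numpy as np
import pandas as pd

UNKNOWN = 99  # hypothetical sentinel code for "missing"

# Toy frame: column names come from the thread, values are invented.
df = pd.DataFrame({
    "LocLen": [1, 2, np.nan, 3],
    "Race": [1, np.nan, 2, 1],
})

# Step 1: treat NaN as its own category.
filled = df.fillna(UNKNOWN).astype(int)

# Step 2: one-hot encode every column, giving names like "LocLen_99".
encoded = pd.get_dummies(filled, columns=list(filled.columns), dtype=int)

# Step 3: drop the "Unknown" indicator columns.
unknown_cols = [c for c in encoded.columns if c.endswith(f"_{UNKNOWN}")]
encoded = encoded.drop(columns=unknown_cols)

print(encoded)
```

After the drop, a row that was missing a variable has zeros in all of that variable's remaining indicator columns.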

I'm not sure how many variables this applies to but I'm going to work on it at some point over the weekend.