janhurst / unisa-tbi

Decision Support Tool for suspected Traumatic Brain Injuries
https://unisa-tbi.azurewebsites.net

Better feature engineering methods #23

Closed karthikkunala closed 4 years ago

karthikkunala commented 4 years ago

I am exploring different feature engineering methods for the categorical variables. I am currently reviewing the article https://towardsdatascience.com/smarter-ways-to-encode-categorical-data-for-machine-learning-part-1-of-3-6dca2f71b159 and will implement whichever method works best for our dataset.

@janhurst has implemented OneHotEncoder using sklearn, which is a good solution. However, the output gets converted to a NumPy array, so we then need to rename a column and convert back to pandas.

I found a pandas function that does the same thing, get_dummies. It's just one line of code and handles the whole task. Kindly review it, and if it suits us we can consider it for the final model.

pd.get_dummies(TBI)
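
As a rough illustration of what that one-liner does, here is a minimal sketch on a tiny stand-in dataframe. The column names and values below are invented for the example and are not taken from the actual TBI dataset.

```python
import pandas as pd

# Hypothetical stand-in for the TBI dataframe; columns and values are invented.
tbi = pd.DataFrame({
    "age": [4, 11, 7],
    "injury_mechanism": ["fall", "sport", "fall"],
})

# get_dummies one-hot encodes object/category columns and leaves numeric
# columns such as "age" untouched. New columns are named <column>_<category>.
encoded = pd.get_dummies(tbi)

print(list(encoded.columns))
print(encoded)
```

The result is already a pandas DataFrame, so no conversion back from a NumPy array is needed.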

Please have a look. Meanwhile, I will explore other methods.

janhurst commented 4 years ago

> @janhurst has implemented OneHotEncoder using sklearn which is a good solution. It gets converted to numpy array and then we need to rename a column and also convert back to pandas.

pd.get_dummies and OneHotEncoder are kinda the same. I'm being lazy with the rename stuff, but you will need the same logic with get_dummies if you want to rename the columns.

doughnuted commented 4 years ago

I think there's an additional difference with the pandas method in terms of the default output, but I can't recall it right now: https://stackoverflow.com/questions/36631163/pandas-get-dummies-vs-sklearns-onehotencoder-what-are-the-pros-and-cons

janhurst commented 4 years ago

I had a play and changed to using get_dummies here https://github.com/janhurst/capstone/blob/jan/notebooks/02-data-cleaning.ipynb

It's appending the category labels to the column names by default, which is nice and what I was trying to preserve. OneHotEncoder is nice in a pipeline, but we kinda want to inspect the results and manipulate them a bit by hand.

karthikkunala commented 4 years ago

Yeah, we will compare them and implement whichever is better and simpler.