Closed karthikkunala closed 4 years ago
@janhurst has implemented OneHotEncoder using sklearn which is a good solution. It gets converted to numpy array and then we need to rename a column and also convert back to pandas.
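For reference, the OneHotEncoder round trip described above looks roughly like the sketch below. It is not the notebook's actual code: the `sex` and `age` columns are made-up toy data, and the `sex_` prefix is just one possible naming choice.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data; the real TBI columns are not shown in this thread.
df = pd.DataFrame({"sex": ["M", "F", "M"], "age": [3, 7, 5]})

# Fit the encoder on the categorical column and densify the sparse result
# into a plain numpy array (this is where the column names are lost).
encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(df[["sex"]]).toarray()

# Rebuild readable column names from the fitted categories.
columns = [f"sex_{cat}" for cat in encoder.categories_[0]]

# Convert back to pandas, keeping the original index so rows line up.
encoded_df = pd.DataFrame(encoded, columns=columns, index=df.index)
result = pd.concat([df.drop(columns=["sex"]), encoded_df], axis=1)

print(result)
```

The extra naming and `pd.concat` steps are the "rename and convert back" overhead mentioned above.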
pd.get_dummies and OneHotEncoder are kinda the same. I'm being lazy with the rename stuff, but you will need the same logic with get_dummies if you want to rename the columns.
I think there's an additional difference in the pandas method's default output, but I can't recall it right now: https://stackoverflow.com/questions/36631163/pandas-get-dummies-vs-sklearns-onehotencoder-what-are-the-pros-and-cons
I had a play and changed to using get_dummies here https://github.com/janhurst/capstone/blob/jan/notebooks/02-data-cleaning.ipynb
It appends the category labels to the column names by default, which is nice and is what I was trying to preserve. OneHotEncoder is nice in a pipeline, but we kinda want to inspect the results and manipulate them a bit by hand.
Yeah, we will compare and implement whichever is good and simple.
I am exploring different feature engineering methods for the categorical variables. I am currently reviewing the article https://towardsdatascience.com/smarter-ways-to-encode-categorical-data-for-machine-learning-part-1-of-3-6dca2f71b159 and will implement whichever method works best for our dataset.
pandas has a function for the same encoding, get_dummies. It's just one line of code and handles the whole task. Kindly review it, and if it suits us we can consider it for the final model.
pd.get_dummies(TBI)
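A minimal sketch of what that call does, using a made-up frame in place of TBI (the `sex` and `age` columns are hypothetical). With no `columns` argument, get_dummies encodes every object, string, or category column; passing `columns` and `prefix` controls which columns are encoded and how the new ones are named.

```python
import pandas as pd

# Hypothetical stand-in for the TBI DataFrame.
df = pd.DataFrame({"sex": ["M", "F", "M"], "age": [3, 7, 5]})

# Default naming: "<original column>_<category label>".
dummies = pd.get_dummies(df, columns=["sex"])

# Custom naming via prefix, which avoids a separate rename step.
renamed = pd.get_dummies(df, columns=["sex"], prefix="gender")

print(dummies.columns.tolist())
print(renamed.columns.tolist())
```

The result stays a DataFrame throughout, so it can be inspected and edited by hand without converting from numpy.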
Have a look. Meanwhile, I will explore other methods.