sassoftware / sas-viya-dmml-pipelines

Code examples and supporting materials for data mining and machine learning techniques on the SAS Viya environment.
Apache License 2.0

Question about creating dummies #2

Closed: ruqianq closed this issue 3 years ago

ruqianq commented 3 years ago

Hi, thanks for sharing this repo. It has been very helpful for integrating open source with SAS Model Studio.

I just have a general question about the model in sf_onehotvars_sklearn_randomforest.py. You use pd.get_dummies to encode your categorical data. What is the advantage of using get_dummies instead of OneHotEncoder from sklearn?

I guess this is more of a general machine learning question, but I would love to hear your perspective. Thanks!

rmyneni commented 3 years ago

If I remember correctly, sklearn's OneHotEncoder did not support string columns in the past, but it does now. So if you primarily use sklearn for your machine learning work, I would just stick with sklearn and use OneHotEncoder.
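To make the comparison concrete, here is a minimal sketch of one practical difference between the two encoders. It is not taken from the repo: the toy `color` column and the frame names are made up for illustration. It shows that pd.get_dummies builds its columns from whatever frame it is given, while OneHotEncoder learns the categories once in `fit` and reuses them in `transform`.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data (hypothetical); "purple" never appears in the training frame.
train = pd.DataFrame({"color": ["red", "green", "blue", "green"]})
new = pd.DataFrame({"color": ["green", "purple"]})

# get_dummies derives its columns from the frame it is handed, so the
# training and scoring frames can end up with different column sets.
train_dummies = pd.get_dummies(train, columns=["color"])
new_dummies = pd.get_dummies(new, columns=["color"])

# OneHotEncoder stores the categories seen in fit() and applies the same
# layout in transform(); handle_unknown="ignore" encodes unseen values
# as an all-zero row instead of raising an error.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train[["color"]])
train_encoded = encoder.transform(train[["color"]]).toarray()
new_encoded = encoder.transform(new[["color"]]).toarray()

print(list(train_dummies.columns))  # ['color_blue', 'color_green', 'color_red']
print(list(new_dummies.columns))    # ['color_green', 'color_purple']
print(list(encoder.categories_[0])) # ['blue', 'green', 'red']
print(new_encoded.tolist())         # [[0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
```

With get_dummies, the scoring frame would need to be reindexed to the training columns before it could be passed to a fitted model; the fitted encoder handles that alignment itself.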

ruqianq commented 3 years ago

Thank you so much! I have seen lots of discussion about how to properly encode variables. Most people, like you said, agree that OneHotEncoder is the better choice. But in my experience, it is not as easy to use as get_dummies. Anyway, thank you for your comment.
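For readers who find OneHotEncoder awkward for the same reason, one common way to reduce the effort is to wrap it in a ColumnTransformer inside a Pipeline, so the encoding is fitted and applied together with the model. The sketch below is only an illustration under assumptions: the `job` and `income` columns, the toy data, and the step names are hypothetical and do not come from sf_onehotvars_sklearn_randomforest.py.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data; column names are hypothetical, not from the repository.
X = pd.DataFrame({
    "job": ["sales", "office", "sales", "mgr", "office", "mgr"],
    "income": [40.0, 55.0, 38.0, 90.0, 60.0, 85.0],
})
y = [0, 0, 0, 1, 0, 1]

categorical = ["job"]

# One-hot encode the categorical columns and pass the rest through unchanged.
preprocess = ColumnTransformer(
    transformers=[
        ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical),
    ],
    remainder="passthrough",
)

# The pipeline fits the encoder and the forest in a single call.
model = Pipeline(steps=[
    ("preprocess", preprocess),
    ("forest", RandomForestClassifier(n_estimators=50, random_state=0)),
])
model.fit(X, y)

# New rows go through the same fitted encoding; the unseen "intern"
# category becomes an all-zero one-hot block rather than an error.
new_rows = pd.DataFrame({"job": ["sales", "intern"], "income": [42.0, 30.0]})
predictions = model.predict(new_rows)
print(predictions)
```

The trade-off is a little more setup than a single get_dummies call, in exchange for an encoding that travels with the fitted model.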