szilard / benchm-ml

A minimal benchmark for scalability, speed and accuracy of commonly used open source implementations (R packages, Python scikit-learn, H2O, xgboost, Spark MLlib etc.) of the top machine learning algorithms for binary classification (random forests, gradient boosted trees, deep neural networks etc.).
MIT License

Integer encoding for categorical variables in random forests in R #22

Closed zachmayer closed 8 years ago

zachmayer commented 8 years ago

This quote stuck out to me:

"It cannot cope by default with a large number of categories, therefore the data had to be one-hot encoded."

Did you try integer-encoding the categories? It looks like you did for Python, so it may be worth trying in R as well.
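
For readers unfamiliar with the two encodings being compared, here is a minimal sketch in plain Python. The `carriers` column and both helper functions are hypothetical illustrations, not code or data from the benchmark. Integer encoding keeps a single column but imposes an arbitrary order on the categories, while one-hot encoding adds one indicator column per category.

```python
# Hypothetical toy column; not taken from the benchmark's data.
carriers = ["AA", "UA", "DL", "AA", "WN", "DL"]


def integer_encode(values):
    """Map each distinct category to an integer code (alphabetical order)."""
    levels = sorted(set(values))
    index = {level: i for i, level in enumerate(levels)}
    return [index[v] for v in values], levels


def one_hot_encode(values):
    """Expand each value into one 0/1 indicator per distinct category."""
    codes, levels = integer_encode(values)
    rows = [[1 if code == j else 0 for j in range(len(levels))] for code in codes]
    return rows, levels


int_codes, levels = integer_encode(carriers)
one_hot_rows, _ = one_hot_encode(carriers)

print(levels)                # distinct categories, sorted
print(int_codes)             # one integer per row: a single feature column
print(len(one_hot_rows[0]))  # one-hot width equals the number of categories
```

With many categories the one-hot matrix grows by one column per level, whereas the integer-encoded version stays one column wide. The trade-off is that a tree then splits on the arbitrary integer order rather than on category membership.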

szilard commented 8 years ago

Yes, I experimented with that; see the discussion at https://github.com/szilard/benchm-ml/issues/1

szilard commented 8 years ago

Closing this, but let me know if you have further questions.