h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

Clarify toNumeric on categorical columns #15119

Open exalate-issue-sync[bot] opened 1 year ago

exalate-issue-sync[bot] commented 1 year ago

Currently, toNumeric() looks at the first item in the domain list. If that item can be interpreted as an integer, the domain values are used to create the resulting numeric vector. If it cannot be read as an integer, the enumeration levels are used instead. While occasionally handy, this inconsistency will look like magic to many users. Instead, the return value should always be the enumeration levels (à la R). To use the domain for the result, as.numeric(as.character(foo)) should be used in R, with an optimization in the Rapids tree walker to collapse the two operations into a single call to VecUtils.categoricalDomainsToNumeric() (thus skipping the unneeded creation of a temporary string column).
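To make the inconsistency concrete, here is a minimal pure-Python sketch of the two behaviors. It is a model only, not H2O code: the function names, the (domain, codes) representation of a categorical column, and the use of 0-based level indices are all assumptions made for illustration (R itself reports 1-based codes).

```python
# Pure-Python model of a categorical column; this is NOT the H2O API.
# `domain` holds the level labels, `codes` holds each row's index into `domain`.
# All names here are hypothetical and exist only for this illustration.

def _is_int(label):
    """Return True if the label parses as an integer."""
    try:
        int(label)
        return True
    except ValueError:
        return False

def to_numeric_current(domain, codes):
    """Behavior described in the issue: branch on the first domain item."""
    if domain and _is_int(domain[0]):
        # Assumes every label parses once the first one does.
        return [float(domain[c]) for c in codes]  # values taken from the labels
    return [float(c) for c in codes]              # values taken from the level indices

def to_numeric_proposed(domain, codes):
    """Proposed behavior: always return the enumeration levels."""
    return [float(c) for c in codes]

years = ["2014", "2015", "2016"]
colors = ["blue", "green", "red"]
codes = [2, 0, 1]

print(to_numeric_current(years, codes))    # [2016.0, 2014.0, 2015.0]
print(to_numeric_current(colors, codes))   # [2.0, 0.0, 1.0]
print(to_numeric_proposed(years, codes))   # [2.0, 0.0, 1.0]
```

The same call returns label values for one column and level indices for the other, depending only on what the first domain item happens to look like. Under the proposal both columns yield level indices.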

VecUtils.categoricalDomainsToNumeric() should be made to handle either integer or real values.
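As a rough sketch of what handling both integer and real labels could look like, the following pure-Python stand-in parses each domain label once and then maps rows through the parsed table. It is not the actual VecUtils implementation; the function name, the (domain, codes) representation, and the choice to turn unparseable labels into NaN are assumptions.

```python
import math

def categorical_domains_to_numeric(domain, codes):
    """Hypothetical stand-in: map each row to the numeric value of its label."""
    parsed = []
    for label in domain:
        try:
            parsed.append(float(label))   # accepts "3" as well as "3.5"
        except ValueError:
            parsed.append(math.nan)       # assumption: unparseable labels become missing
    # The domain is parsed once, so no per-row string column is needed.
    return [parsed[c] for c in codes]

print(categorical_domains_to_numeric(["1", "2.5", "10"], [1, 0, 2, 1]))
# [2.5, 1.0, 10.0, 2.5]
```

Parsing the domain up front is what makes the collapsed as.numeric(as.character(foo)) call cheap: the work scales with the number of levels rather than the number of rows.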

This will be a user-facing change and should be highlighted in the notes of whichever release includes it. It will also break regression tests that rely on the current behavior.

DinukaH2O commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-2208
Assignee: Brandon Hill
Reporter: Brandon Hill
State: Open
Fix Version: N/A
Attachments: N/A
Development PRs: N/A