String munging: toNA

h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.

Apache License 2.0

I'm not sure there is a clear parallel for this method in either R or Pandas. Take the airlines dataset as an example: it has about 5 different spellings of the word "unknown", because the spelling kept changing over time. Ideally a user wants all of these to become NA. toNA() should take a string or a list of strings and return a new column in which every entry that exactly matches one of those strings is replaced with NA.
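To make the requested behaviour concrete, here is a minimal sketch in plain Python. It is not H2O's API: `to_na` is a hypothetical helper, `None` stands in for NA, and the sample spellings are made up for illustration rather than taken from the airlines dataset.

```python
from typing import Iterable, List, Optional, Union


def to_na(column: Iterable[Optional[str]],
          targets: Union[str, Iterable[str]]) -> List[Optional[str]]:
    """Hypothetical helper: return a new list where exact matches become None (NA)."""
    # Accept either a single string or a collection of strings.
    target_set = {targets} if isinstance(targets, str) else set(targets)
    # Exact, case-sensitive comparison; entries that are already None pass through.
    return [None if value in target_set else value for value in column]


# Made-up sample values with two spellings of "unknown".
tail_numbers = ["N123AA", "Unknown", "UNKNOW", "N456UA", None, "unknown0"]
cleaned = to_na(tail_numbers, ["Unknown", "UNKNOW"])
print(cleaned)
```

Only exact matches are replaced, so "unknown0" survives, and the input list is left untouched because a new list is returned.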

For other string-related methods in Java, see h2o-core/src/main/java/water/ASTStrOp.java, and then h2o-core/src/main/java/water/fvec/CStrChunk.java for accelerated versions of those methods that apply when the string column is pure ASCII.

String munging: toNA #9495