Modify Dataframes - Githubissues

h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.

Apache License 2.0

Currently, data frames can only be split randomly.

I would like the following types of splits:

1. Row-based data frame split/reweighting
   a. By column, i.e. split the data frame where a column matches a criterion (=, >, >=, <, <=)
   b. Ability to set the weights of the split (either set to 0 to filter, or sampling parameters)
2. Column-based
   a. Remove unneeded columns from the data frame
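
To make the requested semantics concrete, here is a minimal plain-Python sketch over a list of dicts. It is not H2O code: the helper names `split_by_column`, `row_weights`, and `drop_columns`, the sample column `junk`, and the default weights are all hypothetical and only illustrate the behaviour being asked for.

```python
import operator

# The comparison operators named in the request: =, >, >=, <, <=
OPS = {
    "=": operator.eq,
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


def split_by_column(rows, column, op, value):
    """Split rows into (matching, rest) by comparing row[column] to value."""
    compare = OPS[op]
    matching, rest = [], []
    for row in rows:
        (matching if compare(row[column], value) else rest).append(row)
    return matching, rest


def row_weights(rows, column, op, value, match_weight=1.0, other_weight=0.0):
    """Return one weight per row; a weight of 0 acts as a filter."""
    compare = OPS[op]
    return [match_weight if compare(row[column], value) else other_weight
            for row in rows]


def drop_columns(rows, unneeded):
    """Return copies of the rows without the unneeded columns."""
    unneeded = set(unneeded)
    return [{k: v for k, v in row.items() if k not in unneeded} for row in rows]


# Hypothetical sample data for illustration only.
rows = [
    {"x": 1, "y": 2, "junk": "a", "p1": 0.05},
    {"x": 3, "y": 4, "junk": "b", "p1": 0.40},
    {"x": 5, "y": 6, "junk": "c", "p1": 0.10},
]

high, low = split_by_column(rows, "p1", ">", 0.10)
weights = row_weights(rows, "p1", ">", 0.10)
slim = drop_columns(high, ["junk"])
```

Returning weights rather than a filtered copy keeps the original frame intact, so the same data could be filtered (weight 0) or down-weighted for sampling without a second split.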

Use cases:

I am using Hadoop and I want to export predictions. I only want rows where p1 > .10, and only columns x, y, z plus the prediction data. As I type this, I have been waiting 4 hours to export a large dataset using a 220-node cluster.
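
The export described in this use case amounts to a row filter plus a column projection applied before writing. The sketch below shows that shape with the Python standard library only. The function name `export_filtered`, the extra column `w`, and the prediction column name `predict` are assumptions for illustration, and this is not how H2O exports frames.

```python
import csv
import io


def export_filtered(rows, out, columns, threshold=0.10):
    """Write only rows with p1 > threshold, keeping only the given columns.

    Returns the number of data rows written.
    """
    # extrasaction="ignore" drops any keys not listed in `columns`.
    writer = csv.DictWriter(out, fieldnames=columns, extrasaction="ignore")
    writer.writeheader()
    written = 0
    for row in rows:
        if row["p1"] > threshold:
            writer.writerow(row)
            written += 1
    return written


# Hypothetical scored rows: features x, y, z, an unwanted column w,
# and prediction columns (names assumed).
scored = [
    {"x": 1, "y": 2, "z": 3, "w": 9, "predict": 1, "p1": 0.4},
    {"x": 4, "y": 5, "z": 6, "w": 8, "predict": 0, "p1": 0.05},
]

buffer = io.StringIO()
written = export_filtered(scored, buffer, ["x", "y", "z", "predict", "p1"])
```

Filtering and projecting before the write means only the rows and columns of interest are serialized, which is the saving the request is after on a large dataset.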

Modify Dataframes #12502