h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

Modify Dataframes #12502

Open exalate-issue-sync[bot] opened 1 year ago

exalate-issue-sync[bot] commented 1 year ago

Currently, data frames can be split by random splits.

I would like the following types of splits:

  1. Row-based data frame split/reweighting
     a. By column, i.e. split the data frame by where a column matches a criterion (=, >, >=, <, <=)
     b. Ability to set the weights of the split (either set to 0 to filter, or use sampling parameters)
  2. Column-based
     a. Remove unneeded columns from the data frame
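
To make the requested behavior concrete, here is a minimal sketch in plain Python over a list of dicts. It is not H2O code and does not use any H2O API; the function names `split_rows` and `select_columns`, the parameters `weight_match` and `weight_rest`, and the sample data are all hypothetical, chosen only to illustrate the semantics described in the list above.

```python
import operator
import random

# Comparison operators named in the request.
OPS = {
    "=": operator.eq,
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


def split_rows(rows, column, op, value, weight_match=1.0, weight_rest=0.0, seed=None):
    """Row-based split (hypothetical helper).

    Each row is kept with probability `weight_match` if row[column] satisfies
    the criterion, otherwise with probability `weight_rest`. A weight of 0
    filters the rows out; a weight between 0 and 1 samples them.
    """
    rng = random.Random(seed)
    compare = OPS[op]
    kept = []
    for row in rows:
        weight = weight_match if compare(row[column], value) else weight_rest
        # rng.random() is in [0, 1), so weight 1.0 always keeps and 0.0 never keeps.
        if rng.random() < weight:
            kept.append(row)
    return kept


def select_columns(rows, columns):
    """Column-based split (hypothetical helper): keep only the named columns."""
    return [{name: row[name] for name in columns} for row in rows]


# Made-up sample data for illustration only.
rows = [
    {"x": 1, "y": 2, "z": 3, "extra": 9, "p1": 0.05},
    {"x": 4, "y": 5, "z": 6, "extra": 9, "p1": 0.40},
]

# Keep rows where p1 > 0.10, then keep only x, y, z and the prediction column.
filtered = select_columns(split_rows(rows, "p1", ">", 0.10), ["x", "y", "z", "p1"])
```

With the default weights (1.0 for matching rows, 0.0 for the rest) the split is a deterministic filter; passing fractional weights turns it into a sampled split, which is the reweighting variant asked for in item 1b.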

Use cases:

  1. I am using Hadoop and I want to export predictions. I only want rows where p1 > .10, and only columns x, y, z plus the prediction data. As I type this, I am waiting 4 hours for a large dataset to export on a 220-node cluster.

hasithjp commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-5643
Assignee: New H2O Bugs
Reporter: Matthew Burris
State: Open
Fix Version: N/A
Attachments: N/A
Development PRs: N/A