h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

destination_frame needed on as.factor(), h2o.deepfeatures(), etc. #10223

Open exalate-issue-sync[bot] opened 1 year ago

exalate-issue-sync[bot] commented 1 year ago

Some API functions return new data frames but offer no way to name them. The frames get a random-looking name, leaving you to guess which is which.

Here is a full use case, for as.factor, at least:

  1. Running 2+ nodes, on EC2, and running rstudio on each node.
  2. From node 1 I've done:
     {code}
     train = h2o.importFile(...)
     valid = h2o.importFile(...)
     test = h2o.importFile(...)
     train[,"ans"] = as.factor(train[,"ans"])
     valid[,"ans"] = as.factor(valid[,"ans"])
     test[,"ans"] = as.factor(test[,"ans"])
     {code}
  3. I've then started a long-running model on node 1, so rstudio is busy
  4. Open rstudio on node 2, wanting to do something else with train/valid/test.
     {code}
     train = h2o.getFrame("???")
     test = h2o.getFrame("???")
     valid = h2o.getFrame("???")
     {code}

(In this case I could use Flow to work out which was train, by the number of rows; but test and valid were the same size! I was reduced to guessing, then looking at the original csv files to see if I had guessed correctly.)

BTW, in this case, if the randomly generated name of a copy frame was based on the name of the original frame, I'd have been okay. (importFile() chooses a name based on the csv filename.) It'd be nice to have that feature, too.
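In the meantime, a possible workaround is to copy each derived frame to a known key with `h2o.assign()`, so that a second client can fetch it by name. The sketch below is only an illustration, not a fix for the missing `destination_frame` argument: `derived_key()` and `factor_and_name()` are hypothetical helper names, the `"_"` separator is an arbitrary choice, and the wrapper assumes a client already connected to the cluster.

```r
# Hypothetical helper: build a predictable key from a base name and a suffix.
derived_key <- function(base, suffix) {
  paste0(base, "_", suffix)
}

# Hypothetical wrapper: convert one column to a factor, then copy the result
# to a frame stored under a known key. Assumes a connected H2O cluster.
factor_and_name <- function(frame, column, key) {
  frame[, column] <- h2o::as.factor(frame[, column])
  h2o::h2o.assign(frame, key)  # copies the frame and stores it under `key`
}

# Usage on node 1 (needs a running cluster, so it is left commented out):
# train <- h2o.importFile(...)
# train <- factor_and_name(train, "ans", derived_key("train", "factored"))
#
# Then on node 2:
# train <- h2o.getFrame("train_factored")
```

Since `h2o.assign()` makes a copy, the randomly named intermediate frame presumably stays in the cluster until it is removed, so this costs extra memory on large data.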

h2o-ops commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-3308
Assignee: New H2O Bugs
Reporter: Darren Cook
State: Open
Fix Version: N/A
Attachments: N/A
Development PRs: N/A