h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

Lookup column: add column in-place during join #10375

Open exalate-issue-sync[bot] opened 1 year ago

exalate-issue-sync[bot] commented 1 year ago

Given a big table "big" and a tiny dictionary "dict", merging the two datasets creates an additional copy of the data, even when the result is assigned to the same variable in R. If we only want to look up a new column from the dictionary into the big table, we have to remove the original H2O frame to avoid keeping that copy. It would be handy to add the column in place during the join. The code below shows the current workflow: join, then remove the original H2O frame.

{code}
memory_usage <- function() {
  res <- h2o:::.h2o.fromJSON(jsonlite::fromJSON(h2o:::.h2o.doSafeGET(urlSuffix = h2o:::.h2o.__CLOUD), simplifyDataFrame = FALSE))
  sum(sapply(res$nodes, `[[`, "mem_value_size") / (1024^2)) # MB
}
library(h2o)
h2o.init()
memory_usage()
# [1] 0
big = as.h2o(iris)
dict = data.frame(Species=c("virginica","versicolor","setosa"), new_species=c(rep("versinica",2), "setosinica"))
dict = as.h2o(dict)
h2o.ls()
#    key
# 1 dict
# 2 iris
memory_usage()
# [1] 0.0078125
big = h2o.merge(big, dict, by="Species")
h2o.ls()
#               key
# 1 RTMP_sid_9aec_8
# 2            dict
# 3            iris
memory_usage()
# [1] 0.015625
h2o.getId(big)
# [1] "RTMP_sid_9aec_8"
h2o.rm("iris")
h2o.ls()
#               key
# 1 RTMP_sid_9aec_8
# 2            dict
memory_usage()
# [1] 0.01074219
{code}
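Until an in-place option exists, the join-then-remove workflow above could be wrapped in a small helper. The following is only a sketch: `lookup_and_release` is a hypothetical name, not part of h2o, and it assumes the R client evaluates frames lazily, so it calls `h2o.getId()` on the merged frame to materialize the result before the original key is removed.

```r
# Hypothetical helper (not part of the h2o package): join `dict` onto `big`,
# then remove the original frame so only the merged copy stays in the cluster.
lookup_and_release <- function(big, dict, by) {
  # Remember the key of the original frame before it is replaced.
  old_id <- h2o::h2o.getId(big)
  merged <- h2o::h2o.merge(big, dict, by = by)
  # Assumption: asking for the id forces the merge to run in the cluster,
  # so the original frame is no longer needed afterwards.
  new_id <- h2o::h2o.getId(merged)
  # Guard against removing the result if both keys happen to coincide.
  if (!identical(old_id, new_id)) {
    h2o::h2o.rm(old_id)
  }
  merged
}
```

With a running cluster this would be used as `big = lookup_and_release(big, dict, by = "Species")`. Peak memory still includes both copies while the merge runs, so this only shortens the workflow and does not replace a true in-place join.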

h2o-ops commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-3464
Assignee: Matt Dowle
Reporter: Jan Gorecki
State: Open
Fix Version: N/A
Attachments: N/A
Development PRs: N/A