H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
Given a big table "big" and a tiny dictionary "dict", merging the two datasets creates an additional copy of the data, even when the result is assigned to the same variable in R. If we only want to look up a new column from the dictionary and add it to the big table, we have to remove the original H2O frame afterwards to avoid keeping the copy. It would be handy to be able to add the column in place during the join. The code below shows the current workflow: join, then remove the original H2O frame.
{code}
# Total size of values held in memory across all nodes of the cluster, in MB
memory_usage <- function() {
  res <- h2o:::.h2o.fromJSON(jsonlite::fromJSON(h2o:::.h2o.doSafeGET(urlSuffix = h2o:::.h2o.__CLOUD), simplifyDataFrame = FALSE))
  sum(sapply(res$nodes, `[[`, "mem_value_size") / (1024^2)) # MB
}
library(h2o)
h2o.init()
memory_usage()
[1] 0
big = as.h2o(iris)
dict = data.frame(Species=c("virginica","versicolor","setosa"), new_species=c(rep("versinica",2), "setosinica"))
dict = as.h2o(dict)
h2o.ls()
key
1 dict
2 iris
memory_usage()
[1] 0.0078125
big = h2o.merge(big, dict, by="Species")
h2o.ls()
key
1 RTMP_sid_9aec_8
2 dict
3 iris
memory_usage()
[1] 0.015625
h2o.getId(big)
[1] "RTMP_sid_9aec_8"
h2o.rm("iris")
h2o.ls()
key
1 RTMP_sid_9aec_8
2 dict
memory_usage()
[1] 0.01074219
{code}
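Until an in-place join exists, the join-then-remove workflow above could be wrapped in a small helper. The following is only a sketch under assumptions: `merge_and_drop` is a hypothetical name and not part of the h2o package, it relies only on `h2o.merge`, `h2o.getId` and `h2o.rm` as used in the session above, and it still creates the temporary copy during the join. It just makes sure the original frame is removed afterwards.

```r
library(h2o)

# Hypothetical helper (not part of the h2o package): join a lookup table
# onto a frame, then delete the original frame from the cluster so that
# only the merged copy remains.
merge_and_drop <- function(big, dict, by, ...) {
  old_id <- h2o.getId(big)                       # key of the frame before the join
  merged <- h2o.merge(big, dict, by = by, ...)   # creates a new frame on the cluster
  new_id <- h2o.getId(merged)                    # key of the merged frame
  if (!identical(old_id, new_id)) {
    h2o.rm(old_id)                               # remove the now-redundant original
  }
  merged
}

# Usage, assuming a cluster started with h2o.init() and the frames
# "big" and "dict" from the session above:
# big <- merge_and_drop(big, dict, by = "Species")
```

Peak memory during the join is unchanged by this wrapper; a true in-place column add would have to be implemented on the H2O side.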