H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
H2O creates a lot of temp frames (each the same size as the original data file, which can be GBs) that take a long time to garbage collect and might lead to OOM situations.
To avoid this, I found I had to explicitly call h2o.ls() inside the loop, which explicitly garbage collects the temp values.
It would be good if H2O called garbage collection more often automatically (based on how many temporaries are being created).
(Also note that if you go to Flow and explicitly delete the temp frames created from R, sometimes the original data frame also gets deleted.)
If a user is creating features in a loop and adding them to the original dataset, e.g.:
{code:r}
for (i in 75:168) {
  j <- indx[i]
  print(j)
  train[, ncol(train) + 1] <- h2o.ifelse(train[, j] %in% c("One", "Two", "Three"),
                                         train[, j], "Unknown")
  train <- h2o.assign(train, key = "train")
  print(h2o.ls())
  print(dim(train))
}
{code}
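For what it's worth, a minimal sketch of the explicit-cleanup workaround from the R side (an assumption, not H2O's documented behavior): list the cluster keys after each iteration and remove the intermediate frames by id. The `RTMP` prefix assumed here is how the R client names its temporary keys in the versions I have seen; adjust the pattern if yours differ, and note the caveat above about the original frame sometimes being deleted along with the temps.

{code:r}
# Assumption: temp frames created by the R client have keys starting with "RTMP".
# h2o.ls() returns the keys currently on the cluster; h2o.rm() deletes by id.
keys <- as.character(h2o.ls()$key)
temp_keys <- grep("^RTMP", keys, value = TRUE)
if (length(temp_keys) > 0) h2o.rm(temp_keys)
{code}

Calling this at the end of each loop iteration keeps the temporaries from accumulating, at the cost of an extra round trip to the cluster per iteration. Make sure the pattern cannot match the key of the original frame (here, "train").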