h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

Would be good if h2o automatically calls the garbage collection more often if a lot of temporary files are getting created #9465

Open exalate-issue-sync[bot] opened 1 year ago

exalate-issue-sync[bot] commented 1 year ago

If a user creates features in a loop and appends them to the original dataset, for example:

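```r
for (i in 75:168) {
  j <- indx[i]
  print(j)

  # Append a derived column: keep the three known levels, map everything else to "Unknown"
  train[, ncol(train) + 1] <- h2o.ifelse(train[, j] %in% c("One", "Two", "Three"), train[, j], "Unknown")

  # Re-register the extended frame under the key "train"
  train <- h2o.assign(train, key = "train")

  print(h2o.ls())
  print(dim(train))
}
```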

H2O creates a lot of temp files (each the same size as the original data file, which could be gigabytes) that take a long time to garbage collect and might lead to out-of-memory (OOM) situations.

To avoid this, I found that I had to call `h2o.ls()` explicitly inside the loop, which triggers garbage collection of the temp values. It would be good if H2O ran garbage collection automatically and more often, based on how many temps are being created.

Also note that if you go to Flow and explicitly delete the temp files (created from R), the original data file sometimes gets deleted as well.
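Until something like this is automatic, the workaround can be wrapped in a small helper. The sketch below is illustrative only: `run_with_cleanup` is a hypothetical function, not part of the h2o package, and it relies on the behavior described in this report, namely that calling `h2o.ls()` causes the temp values to be collected. The `every` parameter is likewise an assumption, chosen so that cleanup does not run on every single iteration.

```r
# Hypothetical helper (not part of the h2o package): run `step(i)` for each
# element of `indices` and call `cleanup()` after every `every` iterations.
run_with_cleanup <- function(indices, step,
                             cleanup = function() invisible(gc()),
                             every = 10L) {
  stopifnot(every >= 1L)
  for (k in seq_along(indices)) {
    step(indices[[k]])
    # Trigger cleanup periodically instead of waiting for R's own GC schedule
    if (k %% every == 0L) cleanup()
  }
  invisible(length(indices))
}

# Usage sketch against a running H2O cluster (left commented out because it
# needs `library(h2o)`, a live cluster, and the `train` and `indx` objects
# from the loop in this report):
#
# run_with_cleanup(
#   indx[75:168],
#   step = function(j) {
#     train[, ncol(train) + 1] <<- h2o.ifelse(
#       train[, j] %in% c("One", "Two", "Three"), train[, j], "Unknown")
#     train <<- h2o.assign(train, key = "train")
#   },
#   cleanup = function() invisible(h2o.ls()),  # per this report, frees temps
#   every = 10L
# )
```

A smaller `every` keeps fewer temps alive at once but adds a round trip to the cluster each time cleanup runs, so the right value depends on frame size and available memory.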

h2o-ops commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-2521
Assignee: New H2O Bugs
Reporter: Nidhi Mehta
State: Open
Fix Version: N/A
Attachments: N/A
Development PRs: N/A