h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

Memory leak in H2O (standalone cluster) #15429

Open exalate-issue-sync[bot] opened 1 year ago

exalate-issue-sync[bot] commented 1 year ago

UPDATE

I created a reproducible example in R and tested it on a tiny 4-node Linux cluster using h2o 3.8.3.2: [^MCV_h2o_memory_leak.R]

The workflow creates dummy data and then iteratively computes a new model, makes a prediction, calculates a dummy KPI, and finally removes the model plus the prediction data. It uses the "full blown gc" approach from Tom (https://groups.google.com/d/msg/h2ostream/Dc6l4xzwkaU/n-w2p02mBwAJ). You can run it with
{code}
R CMD BATCH --no-save --no-restore '--args IP="" PORT= N=<# of iterations> nCols=<# columns of the dataframe> nRows=<# rows of the dataframe> H2O_FULL_BLOWN_GC=' MCV_h2o_memory_leak.R
{code}
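The iteration pattern described above (train a model, predict, compute a KPI, remove the model and the prediction data, optionally force a "full blown gc") can be sketched as generic control flow. This is only a sketch: the h2o calls are injected as plain callables, so `train`, `predict`, `kpi`, and `remove` are placeholders standing in for `h2o.deeplearning`, `h2o.anomaly`, a summary statistic, and `h2o.rm`, not real h2o APIs.

```python
# Sketch of the MCVE's control flow. The h2o-specific calls are injected as
# plain callables so the loop structure can be shown without a live cluster:
# `train`, `predict`, `kpi`, and `remove` are hypothetical stand-ins for
# h2o.deeplearning, h2o.anomaly, a summary statistic, and h2o.rm.
def run_iterations(n, train, predict, kpi, remove, full_blown_gc=None):
    results = []
    for i in range(n):
        model = train(i)               # new model every iteration
        predictions = predict(model)   # e.g. reconstruction error per row
        results.append(kpi(predictions))
        remove(model)                  # housekeeping: drop the model ...
        remove(predictions)            # ... and the prediction data
        if full_blown_gc is not None:
            full_blown_gc()            # optional "full blown gc" step
    return results
```

With this shape, the leak question reduces to: after `remove` (and optionally `full_blown_gc`) runs each iteration, cluster memory should return to its pre-iteration level.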

I ran it twice: once with only simple housekeeping (h2o.rm) and a larger dataset, and once with Tom's GC approach and a smaller dataset. In both cases I used a fresh h2o cluster where each of the four nodes was started according to
{code}
java -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:mem_log.txt -cp h2odriver.jar water.H2OApp -flatfile /h2o-3.8.3.2-cdh5.4.2/flatfile.txt -port 54321
{code}

I attached the JVM node logs from each run.

A first analysis indicates that in both cases the heap increases from iteration to iteration, regardless of whether we use just simple housekeeping or multiple garbage collections.
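One way to quantify "heap increases from iteration to iteration" is to pull the post-GC heap sizes out of the `-Xloggc:mem_log.txt` files. A minimal sketch, assuming the usual HotSpot `-XX:+PrintGCDetails` notation `before->after(total)` (the sample lines in the test are illustrative, not taken from the attached logs):

```python
import re

# Matches HotSpot heap transitions such as "6092K->5374K(8192K)" as written
# by -verbose:gc / -XX:+PrintGCDetails. We keep the "after" value, i.e. the
# live heap remaining once the collection finished.
HEAP_RE = re.compile(r"(\d+)K->(\d+)K\((\d+)K\)")

def heap_after_gc(log_lines):
    """Return the heap size (KB) remaining after each GC event."""
    return [int(m.group(2)) for line in log_lines
            for m in HEAP_RE.finditer(line)]

def is_monotonically_growing(samples):
    """Crude leak heuristic: every post-GC heap exceeds the previous one."""
    return all(b > a for a, b in zip(samples, samples[1:]))
```

A steadily rising post-GC floor across iterations is the signature of live (unreclaimable) objects accumulating, which plain GC pressure cannot explain.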


Monitoring memory consumption in h2o shows that there is a memory leak when running repetitive model-creation jobs. Typical ML use cases for this are hyperparameter tuning, model validation using a resampling approach, feature selection, bootstrapping, etc. Our example is feature selection: we take a subset of the features, train a model, and evaluate it afterwards. After each of these iterations all new data sets (the prediction data set) and the model files are removed with h2o.rm():
{code}
h2o.rm(iter_model@model_id)
h2o.rm(iter_data)
{code}

The cluster is a six-node cluster where each node is started with {code}JAVA_HEAP=8g{code}. After we encountered the first problem, we started a new run and in parallel created a small monitoring script to get a constant update of h2o cluster statistics (using {code}h2o.clusterStatus(){code}). This script runs on the main node: [^h2o_cluster_status_tracker.R]

Additionally, the script counts the number of keys (user objects) using the R API function {code}h2o.ls{code}. The analysis of the monitoring data after approx. 15 h shows: !h2o statistics.png|thumbnail!
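A tracker like the one described above is essentially a polling loop that snapshots cluster statistics at a fixed interval. A minimal sketch of that loop, with the status call injected as a plain callable (in R this would be `h2o.clusterStatus()` plus `h2o.ls()`; the dict keys used here mirror the metric names below and are assumptions, not an h2o API):

```python
import time

# Sketch of the cluster-status tracker. `get_status` is an injected callable
# standing in for the real h2o status query; it is assumed to return a dict
# with "free_mem", "pojo_mem", and "kv_count" entries per snapshot.
def track(get_status, samples, interval_s=0.0):
    history = []
    for _ in range(samples):
        snap = get_status()
        history.append({"free_mem": snap["free_mem"],
                        "pojo_mem": snap["pojo_mem"],
                        "kv_count": snap["kv_count"]})
        if interval_s:
            time.sleep(interval_s)  # poll period between snapshots
    return history

def free_mem_trend(history):
    """Negative if free memory shrank between first and last sample."""
    return history[-1]["free_mem"] - history[0]["free_mem"]
```

The useful property of such a trace is exactly what the analysis below relies on: a constant `kv_count` combined with a falling `free_mem` isolates the growth to something other than user-visible keys.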

* We keep our workspace clean, so the number of user objects stays constant (kv_count)
* Free mem is decreasing over time (free_mem)
* POJO mem is increasing over time, with clearly visible spikes (pojo_mem)

The pojo_mem spikes correspond with log warnings of the form [for node .45.2]:
{code}
08-03 22:03:48.134 172.17.45.2:54321 16 #e Thread WARN: Unblock allocations; cache below desired, but also OOM: GC CALLBACK, (K/V:11.7 MB + POJO:6.04 GB + FREE:1.06 GB == MEM_MAX:7.11 GB), desiredKV=910.3 MB OOM!
{code}

[for node .45.3]:
{code}
08-03 22:10:15.645 172.17.45.3:54321 17 #e Thread WARN: Unblock allocations; cache below desired, but also OOM: GC CALLBACK, (K/V:10.0 MB + POJO:6.54 GB + FREE:572.2 MB == MEM_MAX:7.11 GB), desiredKV=910.3 MB OOM!

08-04 01:27:37.578 172.17.45.3:54321 17 FJ-2-31 WARN: Unblock allocations; cache below desired, but also OOM: OOM, (K/V:11.4 MB + POJO:4.27 GB + FREE:2.83 GB == MEM_MAX:7.11 GB), desiredKV=3.20 GB OOM!
08-04 01:27:37.581 172.17.45.3:54321 17 FJ-2-23 WARN: Unblock allocations; cache below desired, but also OOM: OOM, (K/V:11.4 MB + POJO:4.27 GB + FREE:2.83 GB == MEM_MAX:7.11 GB), desiredKV=3.20 GB OOM!
08-04 01:27:37.581 172.17.45.3:54321 17 FJ-2-19 WARN: Unblock allocations; cache below desired, but also OOM: OOM, (K/V:11.4 MB + POJO:4.27 GB + FREE:2.83 GB == MEM_MAX:7.11 GB), desiredKV=3.20 GB OOM!
{code}
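The memory breakdown embedded in these warnings can be extracted programmatically when correlating them with the pojo_mem spikes. A small sketch keyed to the exact format shown above (the `K/V: ... + POJO: ... + FREE: ... == MEM_MAX: ...` breakdown); the function name and GB normalization are my own choices, not anything from H2O:

```python
import re

# Parses the memory breakdown in H2O's "Unblock allocations" warning, e.g.
# "(K/V:11.7 MB + POJO:6.04 GB + FREE:1.06 GB == MEM_MAX:7.11 GB)".
WARN_RE = re.compile(
    r"K/V:([\d.]+) (MB|GB) \+ POJO:([\d.]+) (MB|GB) "
    r"\+ FREE:([\d.]+) (MB|GB) == MEM_MAX:([\d.]+) (MB|GB)")

def _gb(value, unit):
    """Normalize an MB/GB figure to GB."""
    return float(value) / 1024.0 if unit == "MB" else float(value)

def parse_oom_warning(line):
    """Return (kv_gb, pojo_gb, free_gb, max_gb), or None if no match."""
    m = WARN_RE.search(line)
    if m is None:
        return None
    v = m.groups()
    return tuple(_gb(v[i], v[i + 1]) for i in range(0, 8, 2))
```

Plotting the extracted POJO figure per node over time makes the correlation between these warnings and the pojo_mem spikes directly visible.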

As the number of user objects is constant, the memory increase indicates some kind of problematic garbage collection or housekeeping, and it has a serious impact on the usability of the h2o cluster: node failure. In our first run this effect caused the current job to stop processing, and most Flow requests became unresponsive. To solve the problem we had to restart the cluster, meaning a complete loss of data and results.

exalate-issue-sync[bot] commented 1 year ago

Tom Kraljevic commented: I tried this myself, and am not seeing the same problem.

Running with h2o 3.10.0.8.

Here is how I started h2o:

{code}
java -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xmx8g -jar ~/Downloads/h2o-3.10.0.8/h2o.jar 1> out 2> err &
{code}

Here is the R script I am using:

{code}
require(h2o)

args = (commandArgs(TRUE))
if (length(args) == 0) {
  print("No arguments supplied.")
  IP = "127.0.0.1"; PORT = 54321; N = 100
  nCols <- 100; nRows <- 10000
  H2O_FULL_BLOWN_GC = TRUE
} else {
  for (i in 1:length(args)) { print(args[[i]]); eval(parse(text = args[[i]])) }
}

# init h2o
h2o.init(ip = IP, port = PORT)

# note: `loss` appears twice ("Huber", then "Automatic") as in the original
paramList = list(epochs = 50, activation = "Tanh", adaptive_rate = TRUE,
                 loss = "Huber", hidden = c(90, 10, 90), input_dropout_ratio = 0.05,
                 variable_importances = FALSE, ignore_const_cols = FALSE,
                 l1 = 1e-5, l2 = 1e-4, loss = "Automatic", distribution = "AUTO",
                 overwrite_with_best_model = TRUE, autoencoder = TRUE, x = c(1:nCols))

data     <- as.h2o(as.data.frame(matrix(rnorm(100 * nRows), ncol = nCols)))
datatest <- as.h2o(as.data.frame(matrix(rnorm(round(100 * (nRows * 1.5))), ncol = nCols)))
result_set <- list()
for (i in 1:N) {
  iter_model <- do.call(h2o.deeplearning,
                        modifyList(list(training_frame = data,
                                        model_id = paste0("GCDEBUG", i)), paramList))
  iter_predictions <- h2o.anomaly(iter_model, datatest)
  result_set[i] <- apply(as.data.frame(iter_predictions), 2, mean)

  h2o.rm(iter_model@model_id)
  h2o.rm(iter_predictions)

  if (H2O_FULL_BLOWN_GC) {
    rm(iter_model); rm(iter_predictions)
    gc(); gc()
    h2o:::.h2o.garbageCollect(); Sys.sleep(5)
    h2o:::.h2o.garbageCollect(); Sys.sleep(5)
    h2o:::.h2o.garbageCollect(); Sys.sleep(5)
  }
}

h2o.rm(datatest)
h2o.rm(data)
{code}

Attaching the graph from GCViewer (tom_gc_1.png)

exalate-issue-sync[bot] commented 1 year ago

Tom Kraljevic commented: !tom_gc_1.png|thumbnail!

exalate-issue-sync[bot] commented 1 year ago

Divya Mereddy commented: I also faced the same problem with memory leakage.

hasithjp commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-3203
Assignee: Roberto Rösler
Reporter: Roberto Rösler
State: Open
Fix Version: N/A
Attachments: Available (Count: 8)
Development PRs: N/A

Attachments From Jira

Attachment Name: example_node1_multiple_gc.png Attached By: Roberto Rösler File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-3203/example_node1_multiple_gc.png

Attachment Name: example_node1_simple_housekeeping.png Attached By: Roberto Rösler File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-3203/example_node1_simple_housekeeping.png

Attachment Name: h2o_cluster_status_tracker.R Attached By: Roberto Rösler File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-3203/h2o_cluster_status_tracker.R

Attachment Name: h2o statistics.png Attached By: Roberto Rösler File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-3203/h2o statistics.png

Attachment Name: MCV_h2o_memory_leak.R Attached By: Roberto Rösler File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-3203/MCV_h2o_memory_leak.R

Attachment Name: multiple_gc.zip Attached By: Roberto Rösler File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-3203/multiple_gc.zip

Attachment Name: simple_housekeeping.zip Attached By: Roberto Rösler File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-3203/simple_housekeeping.zip

Attachment Name: tom_gc_1.png Attached By: Tom Kraljevic File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-3203/tom_gc_1.png