h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

Error in .h2o.doSafeREST(h2oRestApiVersion = h2oRestApiVersion, urlSuffix = page, : Unexpected CURL error: getaddrinfo() thread failed to start #7670

Open exalate-issue-sync[bot] opened 1 year ago

exalate-issue-sync[bot] commented 1 year ago

I am running into an issue with AutoML when it is run in a loop in R. A single AutoML run does not trigger the error, but in the loop it occurs even though I free memory on every iteration.

Error in .h2o.doSafeREST(h2oRestApiVersion = h2oRestApiVersion, urlSuffix = page, : Unexpected CURL error: getaddrinfo() thread failed to start

{code:R}

# Preallocate vectors for AUCs
results4 <- c()
predictions4 <- c()
model_types4 <- c()
numModels <- 100
maxRuntime <- 60 # This is in seconds

# Run 100 experiments, i.e. train 100 AutoML models, using a randomized set of training data each time
# Each model will also have 5-fold cross-validation as a base parameter.
# This section is for the 4 split model

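# Load h2o and the packages the script relies on
library(h2o)
library(dplyr)     # select(), filter(), %>%
library(ukbtools)  # ukb_df()
h2o.init(nthreads=15)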

# Create training and validation frames
# Get CNV Scale Data
condensed <- read.csv("/data/ukbiobank/ukb_l2r_ids_allchr_condensed_4splits.txt", sep = " ")

# AD data
my_ukb_data <- ukb_df("ukb39651", path="/data/ukbiobank")
my_data <- select(my_ukb_data,
                  eid,
                  datereported = date_f00_first_reported_dementia_in_alzheimers_disease_f130836_0_0,
                  sourcereported = source_of_report_of_f00_dementia_in_alzheimers_disease_f130837_0_0)

# Get age related information
my_ukb_data_cancer <- ukb_df("ukb29274", path = "/data/ukbiobank/cancer")
my_data_age <- select(my_ukb_data_cancer, eid, yearBorn = year_of_birth_f34_0_0)

# Merge with CNV data
all_data <- merge(condensed, my_data, by.x = "ids", by.y = "eid")
all_data <- merge(all_data, my_data_age, by.x = "ids", by.y = "eid")

alzheimers <- all_data[!is.na(all_data[, "datereported"]),]
no_alzheimers_initial <- all_data[is.na(all_data[, "datereported"]),]

# Get breakdown of patients by age
alzheimers_age <- table(alzheimers$yearBorn)

for (i in 1:numModels) {

  # Randomly draw non-disease patients as controls so that there is an equal number per birth year.
  # This ensures that the controls are age-matched to the disease sample.
  # For example, there are 5 patients born in 1937 who have AD, so we randomly grab 5 other
  # patients born in 1937 who do not have AD.
  no_alzheimers <- data.frame(matrix(ncol = ncol(no_alzheimers_initial), nrow = 0))
  colnames(no_alzheimers) <- colnames(no_alzheimers_initial)
  for (j in seq_along(alzheimers_age)) { # j, so the outer loop's i is not reused
    temp <- alzheimers_age[j]
    age_check <- as.numeric(names(temp))
    number_cases <- as.numeric(unname(temp))
    possible_controls <- no_alzheimers_initial %>% filter(yearBorn == age_check)
    no_alzheimers <- rbind(no_alzheimers, possible_controls[sample(nrow(possible_controls), number_cases, replace = TRUE), ])
  }

  alzheimers$datereported <- TRUE
  no_alzheimers$datereported <- FALSE

  ind <- sample(c(TRUE, FALSE), nrow(alzheimers), replace=TRUE, prob=c(0.7, 0.3)) # Random split

  train <- alzheimers[ind, ]
  validate <- alzheimers[!ind, ]

  controls <- no_alzheimers # get controls

  train_controls <- controls[ind, ]
  validate_controls <- controls[!ind, ]

  # Combine controls with samples
  train <- rbind(train, train_controls)
  validate <- rbind(validate, validate_controls)

  # Set response column to factor
  train$datereported <- as.factor(train$datereported)
  validate$datereported <- as.factor(validate$datereported)

  # Remove unnecessary columns
  train <- train[,!names(train) %in% c("ids", "sex", "behavior")]
  validate <- validate[,!names(validate) %in% c("ids", "sex", "behavior")]

  # Load data into h2o
  train.hex <- as.h2o(train, destination_frame = "train.hex")
  validate.hex <- as.h2o(validate, destination_frame = "validate.hex")

  # Response column
  response <- "datereported"

  # Get predictors
  predictors <- colnames(train)
  predictors <- predictors[! predictors %in% response]         # Response cannot be a predictor
  predictors <- predictors[! predictors %in% "yearBorn"]       # Also excluded from the predictors
  predictors <- predictors[! predictors %in% "sourcereported"] # Also excluded from the predictors
  model <- h2o.automl(x = predictors,
                      y = response,
                      training_frame = train.hex,
                      validation_frame = validate.hex,
                      nfolds=5,
                      max_runtime_secs = maxRuntime)

  # Record the leading model's AUC in the dataset
  leader <- model@leader
  auc <- h2o.auc(leader, train=FALSE, xval=TRUE)
  results4 <- c(results4, auc)
  model_types4 <- c(model_types4, leader@algorithm)

  # Attempt predict on validation frame
  prediction <- h2o.predict(object = leader, newdata = validate.hex)
  as.data.frame(prediction)
  summary(prediction, exact_quantiles = TRUE)

  validation.perf <- h2o.performance(leader, train = FALSE, xval=TRUE, newdata = validate.hex)
  validation.perf.auc <- validation.perf@metrics$AUC

  predictions4 <- c(predictions4, validation.perf.auc)
  h2o.removeAll()

rm(train.hex, validate.hex, model, leader)

  # Trigger removal of h2o back-end objects that got rm'd above, since the rm can be lazy.
  gc()

  # Optional extra one to be paranoid. This is usually very fast.
  gc()

  # Optionally sanity check that you see only what you expect to see here, and not more.
  h2o.ls()

  # Tell back-end cluster nodes to do three back-to-back JVM full GCs.
  h2o:::.h2o.garbageCollect()
  h2o:::.h2o.garbageCollect()
  h2o:::.h2o.garbageCollect()
}
{code}

This worked in the past, so I'm not sure what changed between versions. I updated to the most recent version in order to use the explainability plots.

exalate-issue-sync[bot] commented 1 year ago

Erin LeDell commented: Thanks for the report, @Chris Toh; we will take a look! Which version of H2O were you using before, when it was working?

exalate-issue-sync[bot] commented 1 year ago

Chris Toh commented: @Erin LeDell I believe it was 3.32.0.0. I am running R v4.0.3 on Linux (Ubuntu 20.04.1 LTS).

exalate-issue-sync[bot] commented 1 year ago

Erin LeDell commented: The new explainability plots came out in 3.32.0.1 (we don’t have a .0), so it sounds like you were using a version before that (w/o the explainability).

exalate-issue-sync[bot] commented 1 year ago

Chris Toh commented: Then it was most likely 3.30.1.2

exalate-issue-sync[bot] commented 1 year ago

Chris Toh commented: I attempted to re-install to latest version 3.32.0.3 and got this message in R

{noformat}
> install.packages("h2o", type="source", repos=(c("http://h2o-release.s3.amazonaws.com/h2o/latest_stable_R")))
Warning in install.packages :
  unable to access index for repository https://cloud.r-project.org/src/contrib:
  cannot open URL 'https://cloud.r-project.org/src/contrib/PACKAGES'
Installing package into ‘/home/tohc/R/x86_64-pc-linux-gnu-library/4.0’
(as ‘lib’ is unspecified)
Warning in install.packages :
  unable to access index for repository http://h2o-release.s3.amazonaws.com/h2o/latest_stable_R/src/contrib:
  cannot open URL 'http://h2o-release.s3.amazonaws.com/h2o/latest_stable_R/src/contrib/PACKAGES'
Warning in install.packages :
  package ‘h2o’ is not available for this version of R

A version of this package for your version of R might be available elsewhere,
see the ideas at
https://cran.r-project.org/doc/manuals/r-patched/R-admin.html#Installing-packages
{noformat}

exalate-issue-sync[bot] commented 1 year ago

Sebastien Poirier commented: @Chris Toh have you tried installing the latest version again since then?

It seems to work now:

{noformat}
> install.packages("h2o", type="source", repos=(c("http://h2o-release.s3.amazonaws.com/h2o/latest_stable_R")))
Installing package into ‘/Users/seb/Library/R/4.0/library’
(as ‘lib’ is unspecified)
trying URL 'http://h2o-release.s3.amazonaws.com/h2o/latest_stable_R/src/contrib/h2o_3.32.0.4.tar.gz'
Content type 'application/x-tar' length 164636126 bytes (157.0 MB)

downloaded 157.0 MB
{noformat}

exalate-issue-sync[bot] commented 1 year ago

Chris Toh commented: @Sebastien Poirier Yes, the new version installs successfully now. The original error still occurs.

exalate-issue-sync[bot] commented 1 year ago

Chris Toh commented: There seems to be some sort of memory leak. Even when allocating 32GB to h2o.init() and clearing the memory within the loop, the memory usage still slowly creeps up.
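One way to narrow down where the growth happens is to log the R session's own memory after each iteration. If that stays flat while overall usage climbs, the growth is more likely on the H2O (JVM) side than in R. The following is only a minimal base-R sketch and makes no H2O calls; `mem_used_mb` and `mem_log` are hypothetical names that are not part of the original script, and the `rnorm` call is a stand-in for one AutoML iteration.

```r
# Hypothetical helper: memory currently used by R objects, in MB.
# gc() returns a matrix whose second column is the "used" size in Mb
# for the Ncells and Vcells rows; summing them gives the total.
mem_used_mb <- function() {
  sum(gc()[, 2])
}

mem_log <- numeric(0)  # one entry per loop iteration

for (i in 1:3) {
  # ... one AutoML iteration and its cleanup would go here ...
  x <- rnorm(1e5)  # stand-in workload for this sketch
  rm(x)
  mem_log <- c(mem_log, mem_used_mb())
}

# Change in R-side memory between consecutive iterations;
# consistently positive values suggest objects are being retained in R.
print(diff(mem_log))
```

In the real loop, the call to `mem_used_mb()` would go after the `rm()` and `gc()` cleanup at the end of each iteration.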

h2o-ops commented 1 year ago

JIRA Issue Details

Jira Issue: PUBDEV-7972
Assignee: Sebastien Poirier
Reporter: Chris Toh
State: Open
Fix Version: N/A
Attachments: N/A
Development PRs: N/A