exalate-issue-sync[bot] opened 1 year ago
Erin LeDell commented: Thanks for the report, [~accountid:6009ade2ea0e64006b75e7ea]; we will take a look! What version of H2O were you using before when it was working?
Chris Toh commented: @Erin LeDell I believe it was 3.32.0.0. I am running R v4.0.3 on Ubuntu 20.04.1 LTS.
Erin LeDell commented: The new explainability plots came out in 3.32.0.1 (we don’t have a .0), so it sounds like you were using a version before that (w/o the explainability).
Chris Toh commented: Then it was most likely 3.30.1.2
Chris Toh commented: I attempted to re-install the latest version (3.32.0.3) and got this message in R:
{noformat}
> install.packages("h2o", type="source", repos=(c("http://h2o-release.s3.amazonaws.com/h2o/latest_stable_R")))
Warning in install.packages :
  unable to access index for repository https://cloud.r-project.org/src/contrib:
  cannot open URL 'https://cloud.r-project.org/src/contrib/PACKAGES'
Installing package into ‘/home/tohc/R/x86_64-pc-linux-gnu-library/4.0’
(as ‘lib’ is unspecified)
Warning in install.packages :
  unable to access index for repository http://h2o-release.s3.amazonaws.com/h2o/latest_stable_R/src/contrib:
  cannot open URL 'http://h2o-release.s3.amazonaws.com/h2o/latest_stable_R/src/contrib/PACKAGES'
Warning in install.packages :
  package ‘h2o’ is not available for this version of R

A version of this package for your version of R might be available elsewhere,
see the ideas at
https://cran.r-project.org/doc/manuals/r-patched/R-admin.html#Installing-packages
{noformat}
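For reference, the install recipe below follows the pattern shown on the H2O download page (hedged: the `latest_stable_R` repo URL is the same one used in the failing command above, which the warnings show can be transiently unreachable; retrying later is often enough). Removing any previously installed `h2o` package first avoids mixing versions:

```r
# Remove any previously installed H2O package for R before upgrading.
if ("package:h2o" %in% search()) detach("package:h2o", unload = TRUE)
if ("h2o" %in% rownames(installed.packages())) remove.packages("h2o")

# Install the latest stable h2o from the H2O release repository.
install.packages("h2o", type = "source",
                 repos = "http://h2o-release.s3.amazonaws.com/h2o/latest_stable_R")
```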
Sebastien Poirier commented: [~accountid:6009ade2ea0e64006b75e7ea] have you retried installing the latest version since?
It seems to work now:
{noformat}downloaded 157.0 MB{noformat}
Chris Toh commented: @Sebastien Poirier Yes, the new version installed successfully now, but the error itself still occurs.
Chris Toh commented: There seems to be some sort of memory leak. Even when allocating 32GB to h2o.init() and clearing the memory within the loop, the memory usage still slowly creeps up.
JIRA Issue Details
Jira Issue: PUBDEV-7972 Assignee: Sebastien Poirier Reporter: Chris Toh State: Open Fix Version: N/A Attachments: N/A Development PRs: N/A
I am running into an issue with AutoML when it is run in a loop in R. A single AutoML run does not hit this issue; I have tried freeing up memory within the loop, but the error still occurs.
{noformat}Error in .h2o.doSafeREST(h2oRestApiVersion = h2oRestApiVersion, urlSuffix = page, : Unexpected CURL error: getaddrinfo() thread failed to start{noformat}
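A note on the error text: "getaddrinfo() thread failed to start" is raised by curl's threaded DNS resolver when the operating system refuses to create a new thread, which usually points at per-process resource exhaustion rather than a network problem. A hypothetical diagnostic (Linux only, using the standard /proc filesystem; the variable names are illustrative) is to log the R process's thread and file-descriptor counts each loop iteration and watch whether they climb:

```r
# Count this R process's threads and open file descriptors via /proc (Linux).
pid <- Sys.getpid()
n_threads <- length(list.files(sprintf("/proc/%d/task", pid)))
n_fds     <- length(list.files(sprintf("/proc/%d/fd", pid)))
cat("iteration threads:", n_threads, "open fds:", n_fds, "\n")
```

If either count grows steadily across iterations, something (the client connection pool, for example) is not being released.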
{code:R}
# Initialize result vectors
results4 <- c()
predictions4 <- c()
model_types4 <- c()
numModels <- 100
maxRuntime <- 60 # This is in seconds

# Run 100 experiments, i.e. train 100 AutoML models, using a randomized set of
# training data each time. Each model will also have 5-fold cross-validation
# as a base parameter. This section is for the 4-split model.

# Load h2o
library(h2o)
h2o.init(nthreads = 15)

# Create training and validation frames

# Get CNV scale data
condensed <- read.csv("/data/ukbiobank/ukb_l2r_ids_allchr_condensed_4splits.txt", sep = " ")

# AD data
my_ukb_data <- ukb_df("ukb39651", path = "/data/ukbiobank")
my_data <- select(my_ukb_data, eid,
                  datereported = date_f00_first_reported_dementia_in_alzheimers_disease_f130836_0_0,
                  sourcereported = source_of_report_of_f00_dementia_in_alzheimers_disease_f130837_0_0)

# Get age-related information
my_ukb_data_cancer <- ukb_df("ukb29274", path = "/data/ukbiobank/cancer")
my_data_age <- select(my_ukb_data_cancer, eid, yearBorn = year_of_birth_f34_0_0)

# Merge with CNV data
all_data <- merge(condensed, my_data, by.x = "ids", by.y = "eid")
all_data <- merge(all_data, my_data_age, by.x = "ids", by.y = "eid")

alzheimers <- all_data[!is.na(all_data[, "datereported"]), ]
no_alzheimers_initial <- all_data[is.na(all_data[, "datereported"]), ]

# Get breakdown of patients by age
alzheimers_age <- table(alzheimers$yearBorn)

for (i in 1:numModels) {
  # Randomly get non-disease patients for controls so that there is an equal
  # amount based on age. This ensures the controls are age-matched to the
  # disease sample. For example, if 5 patients born in 1937 have AD, we
  # randomly grab 5 other patients born in 1937 who do not have AD.
  no_alzheimers <- data.frame(matrix(ncol = ncol(no_alzheimers_initial), nrow = 0))
  colnames(no_alzheimers) <- colnames(no_alzheimers_initial)
  for (j in 1:length(alzheimers_age)) {  # use j: reusing i would shadow the outer loop index
    temp <- alzheimers_age[j]
    age_check <- as.numeric(names(temp))
    number_cases <- as.numeric(unname(temp))
    possible_controls <- no_alzheimers_initial %>% filter(yearBorn == age_check)
    no_alzheimers <- rbind(no_alzheimers,
                           possible_controls[sample(nrow(possible_controls), number_cases, replace = TRUE), ])
  }

  alzheimers$datereported <- TRUE
  no_alzheimers$datereported <- FALSE

  ind <- sample(c(TRUE, FALSE), nrow(alzheimers), replace = TRUE, prob = c(0.7, 0.3)) # Random split
  train <- alzheimers[ind, ]
  validate <- alzheimers[!ind, ]

  controls <- no_alzheimers # get controls
  train_controls <- controls[ind, ]
  validate_controls <- controls[!ind, ]

  # Combine controls with samples
  train <- rbind(train, train_controls)
  validate <- rbind(validate, validate_controls)

  # Set response column to factor
  train$datereported <- as.factor(train$datereported)
  validate$datereported <- as.factor(validate$datereported)

  # Remove unnecessary columns
  train <- train[, !names(train) %in% c("ids", "sex", "behavior")]
  validate <- validate[, !names(validate) %in% c("ids", "sex", "behavior")]

  # Load data into h2o
  train.hex <- as.h2o(train, destination_frame = "train.hex")
  validate.hex <- as.h2o(validate, destination_frame = "validate.hex")

  # Response column
  response <- "datereported"

  # Get predictors
  predictors <- colnames(train)
  predictors <- predictors[!predictors %in% response]         # Response cannot be a predictor
  predictors <- predictors[!predictors %in% "yearBorn"]       # Exclude the age-matching column
  predictors <- predictors[!predictors %in% "sourcereported"] # Exclude the report-source column

  model <- h2o.automl(x = predictors, y = response,
                      training_frame = train.hex,
                      validation_frame = validate.hex,
                      nfolds = 5,
                      max_runtime_secs = maxRuntime)

  # Record the leading model's AUC in the dataset
  leader <- model@leader
  auc <- h2o.auc(leader, train = FALSE, xval = TRUE)
  results4 <- c(results4, auc)
  model_types4 <- c(model_types4, leader@algorithm)

  # Attempt predict on validation frame
  prediction <- h2o.predict(object = leader, newdata = validate.hex)
  as.data.frame(prediction)
  summary(prediction, exact_quantiles = TRUE)

  validation.perf <- h2o.performance(leader, train = FALSE, xval = TRUE, newdata = validate.hex)
  validation.perf.auc <- validation.perf@metrics$AUC
  predictions4 <- c(predictions4, validation.perf.auc)

  h2o.removeAll()
  rm(train.hex, validate.hex, model, leader)

  # Trigger removal of the h2o back-end objects that got rm'd above, since the rm can be lazy.
  gc()
  # Optional extra one to be paranoid; this is usually very fast.
  gc()
  # Optionally sanity-check that you see only what you expect to see here, and not more.
  h2o.ls()
  # Tell back-end cluster nodes to do three back-to-back JVM full GCs.
  h2o:::.h2o.garbageCollect()
  h2o:::.h2o.garbageCollect()
  h2o:::.h2o.garbageCollect()
}
{code}
This worked in the past, so I'm not sure what changed between versions; I updated to the most recent version in order to use the explainability plots.