h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

Grid Search over Class_Sampling_Factors Returns Memory Locations instead of Values #10463

Closed. exalate-issue-sync[bot] closed this issue 1 year ago

exalate-issue-sync[bot] commented 1 year ago

If you do a grid search over the parameter class_sampling_factors, the grid search doesn't return the values; instead it returns memory locations (not sure if this is expected). See the screenshot for the output. This is the case for both R and Python.

Here is example code to run:

R
{code}
library(h2o)
h2o.init()

# import the covtype dataset:
# this dataset is used to classify the correct forest cover type
# the original dataset can be found at https://archive.ics.uci.edu/ml/datasets/Covertype
covtype <- h2o.importFile("https://s3.amazonaws.com/h2o-public-test-data/smalldata/covtype/covtype.20k.data")

# convert the response column to a factor
covtype[, 55] <- as.factor(covtype[, 55])

# set the predictor names and the response column name
predictors <- colnames(covtype[1:54])
response <- 'C55'

# split into train and validation sets
covtype.splits <- h2o.splitFrame(data = covtype, ratios = .8, seed = 1234)
train <- covtype.splits[[1]]
valid <- covtype.splits[[2]]

# look at the frequencies of each class
print(h2o.table(covtype['C55']))

# try using the class_sampling_factors parameter:
# since all but Class 2 have similar frequency counts, undersample Class 2
# and do not change the sampling rate of the other classes
# note: class_sampling_factors must be a list of floats
sample_factors <- c(1., 0.5, 1., 1., 1., 1., 1.)
cov_gbm <- h2o.gbm(x = predictors, y = response,
                   training_frame = train, validation_frame = valid,
                   balance_classes = TRUE,
                   class_sampling_factors = sample_factors,
                   seed = 1234)

# print the logloss for the model
print(h2o.logloss(cov_gbm, valid = TRUE))

# grid over class_sampling_factors:
# select the values for class_sampling_factors to grid over
hyper_params <- list(class_sampling_factors = list(c(1., 0.5, 1., 1., 1., 1., 1.),
                                                   c(2., 1., 2., 2., 2., 2., 2.),
                                                   c(4., 0.5, 1., 1., 2., 2., 1.)))

# this example uses cartesian grid search because the search space is small
# and we want to see the performance of all models; for a larger search space,
# use random grid search instead: list(strategy = "RandomDiscrete")

# build the grid search with the previously made GBM and hyper parameters
grid <- h2o.grid(x = predictors, y = response,
                 training_frame = train, validation_frame = valid,
                 algorithm = "gbm", grid_id = "covtype_grid",
                 balance_classes = TRUE,
                 hyper_params = hyper_params,
                 search_criteria = list(strategy = "Cartesian"),
                 seed = 1234)

# sort the grid models by logloss
sortedGrid <- h2o.getGrid("covtype_grid", sort_by = "logloss", decreasing = FALSE)
sortedGrid
{code}

Python
{code}
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator
h2o.init(strict_version_check=False)
h2o.cluster().show_status()

# import the covtype dataset:
# this dataset is used to classify the correct forest cover type
# the original dataset can be found at https://archive.ics.uci.edu/ml/datasets/Covertype
covtype = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/covtype/covtype.20k.data")

# convert the response column to a factor
covtype[54] = covtype[54].asfactor()

# set the predictor names and the response column name
predictors = covtype.columns[0:54]
response = 'C55'

# split into train and validation sets
train, valid = covtype.split_frame(ratios = [.8], seed = 1234)

# look at the frequencies of each class
print(covtype[54].table())

# try using the class_sampling_factors parameter:
# since all but Class 2 have similar frequency counts, undersample Class 2
# and do not change the sampling rate of the other classes
# note: class_sampling_factors must be a list of floats
sample_factors = [1., 0.5, 1., 1., 1., 1., 1.]
cov_gbm = H2OGradientBoostingEstimator(balance_classes = True,
                                       class_sampling_factors = sample_factors,
                                       seed = 1234)
cov_gbm.train(x = predictors, y = response, training_frame = train, validation_frame = valid)

# print the logloss for the model
print('logloss', cov_gbm.logloss(valid = True))

# grid over class_sampling_factors:
# import Grid Search
from h2o.grid.grid_search import H2OGridSearch

# select the values for class_sampling_factors to grid over:
# the first class_sampling_factors is the same as above,
# the second doubles the number of samples for all but Class 2,
# and the third demonstrates a random option
hyper_params = {'class_sampling_factors': [[1., 0.5, 1., 1., 1., 1., 1.],
                                           [2., 1., 2., 2., 2., 2., 2.],
                                           [4., 0.5, 1., 1., 2., 2., 1.]]}

# this example uses cartesian grid search because the search space is small
# and we want to see the performance of all models; for a larger search space,
# use random grid search instead: {'strategy': "RandomDiscrete"}

# initialize the GBM estimator
cov_gbm_2 = H2OGradientBoostingEstimator(balance_classes = True, seed = 1234)

# build the grid search with the previously made GBM and hyper parameters
grid = H2OGridSearch(model = cov_gbm_2, hyper_params = hyper_params,
                     search_criteria = {'strategy': "Cartesian"})

# train using the grid
grid.train(x = predictors, y = response, training_frame = train, validation_frame = valid)

# sort the grid models by increasing logloss
sorted_grid = grid.get_grid(sort_by = 'logloss', decreasing = False)
print(sorted_grid)
{code}
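
For reference, one way to check the values each grid model actually used, rather than relying on the grid summary table where the memory-location strings show up, is to read the parameters back from the individual models. This is only a sketch, assuming the grid object exposes its models via `.models` and that each model's resolved parameters are available through `.actual_params`, as in recent h2o Python releases:

{code}
# sketch of a workaround (assumes `sorted_grid.models` and `model.actual_params`
# are available): read class_sampling_factors back from each grid model
for model in sorted_grid.models:
    csf = model.actual_params.get('class_sampling_factors')
    print(model.model_id, csf, model.logloss(valid=True))
{code}

If this prints the expected lists, that would suggest the problem is limited to how the grid summary table stringifies array-valued hyperparameters.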

exalate-issue-sync[bot] commented 1 year ago

Erin LeDell commented: Ticket is a duplicate of https://0xdata.atlassian.net/browse/PUBDEV-3593

h2o-ops commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-3554
Assignee: New H2O Bugs
Reporter: Lauren DiPerna
State: Closed
Fix Version: N/A
Attachments: Available (Count: 1)
Development PRs: N/A

Attachments From Jira

Attachment Name: Screen Shot 2016-10-13 at 1.16.21 PM.png
Attached By: Lauren DiPerna
File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-3554/Screen Shot 2016-10-13 at 1.16.21 PM.png