Closed exalate-issue-sync[bot] closed 1 year ago
Erin LeDell commented: Ticket is duplicate of: https://0xdata.atlassian.net/browse/PUBDEV-3593
JIRA Issue Migration Info
Jira Issue: PUBDEV-3554
Assignee: New H2O Bugs
Reporter: Lauren DiPerna
State: Closed
Fix Version: N/A
Attachments: Available (Count: 1)
Development PRs: N/A
Attachments From Jira
Attachment Name: Screen Shot 2016-10-13 at 1.16.21 PM.png Attached By: Lauren DiPerna File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-3554/Screen Shot 2016-10-13 at 1.16.21 PM.png
If you do a grid search over the parameter class_sampling_factors, the grid search doesn't return the parameter values; instead it returns what look like memory locations (not sure if this is right). See the screenshot for the output. This is the case for both R and Python. Here is example code to run:

R
{code}
library(h2o)
h2o.init()
# import the covtype dataset:
# This dataset is used to classify the correct forest cover type.
# The original dataset can be found at https://archive.ics.uci.edu/ml/datasets/Covertype
covtype <- h2o.importFile("https://s3.amazonaws.com/h2o-public-test-data/smalldata/covtype/covtype.20k.data")
# convert the response column to a factor
covtype[,55] <- as.factor(covtype[,55])
# set the predictor names and the response column name
predictors <- colnames(covtype[1:54])
response <- 'C55'
# split into train and validation sets
covtype.splits <- h2o.splitFrame(data = covtype, ratios = .8, seed = 1234)
train <- covtype.splits[[1]]
valid <- covtype.splits[[2]]
# look at the frequencies of each class
print(h2o.table(covtype['C55']))
# try using the class_sampling_factors parameter:
# since all but Class 2 have similar frequency counts, undersample Class 2
# and leave the sampling rate of the other classes unchanged.
# note: class_sampling_factors must be a list of floats
sample_factors <- c(1., 0.5, 1., 1., 1., 1., 1.)
cov_gbm <- h2o.gbm(x = predictors, y = response, training_frame = train,
                   validation_frame = valid, balance_classes = TRUE,
                   class_sampling_factors = sample_factors, seed = 1234)
# print the logloss for your model
print(h2o.logloss(cov_gbm, valid = TRUE))
# grid over class_sampling_factors
# select the values for class_sampling_factors to grid over
hyper_params <- list(class_sampling_factors = list(c(1., 0.5, 1., 1., 1., 1., 1.),
                                                   c(2., 1., 2., 2., 2., 2., 2.),
                                                   c(4., 0.5, 1., 1., 2., 2., 1.)))
# this example uses cartesian grid search because the search space is small
# and we want to see the performance of all models. For a larger search space,
# use random grid search instead: search_criteria = list(strategy = "RandomDiscrete")
# build grid search with the previously made GBM and hyper parameters
grid <- h2o.grid(x = predictors, y = response, training_frame = train,
                 validation_frame = valid, algorithm = "gbm", grid_id = "covtype_grid",
                 balance_classes = TRUE, hyper_params = hyper_params,
                 search_criteria = list(strategy = "Cartesian"), seed = 1234)
# sort the grid models by logloss
sortedGrid <- h2o.getGrid("covtype_grid", sort_by = "logloss", decreasing = FALSE)
sortedGrid
{code}
Python
{code}
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator
h2o.init(strict_version_check=False)
h2o.cluster().show_status()
# import the covtype dataset:
# This dataset is used to classify the correct forest cover type.
# The original dataset can be found at https://archive.ics.uci.edu/ml/datasets/Covertype
covtype = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/covtype/covtype.20k.data")
# convert the response column to a factor
covtype[54] = covtype[54].asfactor()
# set the predictor names and the response column name
predictors = covtype.columns[0:54]
response = 'C55'
# split into train and validation sets
train, valid = covtype.split_frame(ratios = [.8], seed = 1234)
# look at the frequencies of each class
print(covtype[54].table())
# try using the class_sampling_factors parameter:
# since all but Class 2 have similar frequency counts, undersample Class 2
# and leave the sampling rate of the other classes unchanged.
# note: class_sampling_factors must be a list of floats
sample_factors = [1., 0.5, 1., 1., 1., 1., 1.]
cov_gbm = H2OGradientBoostingEstimator(balance_classes = True,
                                       class_sampling_factors = sample_factors,
                                       seed = 1234)
cov_gbm.train(x = predictors, y = response, training_frame = train, validation_frame = valid)
# print the logloss for your model
print('logloss', cov_gbm.logloss(valid = True))
# grid over class_sampling_factors
# import Grid Search
from h2o.grid.grid_search import H2OGridSearch
# select the values for class_sampling_factors to grid over:
# the first class_sampling_factors is the same as above,
# the second doubles the number of samples for all but Class 2,
# the third demonstrates a random option
hyper_params = {'class_sampling_factors': [[1., 0.5, 1., 1., 1., 1., 1.],
                                           [2., 1., 2., 2., 2., 2., 2.],
                                           [4., 0.5, 1., 1., 2., 2., 1.]]}
# this example uses cartesian grid search because the search space is small
# and we want to see the performance of all models. For a larger search space,
# use random grid search instead: {'strategy': "RandomDiscrete"}
# initialize the GBM estimator
cov_gbm_2 = H2OGradientBoostingEstimator(balance_classes = True, seed = 1234)
# build grid search with the previously made GBM and hyper parameters
grid = H2OGridSearch(model = cov_gbm_2, hyper_params = hyper_params,
search_criteria = {'strategy': "Cartesian"})
# train using the grid
grid.train(x = predictors, y = response, training_frame = train, validation_frame = valid)
# sort the grid models by increasing logloss
sorted_grid = grid.get_grid(sort_by='logloss', decreasing=False)
print(sorted_grid)
{code}