h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

max_hit_ratio_k doesn't get called in the backend #12786

Open exalate-issue-sync[bot] opened 1 year ago

exalate-issue-sync[bot] commented 1 year ago

It seems like max_hit_ratio_k doesn't actually get used in the backend. We should fix this so that, when a user sets this argument, it changes what is shown by the hit_ratio_table() method (Python/R) and by the K-Top Hit Ratio table that is displayed by default when you call model.show().

More details on how this parameter should work can be found here: http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/max_hit_ratio_k.html
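For reference, the hit ratio at k is the fraction of rows whose true class is among the k classes with the highest predicted probability. Below is a minimal sketch in plain R of what a table capped at max_hit_ratio_k rows would look like. The helper `hit_ratios()` and the probabilities are made up for illustration and are not part of the h2o package or its backend implementation.

```r
# Hypothetical helper (NOT part of the h2o package): top-k hit ratios
# computed from a matrix of predicted class probabilities.
hit_ratios <- function(probs, actual, max_k = ncol(probs)) {
  # Rank of the true class in each row when classes are sorted by
  # decreasing probability (1 = the top prediction).
  true_rank <- vapply(seq_len(nrow(probs)), function(i) {
    sum(probs[i, ] > probs[i, actual[i]]) + 1L
  }, integer(1))
  # One row per k, capped at max_k (the role max_hit_ratio_k should play).
  data.frame(
    k = seq_len(max_k),
    hit_ratio = vapply(seq_len(max_k),
                       function(k) mean(true_rank <= k),
                       numeric(1))
  )
}

# Made-up probabilities for 4 rows and 3 classes
probs <- matrix(c(0.7, 0.2, 0.1,
                  0.1, 0.6, 0.3,
                  0.2, 0.3, 0.5,
                  0.5, 0.4, 0.1),
                ncol = 3, byrow = TRUE)
actual <- c(1L, 3L, 3L, 2L)  # true class index for each row

# With max_k = 2 the table has exactly two rows (k = 1 and k = 2)
hr <- hit_ratios(probs, actual, max_k = 2)
print(hr)
```

With the fix in place, the table returned for a model trained with max_hit_ratio_k = 3 would be expected to stop at k = 3 in the same way, rather than listing every class.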

Code to test the issue is below. Note that setting max_hit_ratio_k = 3 and leaving the max_hit_ratio_k parameter unspecified produce the same output.

{code}
library(h2o)
h2o.init()

# import the covtype dataset:
# this dataset is used to classify the correct forest cover type
# original dataset can be found at https://archive.ics.uci.edu/ml/datasets/Covertype
covtype <- h2o.importFile("https://s3.amazonaws.com/h2o-public-test-data/smalldata/covtype/covtype.20k.data")

# convert response column to a factor
covtype[,55] <- as.factor(covtype[,55])

# set the predictor names and the response column name
predictors <- colnames(covtype[1:54])
response <- 'C55'

# split into train and validation sets
covtype.splits <- h2o.splitFrame(data = covtype, ratios = 0.8, seed = 1234)
train <- covtype.splits[[1]]
valid <- covtype.splits[[2]]

# try using the max_hit_ratio_k parameter:
# max_hit_ratio_k does not affect the actual model fit, and is for information
# and inner-H2O calculations
cov_gbm <- h2o.gbm(x = predictors, y = response, training_frame = train,
                   validation_frame = valid, max_hit_ratio_k = 3, seed = 1234)

# print out model results to see the max_hit_ratio_k table
cov_gbm

# note that the table display won't change when you set max_hit_ratio_k less than 7
h2o.hit_ratio_table(cov_gbm, train = TRUE)
{code}

hasithjp commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-5935
Assignee: New H2O Bugs
Reporter: Lauren DiPerna
State: Open
Fix Version: N/A
Attachments: N/A
Development PRs: N/A