h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

Confusion Matrix Does Not Match Predictions #12115

Open exalate-issue-sync[bot] opened 1 year ago

exalate-issue-sync[bot] commented 1 year ago

The confusion matrix does not match the predictions. For example, if I predict on my training data and calculate the number of true positives, this will not match the true positive cell in the confusion matrix.

A Python example is below:

{code:python}
import h2o
from h2o.estimators import H2OGeneralizedLinearEstimator

h2o.init()

# Load the loan data and treat the target as categorical
data = h2o.import_file(
    path="https://raw.githubusercontent.com/h2oai/app-consumer-loan/master/data/loan.csv",
    destination_frame="data.hex",
    col_types={"bad_loan": "enum"},
)

glm = H2OGeneralizedLinearEstimator(family="binomial", lambda_search=True, model_id="glm.hex")
glm.train(y="bad_loan", training_frame=data)

# Confusion matrix reported for the training data
glm.confusion_matrix(train=True)

# True positives counted directly from the predictions on the same data
predictions = data["bad_loan"].cbind(glm.predict(data))
predictions[(predictions["bad_loan"] == "1") & (predictions["predict"] == "1")].nrow
{code}
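To compare all four cells of the matrix rather than only the true positives, the exact counts can be tabulated from the actual and predicted labels. The following is a minimal sketch in plain Python; `confusion_counts` and the toy labels are hypothetical and not part of the H2O API.

```python
def confusion_counts(actual, predicted, positive="1"):
    """Count TP/FP/TN/FN exactly from paired actual and predicted labels."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["tp" if a == positive else "fp"] += 1
        else:
            counts["fn" if a == positive else "tn"] += 1
    return counts


# Toy labels for illustration only (not taken from the loan data)
actual = ["1", "0", "1", "0", "1"]
predicted = ["1", "1", "0", "0", "1"]
counts = confusion_counts(actual, predicted)
print(counts)
```

With the frame from the example above, the two columns could be pulled into Python via `predictions.as_data_frame()` and passed in as lists. Depending on how the labels are parsed, `positive` may need to be `1` rather than `"1"`.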

exalate-issue-sync[bot] commented 1 year ago

Megan Kurka commented: The fns, tns, fps, and tps metrics in the "thresholds_and_metric_scores" table also match the confusion matrix, but not the predictions.

exalate-issue-sync[bot] commented 1 year ago

Daniele Campisano commented: I found the same problem while testing class balancing ([reference|https://stackoverflow.com/questions/49262383/h2o-python-balance-classes]). The real problem there was a discrepancy between model.confusion_matrix(train) and model.model_performance(train=True), where only the latter returns the correct result.

exalate-issue-sync[bot] commented 1 year ago

Michal Kurka commented: Hi [~accountid:5ab23ae918c3bd2a73ff17e7], thanks for reporting. I tested your example and, in my opinion, it is a different bug, so I created a new Jira, PUBDEV-5455. Your issue is triggered by balancing classes.

The issue reported in the original Jira cannot occur on a dataset with fewer than 400 rows. You are testing on the iris dataset (160 rows), so the cause is definitely different.

exalate-issue-sync[bot] commented 1 year ago

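Michal Kurka commented: H2O uses an online version of an algorithm that calculates the histogram of predictions. This is only an approximation of the actual histogram, so metrics derived from it, such as the confusion matrix, may not exactly match counts computed directly from the predictions.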
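As a rough illustration of why a bounded online histogram can disagree with exact counts, here is a minimal sketch. It is not H2O's implementation: the merge rule (combine the two adjacent bins with the closest centroids), the `max_bins` budget, the function names, and the toy data are all assumptions made for illustration.

```python
def online_histogram(rows, max_bins):
    """Build a bounded histogram from a stream of (score, label) pairs.

    Each bin is [centroid, positives, negatives]. Whenever the bin budget
    is exceeded, the two adjacent bins with the closest centroids are merged
    into one bin at their count-weighted mean.
    """
    bins = []
    for score, label in rows:
        bins.append([score, int(label == 1), int(label == 0)])
        bins.sort(key=lambda b: b[0])
        if len(bins) > max_bins:
            i = min(range(len(bins) - 1), key=lambda k: bins[k + 1][0] - bins[k][0])
            left, right = bins[i], bins[i + 1]
            n_left, n_right = left[1] + left[2], right[1] + right[2]
            centroid = (left[0] * n_left + right[0] * n_right) / (n_left + n_right)
            bins[i:i + 2] = [[centroid, left[1] + right[1], left[2] + right[2]]]
    return bins


def true_positives_from_bins(bins, threshold):
    """Count positives in bins whose centroid is at or above the threshold."""
    return sum(pos for centroid, pos, _neg in bins if centroid >= threshold)


def true_positives_exact(rows, threshold):
    """Count positives row by row, with no binning."""
    return sum(1 for score, label in rows if label == 1 and score >= threshold)


# Toy (score, label) pairs for illustration only
rows = [(0.10, 0), (0.44, 0), (0.52, 1), (0.90, 1)]
exact = true_positives_exact(rows, 0.5)
approx = true_positives_from_bins(online_histogram(rows, max_bins=3), 0.5)
unmerged = true_positives_from_bins(online_histogram(rows, max_bins=4), 0.5)
print(exact, approx, unmerged)  # prints: 2 1 2
```

With a budget of 3 bins, the scores 0.44 and 0.52 are merged into one bin centered near 0.48, which falls below the 0.5 threshold, so one true positive is lost. With a budget at least as large as the number of rows, no merge happens and the counts are exact, which is consistent with the remark that the issue cannot occur on small datasets.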

exalate-issue-sync[bot] commented 1 year ago

Nidhi Mehta commented: #93819 (https://support.h2o.ai/a/tickets/93819) - Question on Confusion Matrix

hasithjp commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-5243
Assignee: Michal Kurka
Reporter: Megan Kurka
State: Open
Fix Version: N/A
Attachments: Available (Count: 1)
Development PRs: Available

Linked PRs from JIRA

https://github.com/h2oai/h2o-3/pull/6312

Attachments From Jira

Attachment Name: confusion_matrix.py
Attached By: Lauren DiPerna
File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-5243/confusion_matrix.py