h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

Discrepancy in confusion matrix for models for > few hundred rows #12321

Open exalate-issue-sync[bot] opened 1 year ago

exalate-issue-sync[bot] commented 1 year ago

Reported on StackOverflow: https://stackoverflow.com/questions/49262383/h2o-python-balance-classes (by [~accountid:5ab23ae918c3bd2a73ff17e7])

When a model is trained with DRF and balance_classes is enabled, users get different results when running

{code}
h2o.confusionMatrix(mpy)
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
                Iris-setosa Iris-versicolor Iris-virginica  Error      Rate
Iris-setosa              37               0              0 0.0000 =  0 / 37
Iris-versicolor           0              33              3 0.0833 =  3 / 36
Iris-virginica            0               3             33 0.0833 =  3 / 36
Totals                   37              36             36 0.0550 = 6 / 109
{code}

and

{code}
h2o.confusionMatrix(mpy, newdata = training_frame)
Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
                Iris-setosa Iris-versicolor Iris-virginica  Error      Rate
Iris-setosa              30               0              0 0.0000 =  0 / 30
Iris-versicolor           0              34              0 0.0000 =  0 / 34
Iris-virginica            0               0             36 0.0000 =  0 / 36
Totals                   30              34             36 0.0000 = 0 / 100
{code}

Same behavior for R & Python
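
To make the discrepancy concrete, here is a small plain-Python sketch that recomputes the row totals and error rates from the two matrices above. It does not call H2O; the counts are copied from the output in this report, and the names `CM_STORED`, `CM_RESCORED`, and `summarize` are illustrative only. Note that the stored matrix covers 109 rows while the re-scored one covers 100. That would be consistent with the stored metrics being computed on a frame resampled by balance_classes, but this thread does not confirm the cause.

```python
# Confusion matrices copied from the report above.
# Rows are actual classes, columns are predicted classes, both in this order:
LABELS = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]

# Output of h2o.confusionMatrix(mpy): metrics stored with the model.
CM_STORED = [
    [37, 0, 0],
    [0, 33, 3],
    [0, 3, 33],
]

# Output of h2o.confusionMatrix(mpy, newdata = training_frame): re-scored.
CM_RESCORED = [
    [30, 0, 0],
    [0, 34, 0],
    [0, 0, 36],
]


def summarize(cm):
    """Return row totals, per-class errors, and the overall error rate."""
    row_totals = [sum(row) for row in cm]
    # Errors for class i are all entries in row i that lie off the diagonal.
    errors = [sum(row) - row[i] for i, row in enumerate(cm)]
    total_rows = sum(row_totals)
    total_errors = sum(errors)
    return {
        "row_totals": row_totals,
        "errors": errors,
        "total_rows": total_rows,
        "total_errors": total_errors,
        "error_rate": total_errors / total_rows,
    }


stored = summarize(CM_STORED)
rescored = summarize(CM_RESCORED)

for name, s in (("stored", stored), ("re-scored", rescored)):
    print(f"{name}: rows={s['total_rows']} errors={s['total_errors']} "
          f"rate={s['error_rate']:.4f} per-class rows={s['row_totals']}")
```

The two summaries differ both in the number of rows scored and in the per-class row counts, so the matrices cannot have been computed on the same set of rows.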

exalate-issue-sync[bot] commented 1 year ago

Michal Kurka commented: Jira PUBDEV-5243 is somewhat related, but the underlying problem is different.

hasithjp commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-5455
Assignee: New H2O Bugs
Reporter: Michal Kurka
State: Open
Fix Version: N/A
Attachments: N/A
Development PRs: N/A