h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

At the end of the K-means algorithm, strange training metrics are assigned to the output #8543

Closed exalate-issue-sync[bot] closed 1 year ago

exalate-issue-sync[bot] commented 1 year ago

These lines cause the training metrics from the last iteration to be replaced by other, unknown metrics from the DKV. The output should contain the metrics from the last K-means iteration. When this is turned on, the resulting metrics do not match the metrics calculated in any of the iterations. For Constrained K-means in particular, it returns a result that does not meet the stated constraints.

https://github.com/h2oai/h2o-3/blob/master/h2o-algos/src/main/java/hex/kmeans/KMeans.java#L360-L361

The same problem probably affects the validation set.

For example, on the iris dataset, I printed the centroid statistics for every Lloyd iteration:

11-21 16:25:51.714 10.30.0.22:54321 15364 FJ-1-15 INFO: Centroid Size Within Cluster Sum of Squares
11-21 16:25:51.714 10.30.0.22:54321 15364 FJ-1-15 INFO: 1 362 6904.80395
11-21 16:25:51.714 10.30.0.22:54321 15364 FJ-1-15 INFO: 2 10 208.57395
11-21 16:25:51.714 10.30.0.22:54321 15364 FJ-1-15 INFO: 3 8 114.59766
11-21 16:25:51.726 10.30.0.22:54321 15364 FJ-1-15 INFO: Centroid Statistics:
11-21 16:25:51.726 10.30.0.22:54321 15364 FJ-1-15 INFO: Centroid Size Within Cluster Sum of Squares
11-21 16:25:51.726 10.30.0.22:54321 15364 FJ-1-15 INFO: 1 323 2225.26394
11-21 16:25:51.726 10.30.0.22:54321 15364 FJ-1-15 INFO: 2 42 235.42798
11-21 16:25:51.726 10.30.0.22:54321 15364 FJ-1-15 INFO: 3 15 154.36347
11-21 16:25:51.729 10.30.0.22:54321 15364 FJ-1-15 INFO: Centroid Statistics:
11-21 16:25:51.729 10.30.0.22:54321 15364 FJ-1-15 INFO: Centroid Size Within Cluster Sum of Squares
11-21 16:25:51.729 10.30.0.22:54321 15364 FJ-1-15 INFO: 1 264 1750.57349
11-21 16:25:51.729 10.30.0.22:54321 15364 FJ-1-15 INFO: 2 91 406.90851
11-21 16:25:51.729 10.30.0.22:54321 15364 FJ-1-15 INFO: 3 25 299.63215
11-21 16:25:51.732 10.30.0.22:54321 15364 FJ-1-15 INFO: Centroid Statistics:
11-21 16:25:51.732 10.30.0.22:54321 15364 FJ-1-15 INFO: Centroid Size Within Cluster Sum of Squares
11-21 16:25:51.732 10.30.0.22:54321 15364 FJ-1-15 INFO: 1 209 1297.35668
11-21 16:25:51.732 10.30.0.22:54321 15364 FJ-1-15 INFO: 2 135 622.72886
11-21 16:25:51.732 10.30.0.22:54321 15364 FJ-1-15 INFO: 3 36 419.02747

The resulting centroid statistics on the training data are completely different:

11-21 16:25:51.749 10.30.0.22:54321 15364 FJ-1-15 INFO: Centroid Statistics:
11-21 16:25:51.749 10.30.0.22:54321 15364 FJ-1-15 INFO: Centroid Size Within Cluster Sum of Squares
11-21 16:25:51.749 10.30.0.22:54321 15364 FJ-1-15 INFO: 1 166 953.09324
11-21 16:25:51.749 10.30.0.22:54321 15364 FJ-1-15 INFO: 2 169 794.90234
11-21 16:25:51.749 10.30.0.22:54321 15364 FJ-1-15 INFO: 3 45 486.23239
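To make the mismatch concrete, here is a minimal Python sketch that compares the centroid statistics logged for the last Lloyd iteration with the ones reported in the final output. The numbers are copied from the logs above. The helper name `compare_centroid_stats` and the tolerance are assumptions made for illustration only and are not part of H2O.

```python
# Centroid statistics copied from the logs above: (size, within-cluster sum of squares).
last_iteration = [(209, 1297.35668), (135, 622.72886), (36, 419.02747)]
reported_final = [(166, 953.09324), (169, 794.90234), (45, 486.23239)]


def compare_centroid_stats(expected, actual, tol=1e-5):
    """Hypothetical helper: list (centroid, field, expected, actual) mismatches."""
    mismatches = []
    pairs = zip(expected, actual)
    for idx, ((exp_size, exp_ss), (act_size, act_ss)) in enumerate(pairs, start=1):
        if exp_size != act_size:
            mismatches.append((idx, "size", exp_size, act_size))
        if abs(exp_ss - act_ss) > tol:
            mismatches.append((idx, "within_ss", exp_ss, act_ss))
    return mismatches


mismatches = compare_centroid_stats(last_iteration, reported_final)
total_last = sum(size for size, _ in last_iteration)
total_final = sum(size for size, _ in reported_final)

print("rows in last iteration:", total_last)
print("rows in final output:  ", total_final)
for centroid, field, exp, act in mismatches:
    print(f"centroid {centroid}: {field} expected {exp}, got {act}")
```

Both sets of statistics cover the same total number of rows, yet every centroid differs in both size and within-cluster sum of squares. That is consistent with the rows being reassigned after the last iteration rather than the last iteration's metrics being reused.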

exalate-issue-sync[bot] commented 1 year ago

Veronika Maurerová commented: Resolved in https://0xdata.atlassian.net/browse/PUBDEV-6966

h2o-ops commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-7097
Assignee: Veronika Maurerová
Reporter: Veronika Maurerová
State: Resolved
Fix Version: N/A
Attachments: N/A
Development PRs: N/A