h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

the predict function returned some odd results for models built by h2o.kmeans() #12286

Closed exalate-issue-sync[bot] closed 1 year ago

exalate-issue-sync[bot] commented 1 year ago

I created a clustering model using h2o.kmeans(). The modeling dataset was standardized by scale() in R first.

The model has five clusters and the coordinates of the centroids are:

CENTROID X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15 X16 X17 X18 X19 X20 X21 X22
1 -0.646544 -0.6322714 -0.5101907 -0.2980412 -1.6182105 -1.7939725 -1.8194372 -1.82349 -1.8174061 -1.8069266 -2.2213561 -2.2618561 -2.2170297 -2.2004509 -2.196722 -2.2267695 -2.2536694 -2.2653944 -2.1599764 -2.2074994 -1.9114193 -2.78E-16
2 -0.2505012 -0.2582746 -0.2542313 -0.3205136 0.2912933 0.3239872 0.3236214 0.3231876 0.3234663 0.309818 0.362641 0.3800735 0.3615138 0.3542787 0.350817 0.3583391 0.375764 0.3715018 0.3533203 0.3533025 0.2651153 3.72E-15
3 0.4237044 0.4421857 0.408422 0.6620773 0.2371281 0.2592748 0.2597783 0.2782299 0.258803 0.3129833 0.4157714 0.3704712 0.3948566 0.4137049 0.4289137 0.4229101 0.3904031 0.4323851 0.3984215 0.442518 0.5278553 1.00E+00
4 2.2426614 2.2450805 2.0475964 1.5666675 0.2249847 0.2887632 0.3391117 0.3224008 0.3375972 0.3617759 0.5063836 0.4805747 0.5226613 0.5097081 0.5196333 0.5136624 0.4780912 0.4686772 0.4743151 0.5357567 0.5734882 8.24E-01
5 4.4718381 4.5243432 4.8917335 5.223828 0.2374653 0.3096633 0.3215417 0.3326531 0.3189998 0.414707 0.5065842 0.5113028 0.558864 0.5482378 0.543278 0.5436269 0.5204451 0.5341745 0.5096259 0.6486469 0.6595461 9.89E-01

When I use the model to make predictions for new data, the result mostly makes sense: it returns the cluster whose centroid has the shortest Euclidean distance to the data point. Sometimes, however, the prediction is off. For example, take the data point below:

X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15 X16 X17 X18 X19 X20 X21 X22
-0.2001578 -0.2485784 -0.3008685 -0.005366991 0.2624246 0.3142725 0.3074037 0.3221539 0.3033765 0.3403944 0.3557642 0.3810387 0.4848038 0.2788213 0.544491 0.2838926 0.2899755 0.3963652 0.2594092 0.3083141 0.463528 1

The prediction is cluster 3; however, the Euclidean distances between the data point and the centroids are:

cluster 1: 10
cluster 2: 1.11
cluster 3: 1.39
cluster 4: 4.53
cluster 5: 9.97

Based on the calculation above, the data point should be assigned to cluster 2, not 3.
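
The manual check described above can be reproduced with a short sketch. It is plain Python rather than the H2O API, it uses only the centroids for clusters 2 and 3 and the data point quoted above, and it assumes the distance is measured on the values exactly as listed, with no further standardization.

```python
import math

# Centroid coordinates (X1..X22) for clusters 2 and 3, copied from the table above.
centroids = {
    2: [-0.2505012, -0.2582746, -0.2542313, -0.3205136, 0.2912933, 0.3239872,
        0.3236214, 0.3231876, 0.3234663, 0.309818, 0.362641, 0.3800735,
        0.3615138, 0.3542787, 0.350817, 0.3583391, 0.375764, 0.3715018,
        0.3533203, 0.3533025, 0.2651153, 3.72e-15],
    3: [0.4237044, 0.4421857, 0.408422, 0.6620773, 0.2371281, 0.2592748,
        0.2597783, 0.2782299, 0.258803, 0.3129833, 0.4157714, 0.3704712,
        0.3948566, 0.4137049, 0.4289137, 0.4229101, 0.3904031, 0.4323851,
        0.3984215, 0.442518, 0.5278553, 1.0],
}

# The new data point (X1..X22) from the example above.
point = [-0.2001578, -0.2485784, -0.3008685, -0.005366991, 0.2624246,
         0.3142725, 0.3074037, 0.3221539, 0.3033765, 0.3403944, 0.3557642,
         0.3810387, 0.4848038, 0.2788213, 0.544491, 0.2838926, 0.2899755,
         0.3963652, 0.2594092, 0.3083141, 0.463528, 1.0]


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


distances = {k: euclidean(point, c) for k, c in centroids.items()}
nearest = min(distances, key=distances.get)

for k, d in distances.items():
    print(f"cluster {k}: {d:.2f}")
print("nearest cluster in the unstandardized space:", nearest)
```

Under that assumption the sketch agrees with the hand calculation: cluster 2 is closer than cluster 3.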

Could anyone look into this discrepancy? Is it a bug, or does h2o.kmeans() use a method other than Euclidean distance for prediction?

I attached an Excel file with the coordinates of the five centroids and the new data point.

Thank you.

exalate-issue-sync[bot] commented 1 year ago

Michal Kurka commented: Can you please include a MOJO or POJO of this model? That would help us figure out the issue. Also, please attach the k-means parameters used for training.

exalate-issue-sync[bot] commented 1 year ago

Yu Cao commented: Hi Michal:

I won't be able to share the MOJO of the model directly due to compliance concerns. Let me try to replicate the problem with a public dataset and get back to you.

Thanks a lot!

exalate-issue-sync[bot] commented 1 year ago

Michal Kurka commented: Thank you!

exalate-issue-sync[bot] commented 1 year ago

Yu Cao commented: Hi Michal:

I figured out the reason for the discrepancy. There is an argument "standardize" in the h2o.kmeans() function and by default it is TRUE. After setting it to FALSE, there is no discrepancy between the prediction returned by the function and prediction made manually based on the euclidean distance.

I had to scale my dataset before importing it to h2o because I needed to scale it by group, which cannot be achieved by h2o.scale(); probably this conflicted with the standardization made by h2o.kmeans() later and caused the error.

Thank you for the help! You may close the ticket.

Yu Cao

exalate-issue-sync[bot] commented 1 year ago

Michal Kurka commented: Thank you for letting us know. Glad you resolved it!

exalate-issue-sync[bot] commented 1 year ago

Yu Cao commented: Hi Michal:

In fact, I have another request. After I set standardize = FALSE, the discrepancy disappeared; however, the resulting model was not as good as before. Therefore, I may have to switch back to standardize = TRUE.

Could you tell me how the standardize argument works, please? Is it equivalent to applying scale() to every variable used?

Thanks a lot!

Yu Cao

exalate-issue-sync[bot] commented 1 year ago

Michal Kurka commented: Sure!

If standardization is enabled, it is applied to all numerical columns. Standardization will center the values by subtracting the mean of the column and it will scale the values by dividing the value by the standard deviation of the column.
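
As a rough illustration of that description, here is a minimal sketch in plain Python, not H2O's internal code. It assumes the sample standard deviation (n - 1 denominator), which is what R's scale() uses by default; the thread does not say which denominator H2O uses, and the `standardize` helper and `column` values are made up for the example.

```python
import statistics


def standardize(values):
    """Center a numeric column by its mean and scale it by its standard deviation."""
    mean = statistics.mean(values)
    # Assumption: sample standard deviation (n - 1), matching R's scale() default.
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


column = [2.0, 4.0, 6.0]  # hypothetical numeric column
print(standardize(column))
```

The standardized column has mean 0 and standard deviation 1, so columns measured on very different scales contribute comparably to a Euclidean distance.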

exalate-issue-sync[bot] commented 1 year ago

Yu Cao commented: Thanks! So my understanding is that the prediction is based on the Euclidean distance between a standardized data point and the standardized cluster centers, which can be found in @model$center_std. Is that correct?

exalate-issue-sync[bot] commented 1 year ago

Michal Kurka commented: Yes, all the distances are calculated in the standardized space.
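
To see why this matters, here is a toy sketch in plain Python showing that the nearest centroid can change once distances are measured in a standardized space. All numbers, including the training means and standard deviations, are hypothetical and chosen only to make the effect visible; this is not the model from this ticket.

```python
import math


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def standardize_point(p, means, sds):
    """Apply per-column (x - mean) / sd to one point."""
    return [(x - m) / s for x, m, s in zip(p, means, sds)]


def nearest(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda i: euclidean(point, centroids[i]))


# Hypothetical training statistics: column 1 varies a lot, column 2 very little.
means = [0.0, 0.0]
sds = [10.0, 0.1]

centroids = [[1.0, 0.5], [5.0, 0.0]]  # hypothetical centroids in the original space
point = [0.0, 0.0]                    # hypothetical new data point

raw_choice = nearest(point, centroids)
std_choice = nearest(
    standardize_point(point, means, sds),
    [standardize_point(c, means, sds) for c in centroids],
)

print("nearest in the original space:    ", raw_choice)
print("nearest in the standardized space:", std_choice)
```

In the original space the point is closest to the first centroid, but after standardization the small difference in the low-variance column is magnified and the second centroid wins. A manual check done on unstandardized coordinates can therefore disagree with a prediction made in the standardized space.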

exalate-issue-sync[bot] commented 1 year ago

Yu Cao commented: If I use a set called data.dev to build the model and then use the model to make predictions on a new set called data.oot, will data.oot be standardized using the means and standard deviations of data.dev or of data.oot?

Thank you!

exalate-issue-sync[bot] commented 1 year ago

Michal Kurka commented: It will be standardized using the statistics of the training data, in your case data.dev.
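
A minimal sketch of that behavior in plain Python, not the H2O implementation: the means and standard deviations are computed once from the training rows and then reused for the new rows. The helper names `fit_scaler` and `apply_scaler` and the tiny data sets are hypothetical, and the sample standard deviation is an assumption.

```python
import statistics


def fit_scaler(rows):
    """Compute per-column mean and standard deviation from the training rows."""
    columns = list(zip(*rows))
    means = [statistics.mean(c) for c in columns]
    # Assumption: sample standard deviation (n - 1), as in R's scale().
    sds = [statistics.stdev(c) for c in columns]
    return means, sds


def apply_scaler(rows, means, sds):
    """Standardize rows using statistics that were fitted elsewhere."""
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]


data_dev = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]  # hypothetical training set
data_oot = [[4.0, 40.0]]                            # hypothetical new data

means, sds = fit_scaler(data_dev)                # statistics come from data_dev only
oot_std = apply_scaler(data_oot, means, sds)     # data_oot is never used to fit
print(oot_std)
```

Reusing the training statistics keeps new points in the same coordinate system as the standardized cluster centers, so their distances are comparable.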

exalate-issue-sync[bot] commented 1 year ago

Yu Cao commented: Awesome. Thank you!

hasithjp commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-5419
Assignee: Yu Cao
Reporter: Yu Cao
State: Resolved
Fix Version: N/A
Attachments: Available (Count: 1)
Development PRs: N/A

Attachments From Jira

Attachment Name: coordinates.xlsx
Attached By: Yu Cao
File Link: https://h2o-3-jira-github-migration.s3.amazonaws.com/PUBDEV-5419/coordinates.xlsx