h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
http://h2o.ai
Apache License 2.0

Inadvisable R^2 calculation for non-linear models #12248

Open exalate-issue-sync[bot] opened 1 year ago

exalate-issue-sync[bot] commented 1 year ago

We have been using the standard OLS version of R^2, which is 1-SSE/SST. However, as [~accountid:557058:48e89f1c-a013-4bf8-9fc5-3a33bb40825e] pointed out, this is not appropriate for non-linear models (even though some other popular software like sklearn uses this for any model).

Therefore, we need to implement another R^2 function which can be used for non-linear models. Here is a description of the issue and solution:

{code} For ordinary least squares models, R^2 is the squared multiple correlation. This is mathematically equivalent to 1 - SSE/SST. For all other models, this equivalence does not hold, so the 1 - SSE/SST formula cannot be used. In some cases, that formula can produce negative R^2 values, which is impossible for the square of a real-valued correlation. Instead, we use the squared Pearson correlation between the estimated and observed values of the predicted variable. Other software that uses the 1 - SSE/SST formula for non-OLS models is incorrect. {code}
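To make the difference between the two definitions concrete, here is a minimal Python sketch using only the standard library. The function names `r2_one_minus_sse_sst` and `r2_squared_pearson` and the toy data are hypothetical illustrations, not part of the H2O API or the proposed implementation.

```python
import math


def r2_one_minus_sse_sst(y_true, y_pred):
    """OLS-style R^2: 1 - SSE/SST. Can be negative for non-OLS predictions."""
    mean_true = sum(y_true) / len(y_true)
    sse = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    sst = sum((y - mean_true) ** 2 for y in y_true)
    return 1.0 - sse / sst


def r2_squared_pearson(y_true, y_pred):
    """Squared Pearson correlation between observed and predicted values."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    mean_pred = sum(y_pred) / n
    cov = sum((y - mean_true) * (p - mean_pred) for y, p in zip(y_true, y_pred))
    var_true = sum((y - mean_true) ** 2 for y in y_true)
    var_pred = sum((p - mean_pred) ** 2 for p in y_pred)
    if var_true == 0.0 or var_pred == 0.0:
        # Correlation is undefined when either series is constant.
        return math.nan
    return (cov * cov) / (var_true * var_pred)


# Toy data (assumed for illustration): predictions track the target
# perfectly but are shifted upward by 3.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [4.0, 5.0, 6.0, 7.0]

print("1 - SSE/SST      :", r2_one_minus_sse_sst(y_true, y_pred))
print("squared Pearson r:", r2_squared_pearson(y_true, y_pred))
```

On this toy data the 1 - SSE/SST value is negative (about -6.2), while the squared correlation is 1.0. The same example also shows a trade-off worth keeping in mind: the squared correlation is unchanged by a constant shift or positive rescaling of the predictions, so it does not penalize systematic bias.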

We must allow observation weights; the weighted correlation equation is described here: https://stats.stackexchange.com/questions/221246/such-thing-as-a-weighted-correlation.
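A weighted version could look like the sketch below. It assumes the common definition in which means, variances, and the covariance are all computed as weight-normalized sums; the function name `weighted_r2_squared_pearson` is hypothetical, and the exact formula H2O would adopt is not specified in this issue.

```python
import math


def weighted_r2_squared_pearson(y_true, y_pred, weights):
    """Squared weighted Pearson correlation (assumed weight-normalized sums)."""
    w_sum = sum(weights)
    if w_sum <= 0.0:
        return math.nan
    # Weighted means of observed and predicted values.
    mean_true = sum(w * y for w, y in zip(weights, y_true)) / w_sum
    mean_pred = sum(w * p for w, p in zip(weights, y_pred)) / w_sum
    # Weighted covariance and variances (the common 1/w_sum factor cancels
    # in the ratio below, so it is omitted).
    cov = sum(w * (y - mean_true) * (p - mean_pred)
              for w, y, p in zip(weights, y_true, y_pred))
    var_true = sum(w * (y - mean_true) ** 2 for w, y in zip(weights, y_true))
    var_pred = sum(w * (p - mean_pred) ** 2 for w, p in zip(weights, y_pred))
    if var_true == 0.0 or var_pred == 0.0:
        # Correlation is undefined when either weighted variance is zero.
        return math.nan
    return (cov * cov) / (var_true * var_pred)


# An integer weight of 2 on the first row should behave like duplicating it.
weighted = weighted_r2_squared_pearson([1.0, 2.0, 3.0], [1.0, 3.0, 2.0], [2.0, 1.0, 1.0])
replicated = weighted_r2_squared_pearson(
    [1.0, 1.0, 2.0, 3.0], [1.0, 1.0, 3.0, 2.0], [1.0, 1.0, 1.0, 1.0]
)
print(weighted, replicated)
```

Under this definition, integer weights act like row replication and uniform weights reduce to the unweighted squared correlation, which gives two simple sanity checks for any eventual implementation.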

There is a good reference that we should cite in our docs, noted by a person who pointed this out in sklearn: https://github.com/scikit-learn/scikit-learn/issues/5570

{code} I think the definition should then be changed on a per model basis or it should be changed to corr(y_pred,y_true)**2. The book "Applied Linear Regression" by S Weisberg mentions the issue I address above on page 84 of the third edition. It suggests using corr(y_pred,y_true)^2 for nonlinear models and altering the definition as above for regression through the origin. Finally, with regards to regression, statsmodels does use a different formula for the r2_score depending on whether or not you use the intercept in a regression. {code}

exalate-issue-sync[bot] commented 1 year ago

Erin LeDell commented: Moving this from fix release to 3.22 since it will change numerical results.

exalate-issue-sync[bot] commented 1 year ago

Erin LeDell commented: Moving this from fix release to 3.24 since it will change numerical results.

hasithjp commented 1 year ago

JIRA Issue Migration Info

Jira Issue: PUBDEV-5381
Assignee: UNASSIGNED
Reporter: Erin LeDell
State: Open
Fix Version: N/A
Attachments: N/A
Development PRs: N/A