tidymodels / TMwR

Code and content for "Tidy Modeling with R"
https://tmwr.org

Figure 9.1 looks confusing #366

Closed thrkng closed 1 year ago

thrkng commented 1 year ago

Figure 9.1 and the related description seem to claim that minimizing RMSE and maximizing the coefficient of determination ($R^2$) are different goals that lead to different predictions, which is confusing to me. I suppose $R^2$ is defined as $R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$, which reduces to $R^2 = 1 - \frac{n \cdot \mathrm{RMSE}^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$. Because $y_i$ and $\bar{y}$ do not change for a given dataset, minimizing RMSE should be equivalent to maximizing $R^2$. Have I misunderstood anything...? (Maybe the coefficient of determination is defined differently?)

And I suppose RMSE and MAE (mean absolute error) would be an example of a pair of metrics that clearly lead to different predictions. Thank you for the great guide anyway.
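A quick way to see that RMSE and MAE can prefer different predictions: among constant predictors, RMSE is minimized by the mean of the outcomes while MAE is minimized by the median, so the two metrics disagree whenever the data are skewed. A minimal sketch with made-up numbers (in Python for brevity; the book's own code is R, but the arithmetic is identical):

```python
import statistics

# Hypothetical skewed outcomes: one large value pulls the mean off the median.
y = [1.0, 2.0, 3.0, 4.0, 100.0]

mean_pred = statistics.mean(y)      # minimizes squared error
median_pred = statistics.median(y)  # minimizes absolute error

def rmse(y, pred):
    return (sum((yi - pred) ** 2 for yi in y) / len(y)) ** 0.5

def mae(y, pred):
    return sum(abs(yi - pred) for yi in y) / len(y)

print(rmse(y, mean_pred) < rmse(y, median_pred))  # True: the mean wins on RMSE
print(mae(y, median_pred) < mae(y, mean_pred))    # True: the median wins on MAE
```

So a model tuned to minimize RMSE and one tuned to minimize MAE really can settle on different predictions.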

topepo commented 1 year ago

minimizing RMSE should be equivalent to maximizing $R^2$.

That's not always the case. $R^2$ measures correlation ($R$), not accuracy. You can have a tight correlation between the observed and predicted values but with an intercept that is not zero. This would give you a good $R^2$ but poor RMSE since some of the errors are large.

In practice, this can happen with tree ensembles that have shallow trees.

$R^2$ is normalized to a fraction by the total sum of squares and that makes it a relative measure; RMSE is an absolute measure of performance.
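To make the correlation-versus-accuracy point concrete, here is a small numeric sketch (hypothetical values, written in Python rather than the book's R): the predictions track the observed values perfectly but with a constant offset, so the squared correlation is 1 while the RMSE is large.

```python
# Hypothetical observed values; predictions are perfectly correlated but biased.
y    = [1.0, 2.0, 3.0, 4.0, 5.0]
pred = [yi + 10.0 for yi in y]  # tight correlation, nonzero intercept

def pearson(a, b):
    # Plain Pearson correlation, written out to keep the sketch dependency-free.
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (z - mb) for x, z in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((z - mb) ** 2 for z in b)
    return cov / (va * vb) ** 0.5

r2 = pearson(y, pred) ** 2  # squared correlation: blind to the offset
rmse = (sum((yi - pi) ** 2 for yi, pi in zip(y, pred)) / len(y)) ** 0.5

print(r2)    # 1.0: looks perfect
print(rmse)  # 10.0: every prediction is off by 10
```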

thrkng commented 1 year ago

Thank you for the reply. According to Wikipedia and other sources, $R^2$ as defined by the equation I wrote above equals the square of the correlation between the observed and predicted values only under certain limited conditions. So to be clear, $R^2$ here is defined as the square of the correlation between the observed and predicted values, but not necessarily as $1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$, correct?

topepo commented 1 year ago

$R^2$ defined as the equation

Sorry to be picky, but that is not the case. The coefficient of determination parameter is defined to be the proportion of the variation in the dependent variable that is predictable from the independent variable(s).

The equation you show is an estimator of $R^2$ derived from linear regression models. They are different estimators (and see Kvalseth (1985) for even more).

We tend to use the square of the estimated correlation because 1) it is not tied to linear models and 2) it works better when the estimate is near zero.
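The gap between the two estimators can be checked numerically. In this sketch (made-up values, in Python rather than R), the predictions are offset by a constant: the linear-regression-style estimator $1 - SSE/SST$ goes strongly negative, while the squared-correlation estimator stays at 1.

```python
# Hypothetical observed values and constant-offset predictions.
y    = [1.0, 2.0, 3.0, 4.0, 5.0]
pred = [yi + 10.0 for yi in y]

n = len(y)
ybar = sum(y) / n
sse = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))  # residual sum of squares
sst = sum((yi - ybar) ** 2 for yi in y)               # total sum of squares

# Estimator 1: the linear-regression formula; unbounded below when the
# predictions do not come from a least-squares fit to this data.
r2_linear = 1 - sse / sst

# Estimator 2: the squared Pearson correlation between observed and predicted.
def pearson(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (z - mb) for x, z in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((z - mb) ** 2 for z in b)
    return cov / (va * vb) ** 0.5

r2_corr = pearson(y, pred) ** 2

print(r2_linear)  # -49.0
print(r2_corr)    # 1.0
```

The two estimators agree only when the predictions come from a least-squares linear fit to the same data; for arbitrary models they can diverge this dramatically.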

thrkng commented 1 year ago

Thank you for the clarification. That gave me an understanding of the multiple definitions of $R^2$. And I now see that rsq() in the yardstick package is documented as "simply the squared correlation between truth and estimate", just as you told me. I really appreciate your kind help and, of course, the great books and packages!

github-actions[bot] commented 1 year ago

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.