rsquaredacademy / olsrr

Tools for developing OLS regression models
https://olsrr.rsquaredacademy.com/
Other
102 stars 22 forks source link

Mallow's Cp behaves inconsistently depending on model specification #196

Closed grantbrown closed 2 years ago

grantbrown commented 2 years ago

Problem

The ols_mallows_cp function behaves differently for lm objects fit with a design matrix and those specified directly with a formula.

Minimal Example

library(olsrr)
data(iris)

MM <- model.matrix(Sepal.Length ~ ., data = iris)[,-1]

fullmodel = lm(Sepal.Length ~ MM, data = iris)
coefficients(fullmodel)
ols_mallows_cp(fullmodel, fullmodel)

## Method 2:
MMdf <- as.data.frame(MM)
MMdf$Sepal.Length <- iris$Sepal.Length
fullmodel2 <- lm(Sepal.Length ~ ., data = MMdf)
coefficients(fullmodel2)
ols_mallows_cp(fullmodel2, fullmodel2)

## Method 3:
fullmodel3 <- lm(Sepal.Length ~ ., data = iris)
ols_mallows_cp(fullmodel3, fullmodel3)

Expected Behavior

The Mallow's Cp method should not be sensitive to model specification. Furthermore, we know that we expect to get back a value equal to the number of columns in the design matrix when comparing the largest candidate model to itself.

Cause

Model dimension is measured inconsistently throughout ols-information-criteria.R, and in this case the method does not treat differently specified models consistently. Specifically, I believe that ols_mallows_cp should use model_n_coeffs for both p and q

Demonstration

model_rows <- function(model) {
  nrow(model.frame(model))
}
model_n_coeffs <- function(model) {
  length(model$coefficients)
}
full_model_coeffs <- function(model) {
  length(model.frame(model)) 
}

anova_coeffs <- function(model) {
  length(anova(model)[[1]])
}
model_rss <- function(model) {
  sum(residuals(model) ^ 2)
}

mcpout <- function(model, fullmodel, n, p, q) {

  sse <- model_rss(model)
  sec <- (n - (2 * p))
  mse <- rev(anova(fullmodel)[[3]])[1]

  (sse / mse) - sec

}

# Modified Function

ols_mallows_cp <- function(model, fullmodel) {

  #check_model(model)
  #check_model(fullmodel)

  n <- model_rows(model)
  p <- model_n_coeffs(model)
  q <- model_n_coeffs(fullmodel)

  mcpout(model, fullmodel, n, p, q)

}

Check:

ols_mallows_cp(fullmodel, fullmodel)
ols_mallows_cp(fullmodel2, fullmodel2)
ols_mallows_cp(fullmodel3, fullmodel3)
grantbrown commented 2 years ago

Fixed in https://github.com/rsquaredacademy/olsrr/pull/198