marjoleinF / pre

an R package for deriving Prediction Rule Ensembles
58 stars 17 forks source link

Statistical significance of coefficients #29

Closed snvv closed 11 months ago

snvv commented 1 year ago

Thank you for your excellent package!!! I wonder whether is it possible to print the significance of the estimated linear coefficients (e.g., se, t-stat, p-value) Regards

marjoleinF commented 1 year ago

You are welcome!

On the training data supplied to fit the model, there is no way to compute these in a valid manner. Significance tests are not available for the lasso either (well, that's not the whole story, but I will skip the details). The main issue is that if the training data has been used to find and select the rules, and to select the linear terms, this cannot easily be accounted for when computing the test statistic and p-value.

Valid computation of t- and p-values would be done on a new, separate dataset, by using the rules and linear terms that have been selected on a current training dataset. Example:

library("pre")
airq <- airquality[complete.cases(airquality), ]

## Create training and 'new', independent dataset
train_dat <- airq[1:71, ]
new_dat <- airq[72:111, ]

## Fit rule ensemble on training dataset
set.seed(42)
airq.ens <- pre(Ozone ~ ., data = train_dat, relax = TRUE)
plot(airq.ens$glmnet)

## Collect rules retained (retain small set of rules for sake of example)
coefs <- coef(airq.ens, penalty = exp(1), gamma = 0)
coefs[coefs$coefficient != 0,]
##           rule coefficient                 description
## 81 (Intercept)   77.165400                           1
## 7        rule7  -23.461076                  Temp <= 83
## 20      rule22   18.727252     Temp > 76 & Wind <= 6.3
## 43      rule49  -15.910780       Temp <= 87 & Wind > 8
## 54      rule60  -14.357731 Wind > 6.3 & Solar.R <= 149
## 8        rule8  -11.486114     Temp <= 83 & Temp <= 77
## 31      rule35    2.121180       Temp > 77 & Wind <= 8
## 1        rule1   -1.871615                  Temp <= 82
retained <- coefs$description[coefs$coefficient != 0]

## Compute t- and p-values on new dataset

## Compute rules for the new data
eval_rules <- function(data, rule_descr) {
    1L * with(data, eval(parse(text = rule_descr)))
}
new_dat <- cbind(new_dat, 
                   mapply(FUN = eval_rules, rule_descr = retained[-1], ## -1 to omit intercept
                          MoreArgs = list(data = new_dat)))
head(new_dat)
##     Ozone Solar.R Wind Temp Month Day Temp <= 83 Temp > 76 & Wind <= 6.3 Temp <= 87 & Wind > 8
## 111    31     244 10.9   78     8  19          1                       0                     1
## 112    44     190 10.3   78     8  20          1                       0                     1
## 113    21     259 15.5   77     8  21          1                       0                     1
## 114     9      36 14.3   72     8  22          1                       0                     1
## 116    45     212  9.7   79     8  24          1                       0                     1
## 117   168     238  3.4   81     8  25          1                       1                     0
##     Wind > 6.3 & Solar.R <= 149 Temp <= 83 & Temp <= 77 Temp > 77 & Wind <= 8 Temp <= 82
## 111                           0                       0                     0          1
## 112                           0                       0                     0          1
## 113                           0                       1                     0          1
## 114                           1                       1                     0          1
## 116                           0                       0                     0          1
## 117                           0                       0                     1          1
summary(lm(Ozone ~ ., data = new_dat))
## 
## Call:
## lm(formula = Ozone ~ ., data = new_dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -33.906  -8.722  -0.375   4.959  66.402 
## 
## Coefficients: (1 not defined because of singularities)
##                                Estimate Std. Error t value Pr(>|t|)  
## (Intercept)                   257.68875  153.61112   1.678   0.1046 
## Solar.R                         0.02641    0.10419   0.253   0.8018  
## Wind                           -1.92530    1.74799  -1.101   0.2801  
## Temp                           -0.40605    1.02881  -0.395   0.6961  
## Month                         -18.53653   10.79536  -1.717   0.0970 .
## Day                             0.01697    0.52043   0.033   0.9742  
## `Temp <= 83`                  -11.87188   15.54397  -0.764   0.4514  
## `Temp > 76 & Wind <= 6.3`      22.83064   12.71209   1.796   0.0833 .
## `Temp <= 87 & Wind > 8`        -1.02151   14.53575  -0.070   0.9445  
## `Wind > 6.3 & Solar.R <= 149`  -1.55406   17.66235  -0.088   0.9305  
## `Temp <= 83 & Temp <= 77`     -16.01337   12.29165  -1.303   0.2033  
## `Temp > 77 & Wind <= 8`        13.96945   14.55672   0.960   0.3454  
## `Temp <= 82`                         NA         NA      NA       NA  
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
## 
## Residual standard error: 19.73 on 28 degrees of freedom
## Multiple R-squared:  0.7756, Adjusted R-squared:  0.6875 
## F-statistic: 8.799 on 11 and 28 DF,  p-value: 1.734e-06