marjoleinF / pre

an R package for deriving Prediction Rule Ensembles
58 stars 17 forks source link

Bug Report: Variable Importances Calculation in `importance.pre()` - Missing Loop Scope for Importance of Factor Variables #32

Closed schumaki closed 7 months ago

schumaki commented 8 months ago

I have identified a potential issue in the calculation of variable importances within the importance.pre() function in the source code here.

Description: It appears that the section responsible for obtaining importances for factors is referencing varimps$varname[i] outside the for loop, thus only capturing the last variable name when calculating factor importances.

This leads to missing importances for factor variables in the returned imps$varimps. Because otherwise only variables with zero importance are missing in imps$varimps, this might create a potentially misleading impression that all factor variables are omitted in the final model, while, in fact, their importances are not accurately captured.

Here is the relevant code snippet:

for(i in 1:nrow(varimps)) {
      ## Get imps from rules and linear functions
      for(j in 1:nrow(baseimps)) {

        ## if the variable name appears in the description (of rule or linear term):
        ##   (Note: EXACT matches are needed, so 1) there should be a space before 
        ##     and after the variable name in the rule and thus 2) there should be 
        ##     a space added before the description of the rule)
        if (grepl(paste0(" ", varimps$varname[i], " "), paste0(" ", baseimps$description[j]))) {
          ## Count the number of times it appears in the rule
          n_occ <- length(gregexpr(paste0(" ", varimps$varname[i], " "),
                                   paste0(" ", baseimps$description[j]), fixed = TRUE)[[1]])
          ## Add to the importance of the variable
          if (x$family %in% c("mgaussian", "multinomial")) {
            varimps[i, gsub("coefficient", "importance", coef_inds)] <- 
              varimps[i, gsub("coefficient", "importance", coef_inds)] + 
            (n_occ * baseimps[j, gsub("coefficient", "importance", coef_inds)] / baseimps$nterms[j])
          } else {
            varimps$imp[i] <- varimps$imp[i] + (n_occ * baseimps$imp[j] / baseimps$nterms[j])
          }
        }
      }
    } # <== this should be further below....

    ## Get imps for factors
    if (is.factor(x$data[ , varimps$varname[i]])) { # check if variable is a factor and add importance
        # !is.ordered(x$data[ , varimps$varname[i]])) {
      ## Sum baseimps$imp for which baseimps$rule has varimps$varname[i] as _part_ of its name
      if (x$family %in% c("mgaussian", "multinomial")) {
        varimps[i, gsub("coefficient", "importance", coef_inds)] <-
          varimps[i, gsub("coefficient", "importance", coef_inds)] +
          colSums(baseimps[grepl(varimps$varname[i], baseimps$rule, fixed = TRUE), 
                           gsub("coefficient", "importance", coef_inds)])
      } else { # otherwise, simply add importance(s)
        varimps$imp[i] <- varimps$imp[i] + 
          sum(baseimps$imp[grepl(varimps$varname[i], baseimps$rule, fixed = TRUE)])
      }
    } 

Expected Behavior: Variable importances should include all variables of the final model, including factors.

Proposed Solution: Move the factor importance calculation section inside the loop to properly scope factor variables.

sachseka commented 7 months ago

I've had the same problem... No (nonbinay) factor importances were calculated. Will there be an update? Thanks!!

schumaki commented 7 months ago

For a (possible) fix, please check my fork: https://github.com/schumaki/pre/commit/d0027e6316f6d91c28b84b3e6b1b44956ac5f92f

If I am not mistaken, it is sufficient to move a bracket to another line.

marjoleinF commented 7 months ago

Thanks for noticing and contributing! Package pre has been updated here on github, and is on its way to CRAN.

marjoleinF commented 7 months ago

Importances for factors are now correctly computed (factors tend to be picked up by the tree structure, to enforce picking up factors as linear terms I needed to use type = "linear"):

library("devtools")
install_github("marjoleinF/pre")
library("pre")
airq$Month <- factor(airq$Month)
set.seed(42)
airq.ens <- pre(Ozone ~ Month, data = airq, type = "linear")
imp <- importance(airq.ens)
imp$baseimps
##    rule description      imp coefficient        sd
## 1 Month7      Month7 3.031066    7.124545 0.4254400
## 2 Month8      Month8 2.577298    6.330198 0.4071434
imp$varimps
##  varname      imp
## 1   Month 5.608365
schumaki commented 7 months ago

Thank you very much for fixing the issue!

In my dataset, the variable importances are now calculated correctly (even picking up some factors as linear terms without type = "linear").

marjoleinF commented 7 months ago

The update is on CRAN now, too. Thanks very much, @schumaki, for reporting the problem and its source, and double checking the update!