marjoleinF / pre

an R package for deriving Prediction Rule Ensembles
58 stars 17 forks source link

Bug Report: Variable Importances Calculation in `importance.pre()` - Missing Loop Scope for Importance of Factor Variables #32

Closed schumaki closed 10 months ago

schumaki commented 11 months ago

I have identified a potential issue in the calculation of variable importances within the importance.pre() function in the source code here.

Description: It appears that the section responsible for obtaining importances for factors is referencing varimps$varname[i] outside the for loop, thus only capturing the last variable name when calculating factor importances.

This leads to missing importances for factor variables in the returned imps$varimps. Because otherwise only variables with zero importance are missing in imps$varimps, this might create a potentially misleading impression that all factor variables are omitted in the final model, while, in fact, their importances are not accurately captured.

Here is the relevant code snippet:

for(i in 1:nrow(varimps)) {
      ## Get imps from rules and linear functions
      for(j in 1:nrow(baseimps)) {

        ## if the variable name appears in the description (of rule or linear term):
        ##   (Note: EXACT matches are needed, so 1) there should be a space before 
        ##     and after the variable name in the rule and thus 2) there should be 
        ##     a space added before the description of the rule)
        if (grepl(paste0(" ", varimps$varname[i], " "), paste0(" ", baseimps$description[j]))) {
          ## Count the number of times it appears in the rule
          n_occ <- length(gregexpr(paste0(" ", varimps$varname[i], " "),
                                   paste0(" ", baseimps$description[j]), fixed = TRUE)[[1]])
          ## Add to the importance of the variable
          if (x$family %in% c("mgaussian", "multinomial")) {
            varimps[i, gsub("coefficient", "importance", coef_inds)] <- 
              varimps[i, gsub("coefficient", "importance", coef_inds)] + 
            (n_occ * baseimps[j, gsub("coefficient", "importance", coef_inds)] / baseimps$nterms[j])
          } else {
            varimps$imp[i] <- varimps$imp[i] + (n_occ * baseimps$imp[j] / baseimps$nterms[j])
          }
        }
      }
    } # <== this should be further below....

    ## Get imps for factors
    if (is.factor(x$data[ , varimps$varname[i]])) { # check if variable is a factor and add importance
        # !is.ordered(x$data[ , varimps$varname[i]])) {
      ## Sum baseimps$imp for which baseimps$rule has varimps$varname[i] as _part_ of its name
      if (x$family %in% c("mgaussian", "multinomial")) {
        varimps[i, gsub("coefficient", "importance", coef_inds)] <-
          varimps[i, gsub("coefficient", "importance", coef_inds)] +
          colSums(baseimps[grepl(varimps$varname[i], baseimps$rule, fixed = TRUE), 
                           gsub("coefficient", "importance", coef_inds)])
      } else { # otherwise, simply add importance(s)
        varimps$imp[i] <- varimps$imp[i] + 
          sum(baseimps$imp[grepl(varimps$varname[i], baseimps$rule, fixed = TRUE)])
      }
    } 

Expected Behavior: Variable importances should include all variables of the final model, including factors.

Proposed Solution: Move the factor importance calculation section inside the loop to properly scope factor variables.

sachseka commented 10 months ago

I've had the same problem... No (nonbinay) factor importances were calculated. Will there be an update? Thanks!!

schumaki commented 10 months ago

For a (possible) fix, please check my fork: https://github.com/schumaki/pre/commit/d0027e6316f6d91c28b84b3e6b1b44956ac5f92f

If I am not mistaken, it is sufficient to move a bracket to another line.

marjoleinF commented 10 months ago

Thanks for noticing and contributing! Package pre has been updated here on github, and is on its way to CRAN.

marjoleinF commented 10 months ago

Importances for factors are now correctly computed (factors tend to be picked up by the tree structure, to enforce picking up factors as linear terms I needed to use type = "linear"):

library("devtools")
install_github("marjoleinF/pre")
library("pre")
airq$Month <- factor(airq$Month)
set.seed(42)
airq.ens <- pre(Ozone ~ Month, data = airq, type = "linear")
imp <- importance(airq.ens)
imp$baseimps
##    rule description      imp coefficient        sd
## 1 Month7      Month7 3.031066    7.124545 0.4254400
## 2 Month8      Month8 2.577298    6.330198 0.4071434
imp$varimps
##  varname      imp
## 1   Month 5.608365
schumaki commented 10 months ago

Thank you very much for fixing the issue!

In my dataset, the variable importances are now calculated correctly (even picking up some factors as linear terms without type = "linear").

marjoleinF commented 10 months ago

The update is on CRAN now, too. Thanks very much, @schumaki, for reporting the problem and its source, and double checking the update!