marjoleinF / pre

an R package for deriving Prediction Rule Ensembles

Error with factor levels #19

carbonmetrics closed this issue 5 years ago

carbonmetrics commented 5 years ago

When I run pre on my dataset, I get this:

> pre(m,train)
Error in model.frame.default(delete.response(object$terms), newdata, xlev = xlev) : 
  factor SAVI has new levels 3

Structure of train dataset:

Classes ‘data.table’ and 'data.frame':  532 obs. of  8 variables:
 $ slope        : num  0.03 0.09 0.03 0.22 0.12 0.04 0.09 0.05 0.08 0.04 ...
 $ dist_bound   : num  259 976 3423 1360 1584 ...
 $ dist_roads   : num  3631 796 2169 1640 624 ...
 $ TWI          : num  12.45 5.64 10.99 6.44 8.51 ...
 $ ASTER        : num  1803 1826 1835 1818 1827 ...
 $ interface_raw: num  1 0.43 0.32 1 0.33 0.9 0.83 1 0.84 0.98 ...
 $ SAVI         : Factor w/ 5 levels "1","2","3","4",..: 4 5 5 4 5 5 5 4 4 4 ...
 $ pb           : Factor w/ 2 levels "0","1": 1 2 2 1 2 1 1 1 1 1 ...
 - attr(*, ".internal.selfref")=<externalptr>

and m:

> m
pb ~ slope + dist_bound + dist_roads + TWI + ASTER + interface_raw + SAVI

Is there a subsample somewhere that does not cover all the factor levels? The dataset is small (532 rows) and attached: train.zip

marjoleinF commented 5 years ago

Thanks for reporting this issue!

The SAVI variable appears to have only a small number of observations with level "3":

> table(train$SAVI)

  1   2   3   4   5 
  0  34   4 282 212 
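To make the reasoning concrete, here is a rough sketch of why four observations are too few. It assumes subsamples of half the data drawn without replacement (sampfrac = 0.5, the same value used in the example function further down) and uses a toy lm() fit, not pre's internals, to reproduce the same kind of error message. The toy data frame `d` and its variables are made up for illustration.

```r
# Counts per SAVI level, copied from the table above
counts <- c("1" = 0, "2" = 34, "3" = 4, "4" = 282, "5" = 212)
n_obs <- sum(counts)             # 532 observations in total
size <- floor(0.5 * n_obs)       # assumed subsample size (sampfrac = 0.5)

# Probability that one subsample drawn without replacement
# contains none of the level-"3" observations (hypergeometric)
p_miss <- dhyper(0, m = counts[["3"]], n = n_obs - counts[["3"]], k = size)
p_miss  # roughly 0.06

# Toy reproduction of the error message in base R (hypothetical data)
set.seed(42)
d <- data.frame(y = rnorm(100),
                f = factor(rep(c("a", "b", "c"), times = c(50, 46, 4))))
fit <- lm(y ~ f, data = d[d$f != "c", ])  # rare level "c" absent when fitting
msg <- tryCatch(predict(fit, newdata = d),
                error = function(e) conditionMessage(e))
msg  # complains that factor f has new levels
```

A miss probability of roughly 6% per subsample is small, but an ensemble is built from many subsamples, so at least one of them is very likely to lack level "3".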

To force every subsample to include observations with SAVI level "3", you can supply a sampling function to the sampfrac argument of pre(). This option is somewhat hidden, but it is documented in the help file:

sampfrac numeric value > 0 and ≤ 1. [... ... ...] Alternatively, a sampling function may be supplied, which should take arguments n (sample size) and weights.
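As a minimal sketch of that interface, going only by the help text quoted above: a sampling function receives n and weights and returns row indices. The name `simple_samp_func` and the fixed fraction of 0.5 are illustrative assumptions, and this version does no stratification at all.

```r
# Hypothetical plain (unstratified) sampling function with the
# documented arguments: n (sample size) and weights.
simple_samp_func <- function(n, weights) {
  # Draw half of the row indices without replacement,
  # with selection probabilities proportional to the weights
  sample(seq_len(n), size = round(0.5 * n), replace = FALSE, prob = weights)
}

set.seed(1)
ids <- simple_samp_func(n = 10, weights = rep(1, 10))
length(ids)  # 5 row indices between 1 and 10
```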

So, we need a stratified subsampling function that returns about sampfrac * n observation ids in total, drawing (sampfrac * number of observations with that level) ids for every level of SAVI. For example:

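## Stratified subsampling by SAVI level; returns row indices.
## Note: `data` is not an argument. It is looked up in the environment where
## this function is defined (here, the global environment).
strat_samp_func <- function(n, weights, sampfrac = .5) {
  out <- integer(0)
  for (i in levels(data$SAVI)) {
    in_level <- data$SAVI == i
    ## Skip empty levels and levels with a single observation
    if (sum(in_level) > 1L) {
      ## Draw floor(sampfrac * level size) rows from this level only
      out <- c(out, sample(seq_len(n), size = floor(sampfrac * sum(in_level)),
                           prob = as.numeric(in_level) * weights,
                           replace = FALSE))
      ## For bootstrap sampling, specify replace = TRUE and sampfrac = 1
    }
  }
  out
}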
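The function above reads data$SAVI from an object named data that must exist outside the function. One possible alternative, sketched here under assumptions, is a small factory that stores the stratification variable in a closure. The name `make_strat_sampler` is hypothetical and not part of package pre, and the sketch assumes the sampler is always called with n equal to the number of rows the strata were taken from.

```r
# Hypothetical helper: builds a sampling function with arguments
# (n, weights) that stratifies on the supplied factor.
make_strat_sampler <- function(strata, sampfrac = 0.5) {
  strata <- droplevels(as.factor(strata))  # ignore empty levels
  function(n, weights = rep(1, n)) {
    # Guard: the stored strata must match the data being sampled
    stopifnot(n == length(strata), length(weights) == n)
    ids <- lapply(levels(strata), function(lev) {
      idx <- which(strata == lev)
      # At least one row per level, otherwise floor(sampfrac * level size)
      k <- max(1, floor(sampfrac * length(idx)))
      idx[sample.int(length(idx), size = k, prob = weights[idx])]
    })
    unlist(ids)
  }
}

# Usage on a made-up factor with the same level counts as SAVI
savi <- factor(rep(c("2", "3", "4", "5"), times = c(34, 4, 282, 212)),
               levels = as.character(1:5))
sampler <- make_strat_sampler(savi)
set.seed(1)
ids <- sampler(n = length(savi))
table(savi[ids])  # every non-empty level is represented
```

With the real data, the sampler would presumably be passed as sampfrac = make_strat_sampler(train$SAVI). Unlike the loop above, this variant keeps one row from levels that have only a single observation, which is a design choice rather than a requirement.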

An example of what this function will do within function pre():

> data <- train
> n <- nrow(train)
> weights <- rep(1, times = n)
> set.seed(1)
> sample_ids <- strat_samp_func(n, weights)
> length(sample_ids)
[1] 266
> head(sample_ids)
[1] 182 285 518 308  60 108

Now, we supply this function to the sampfrac argument:

> set.seed(1)
> pre.strat <- pre(m, data = train, sampfrac = strat_samp_func)
> summary(pre.strat)

Final ensemble with cv error within 1se of minimum: 
  lambda =  0.007349218
  number of terms = 31
  mean cv error (se) = 0.6774531 (0.05782462)

  cv error type : Binomial Deviance

Hope this helps!

carbonmetrics commented 5 years ago

Thanks Marjolein! This helped. An additional small issue was that extractRules does not work with data.table. This was solved with setDF(df).
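For readers hitting the same data.table problem, a small sketch of the conversion follows. It assumes package data.table is installed; the toy table is made up. setDF() converts in place, while as.data.frame() returns a copy and leaves the original untouched.

```r
# Runs only if data.table is available
if (requireNamespace("data.table", quietly = TRUE)) {
  dt <- data.table::data.table(x = 1:3, f = factor(c("a", "b", "a")))
  df_copy <- as.data.frame(dt)  # plain data.frame copy; dt is unchanged
  classes_before <- class(dt)   # still includes "data.table"
  data.table::setDF(dt)         # converts dt itself, by reference
  classes_after <- class(dt)    # now just "data.frame"
}
```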

marjoleinF commented 5 years ago

Excellent, I'm glad it works!

Function extractRules is not part of package pre; perhaps it is from package inTrees? In any case, glad it was solved!