marjoleinF / pre

an R package for deriving Prediction Rule Ensembles

undefined columns selected #25

Closed Naviden closed 3 years ago

Naviden commented 4 years ago

Running data.ens <- pre(target ~ ., data = data2), I got the following error after about 5 minutes: Error in `[.data.frame`(data, , x_names) : undefined columns selected. I can't understand what I'm doing wrong.

Some info: I have 900 observations, 498 numeric features (0/1) and a factor with 3 levels as target.

Naviden commented 4 years ago

Some additional information. Suspecting that the problem comes from having too many features, I ran some tests, and it seems the magic number is 73: adding the 74th column breaks the code. I also checked another hypothesis: weird column names. The name of the 74th feature was "25k". At first I thought the problem was a feature name starting with a number, but the problem remained after I removed that feature.
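One way to check the column-name hypothesis is to compare the raw names with what base R's make.names() returns, since read.csv() applies the same conversion by default (check.names = TRUE). This is only a sketch: apart from "25k", the names below are made up for illustration and are not taken from data2.csv.

```r
# Illustrative column names; only "25k" is mentioned in the thread
nms <- c("automotive", "25k", "000", "meter")

# make.names() turns each name into a syntactically valid R name
syntactic <- make.names(nms, unique = TRUE)

# Show which names would be altered on import
name_check <- data.frame(
  original  = nms,
  syntactic = syntactic,
  changed   = nms != syntactic
)
print(name_check)
```

Names that start with a digit get an "X" prefix, so a formula or column lookup that still uses the original name would no longer match.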

marjoleinF commented 4 years ago

Hi Naviden,

Thanks for your message. Pre should be able to deal with quite a bit more than 74 features. I cannot reproduce this error with > 74 predictors and a multinomial response. Something else may be amiss.

Can you provide a reproducible example (a subset of the data with a smaller number of observations would be fine)?

If not possible, the result of running traceback() right after the error occurs could be helpful.

Best, Marjolein
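As a rough illustration of the traceback() suggestion: in an interactive session, calling traceback() immediately after the error prints the call stack. The sketch below reproduces the same error message on a toy data frame and captures the call stack in a script-friendly way. The helper select_cols() and the toy data are hypothetical and are not part of pre.

```r
# Toy data frame and a hypothetical helper that selects columns by name
toy <- data.frame(a = 1:3, b = 4:6)
select_cols <- function(data, x_names) data[, x_names]

# Capture the call stack and the error message instead of halting
captured <- NULL
result <- tryCatch(
  withCallingHandlers(
    select_cols(toy, c("a", "missing_column")),
    error = function(e) captured <<- sys.calls()
  ),
  error = function(e) conditionMessage(e)
)

print(result)    # the error message raised by `[.data.frame`
print(captured)  # the calls that led to the error

# Interactively, the equivalent is simply:
# select_cols(toy, c("a", "missing_column"))
# traceback()
```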

Naviden commented 4 years ago

Hi Marjolein, Thanks for your quick answer. Here is the data: https://www.dropbox.com/s/0g7ny1niloqtiqo/data2.csv?dl=0

Navid

marjoleinF commented 4 years ago

I cannot reproduce the error; the model seems to run fine. See the code and output below.

To prepare the data, I did the following:

1) Set the class of target to factor.
2) Eliminate columns with zero variance (not required, but prevents a lot of warnings being printed).

data2 <- read.csv("data2.csv")
dim(data2)
## [1] 900 499
table(sapply(data2, class)) ## Need to set response variable to factor
##
## integer 
##    499 
length(which(sapply(data2, var) != 0)) ## Most variables have zero variance
## [1] 73
data2$target <- factor(data2$target)
check_if_dummy <- function(x) all(x %in% 0:1)
which(!sapply(data2, check_if_dummy))
## target 
##   499 
data2 <- data2[ , -which(sapply(data2[ , -499], mean) == 0)]
dim(data2)
## [1] 900  73
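A note on the column filter above: negative indexing with which() drops every column if no column matches, because -integer(0) selects nothing. A minimal alternative sketch using column names is shown below. It runs on made-up toy data, and unlike the code above it drops any constant predictor (all zeros or all ones), which is an assumption about what is wanted.

```r
# Toy data: x1 varies, x2 and x3 are constant; target is the response
toy <- data.frame(
  x1 = rep(0:1, 10),
  x2 = rep(0L, 20),
  x3 = rep(1L, 20),
  target = factor(rep(c("a", "b", "c"), length.out = 20))
)

# Flag predictors that take fewer than two distinct values
is_pred  <- names(toy) != "target"
constant <- vapply(toy[is_pred], function(x) length(unique(x)) < 2, logical(1))

# Keep the non-constant predictors plus the response, selected by name
keep <- c(names(constant)[!constant], "target")
toy_reduced <- toy[, keep, drop = FALSE]
names(toy_reduced)
```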

To fit the model, I set winsfrac = 0, because all the potential predictors are dummy coded and should not be winsorized. (This is not required, but eliminates many warnings. Alternatively, all predictors could have been coded as factors.)

set.seed(1)
data_ens <- pre(target ~ . , data = data2, winsfrac = 0, family = "multinomial")
## Warning in pre_rules(formula = formula, data = data, weights = weights,  :
##                       No prediction rules could be derived from dataset.
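The alternative mentioned above, coding all predictors as factors instead of setting winsfrac = 0, could look roughly like this. The toy data frame and its column names are invented for illustration; with the real data, the same conversion would be applied to every column except target before calling pre().

```r
# Toy data with 0/1 dummy predictors and a character response
toy <- data.frame(
  x1 = rep(0:1, 5),
  x2 = rep(c(0L, 0L, 1L, 1L, 1L), 2),
  target = rep(c("a", "b"), 5)
)

# Convert every predictor to a two-level factor
pred_names <- setdiff(names(toy), "target")
toy[pred_names] <- lapply(toy[pred_names], factor, levels = 0:1)

# The response must be a factor as well
toy$target <- factor(toy$target)

sapply(toy, class)
```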

I get a warning, but no error. No trees were constructed: there does not seem to be enough information in the data to construct any tree, which results in the warning message. The lasso regression model did pick up some linear terms, however (note that rules would have a name containing "rule" in the rule column):

data_ens
## Final ensemble with cv error within 1se of minimum: 
##   lambda =  0.009463864
##   number of terms = 8
##   mean cv error (se) = 2.188356 (0.003343367)
## 
## cv error type : Multinomial Deviance
## 
##        rule  description  coefficient.2141  coefficient.2142  coefficient.2144
## (Intercept)            1       0.006870646     -1.374272e-02       0.006872076
##  automotive   automotive       0.676422456      0.000000e+00       0.000000000
##       meter        meter       0.000000000      1.482649e+00       0.000000000
##   assistant    assistant       0.000000000      7.072929e-01       0.000000000
##        X000         X000       0.000000000     -1.866800e-01       0.000000000
##           o            o       0.000000000      1.482638e+00       0.000000000
##           t            t       0.000000000      1.155962e-13       0.000000000
##   ordinator    ordinator       0.000000000      1.080452e+00       0.000000000
##         amp          amp       0.000000000      0.000000e+00       0.676419894

Selected R and package info:

sessionInfo()
## R version 3.6.2 (2019-12-12)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 17763)
## 
## other attached packages:
##  [1] glmnet_4.0     Matrix_1.2-18  partykit_1.2-7 mvtnorm_1.0-11 libcoin_1.0-5  pre_1.0.0