marjoleinF / pre

an R package for deriving Prediction Rule Ensembles

Better deal with errors about new factor levels #27

Closed. marjoleinF closed this issue 3 years ago

marjoleinF commented 3 years ago

For factors with rare levels, it is quite likely that at least one of the ntrees subsamples, each drawn with the (default) sampfrac of .5, will not contain a given rare level. Anything fitted on such a subsample has never seen that level, which yields an error when predicting for test observations that do contain it, e.g.:

Error in model.frame.default(delete.response(object$terms), newdata, xlev = xlev) : factor occupation has new levels Armed-Forces
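To get a feel for how likely this is, here is a rough back-of-the-envelope sketch in base R. It assumes subsampling without replacement and independent subsamples; the sample size, the count of the rare level and the number of subsamples are illustrative values, not numbers taken from this issue.

```r
# Illustrative values only (assumptions, not taken from the issue)
n_obs  <- 1000       # training observations
n_rare <- 5          # observations carrying the rare factor level
size   <- n_obs / 2  # subsample size for a sampling fraction of .5
ntrees <- 500        # number of subsamples, assumed for illustration

# P(one subsample, drawn without replacement, contains none of the rare rows):
# hypergeometric probability of drawing 0 "rare" rows in `size` draws
p_miss_one <- dhyper(0, m = n_rare, n = n_obs - n_rare, k = size)

# P(at least one of the ntrees independent subsamples misses the level)
p_miss_any <- 1 - (1 - p_miss_one)^ntrees

print(c(single_subsample = p_miss_one, any_subsample = p_miss_any))
```

With these illustrative numbers a single subsample misses the level roughly 3% of the time, so across 500 subsamples at least one miss is practically certain.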

There are several potential ways of dealing with this:

1) Allow for new levels when predicting; assume the first level of the factor if a new level is observed. Not very easy to implement currently, and problematic, because which level of a factor comes first is often quite arbitrary.

2) Increase sampfrac. Easy to do, but problematic, because it is not guaranteed to be successful.

3) Supply a stratified sampling function to the sampfrac argument, which forces the rare level of a factor into each subsample. I believe this is the preferred option, but it cannot be employed by default, because the number of factors and rare levels may be so large that stratified sampling is impossible.
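A minimal sketch of what option 3) could look like for a single factor, in base R. The helper name make_stratified_sampler is hypothetical, and the assumed interface (a function that takes n and weights and returns row indices) should be checked against the package documentation for sampfrac before use.

```r
# Hypothetical helper (not part of the package): builds a sampler that draws
# within each level of `strata`, so every level observed in the full data
# also appears in every subsample.
make_stratified_sampler <- function(strata, frac = 0.5) {
  strata <- droplevels(as.factor(strata))
  rows_by_level <- split(seq_along(strata), strata)
  # Assumed interface: n = number of rows; weights is ignored in this sketch
  function(n, weights = NULL) {
    stopifnot(n == length(strata))
    idx <- lapply(rows_by_level, function(rows) {
      # Keep at least one row per level, however rare the level is
      size <- max(1, round(frac * length(rows)))
      # Index via sample.int() to avoid sample()'s behaviour for length-1 input
      rows[sample.int(length(rows), size)]
    })
    sort(unlist(idx, use.names = FALSE))
  }
}

# Toy data: "Armed-Forces" is the rare level
set.seed(42)
occupation <- factor(c(rep("Sales", 60), rep("Tech-support", 38),
                       rep("Armed-Forces", 2)))
sampler <- make_stratified_sampler(occupation, frac = 0.5)
idx <- sampler(length(occupation))
print(table(occupation[idx]))
```

The resulting function would then be passed as the sampfrac argument. Stratifying on several factors at once means crossing their levels (for example with interaction()), which is where the caveat above applies: the number of strata can grow so large that many contain too few rows to sample from.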

TODO for option 3):

marjoleinF commented 3 years ago

Fixed in version 1.0.1: