Better deal with errors about new factor levels

For factors with rare levels, it is quite likely that at least one of the ntrees subsamples, each drawn with the (default) sampfrac of .5, will not contain a given rare level. This yields an error when predicting for test observations, e.g.:

Error in model.frame.default(delete.response(object$terms), newdata, xlev = xlev) : factor occupation has new levels Armed-Forces
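
To get a feel for how likely this is, here is a rough sketch of the arithmetic. It assumes subsamples are drawn without replacement and independently of each other, and it uses ntrees = 500 purely as an illustrative value; the helper names p_missing_one and p_missing_any are made up for this example and are not part of pre.

```r
# Probability that a level with k observations is absent from ONE subsample
# of size round(sampfrac * n), drawn without replacement from n rows
# (hypergeometric probability of drawing zero of the k rare rows).
p_missing_one <- function(k, n, sampfrac = 0.5) {
  dhyper(0, m = k, n = n - k, k = round(sampfrac * n))
}

# Probability that the level is absent from AT LEAST ONE of ntrees subsamples,
# assuming the subsamples are drawn independently.
p_missing_any <- function(k, n, sampfrac = 0.5, ntrees = 500) {
  1 - (1 - p_missing_one(k, n, sampfrac))^ntrees
}

p_missing_one(k = 5, n = 1000)  # about 0.03 for a single subsample
p_missing_any(k = 5, n = 1000)  # very close to 1 over 500 subsamples
```

So even a level with five observations in 1000 rows is almost certain to be missing from some subsample, which is why merely increasing sampfrac gives no guarantee.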

There are several potential ways of dealing with this:

1) Allow for new levels when predicting; assume the first level of the factor if a new level is observed. This is not very easy to implement currently, and it is problematic because which level of a factor comes first is often quite arbitrary.
2) Increase sampfrac. This is easy to do, but it is not guaranteed to be successful.
3) Supply a stratified sampling function to the sampfrac argument, which forces the rare levels of a factor into each subsample. I believe this is the preferred option, but it cannot be employed by default, because the number of factors and rare levels may be so large that stratified sampling is impossible.

TODO for option 3):

If an error about new factor levels occurs, print a message stating that the error may be fixed by supplying a stratified sampling function to the sampfrac argument (or merely by increasing its value).
Write and include a function that generates sampling functions: supplied with a sampling fraction and the names of factors with rare levels, it should return a sampling function that guarantees all factor levels will be present in each subsample.
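
A minimal sketch of what such a generator could look like is given below. It is not an existing pre function: the name make_stratified_sampler is hypothetical, and the sketch assumes that a function passed to sampfrac is called with the sample size n (and optionally weights) and must return integer row indices. It also needs the training data itself, not just the factor names, because the level of each row has to be known. Forced rows are picked without regard to weights.

```r
# Hypothetical helper (not part of pre): returns a sampling function that
# always includes at least one row for every level of the named factors.
make_stratified_sampler <- function(data, factor_names, sampfrac = 0.5) {
  # List of row-index vectors, one per observed level of each named factor.
  level_rows <- unlist(
    lapply(factor_names, function(f) {
      split(seq_len(nrow(data)), data[[f]], drop = TRUE)
    }),
    recursive = FALSE
  )
  # Assumed signature for a sampfrac function: n and (optional) weights.
  function(n, weights = rep(1, n)) {
    stopifnot(n == nrow(data))
    size <- round(sampfrac * n)
    # Force one randomly chosen row per level into the sample.
    forced <- unique(vapply(
      level_rows,
      function(rows) rows[sample.int(length(rows), 1)],
      integer(1)
    ))
    if (length(forced) > size) {
      stop("Too many factor levels to stratify at this sampling fraction.")
    }
    # Fill the remainder at random (weighted) from the rows not yet chosen.
    rest <- setdiff(seq_len(n), forced)
    fill <- rest[sample.int(length(rest), size - length(forced),
                            prob = weights[rest])]
    c(forced, fill)
  }
}
```

Under these assumptions, the result would be passed as something like sampfrac = make_stratified_sampler(train, "occupation"), where train is the training data. The stop() branch reflects the caveat in option 3: with many factors and rare levels, the forced rows alone can exceed the subsample size.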

marjoleinF / pre

Better deal with errors about new factor levels #27