holub008 / xrf

eXtreme RuleFit (sparse linear models on XGBoost ensembles)

Gracefully handle NAs in predictors #8

Open holub008 opened 5 years ago

Currently, if the response contains an NA, a clear error message is thrown:

data <- data.frame(x = rnorm(50), y = c(rnorm(49), NA))
m <- xrf(y ~ x, data, family = 'gaussian', xgb_control = list(nrounds = 1, max_depth = 2))

Error in xrf_preconditions(family, xgb_control, glm_control, data, response_var,  : 
  Response variable contains missing values which is not allowed

However, if any predictor contains an NA, the *model.matrix implementation silently drops that row (the default na.action is na.omit). The design matrix then has fewer rows than the response, which results in a confusing error:

data <- data.frame(y = rnorm(50), x = c(rnorm(49), NA))
m <- xrf(y ~ x, data, family = 'gaussian', xgb_control = list(nrounds = 1, max_depth = 2))

Error in setinfo.xgb.DMatrix(dmat, names(p), p[[1]]) : 
  The length of labels must equal to the number of rows in the input data
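
The row mismatch behind this error can be reproduced with base R alone, without xrf or xgboost. The sketch below shows the design matrix losing a row, then one possible early check. The helper name `check_predictor_nas` is hypothetical and is not part of xrf; it only illustrates what a clearer failure might look like.

```r
# Reproduce the silent row drop using only base R.
set.seed(1)
data <- data.frame(y = rnorm(50), x = c(rnorm(49), NA))

# model.matrix() builds its model frame with the default na.action (na.omit),
# so the row with the missing predictor is removed.
design <- model.matrix(y ~ x, data)
nrow(design)    # 49
length(data$y)  # 50

# Hypothetical precondition check (an assumption, not xrf's actual code):
# build the model frame while keeping NAs, then fail with a clear message.
check_predictor_nas <- function(formula, data) {
  mf <- model.frame(formula, data, na.action = na.pass)
  predictors <- mf[-1]  # the first column of the model frame is the response
  has_na <- vapply(predictors, anyNA, logical(1))
  if (any(has_na)) {
    stop("Predictor(s) contain missing values: ",
         paste(names(predictors)[has_na], collapse = ", "))
  }
  invisible(TRUE)
}
```

Calling `check_predictor_nas(y ~ x, data)` on the data above stops with a message naming `x`, which mirrors the existing error for a missing response.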

Several fixes may make sense: