nredell / forecastML

An R package with Python support for multi-step-ahead forecasting with machine learning and deep learning algorithms

Error in predict.xgb.Booster(model, x) : Feature names stored in `object` and `newdata` are different! #34

Closed edgBR closed 4 years ago

edgBR commented 4 years ago

Dear Nick,

I was trying to make some predictions on the training set with both lasso and xgboost. Lasso works perfectly, but with xgboost I get the following error:


> data_pred_cv_lasso <- predict(object = model_results_cv_lasso, prediction_function = list(prediction_function_lasso), 
+                         data = data_train_list)
> View(data_pred_cv_lasso)
> data_pred_cv_xgboost <- predict(object = model_results_cv_xgb, prediction_function = list(prediction_function_xgboost), 
+                         data = data_train_list)
Error in predict.xgb.Booster(model, x) : 
  Feature names stored in `object` and `newdata` are different!
Error in if (is.null(outcome_levels) && method == "direct" && !ncol(data_pred) %in%  : 
  missing value where TRUE/FALSE needed
In addition: Warning message:
In FUN(X[[i]], ...) :
  Model 'xgboost' returned class 'try-error' for model 1 in validation window 1
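
The error above says that the feature names stored in the model differ from those in the data passed to predict(). A minimal base R sketch for finding which names differ is shown here. The helper compare_feature_names() and the example names are hypothetical. In practice the two vectors would come from the fitted booster (assumed to be available as model$feature_names in the xgboost releases of that period) and from colnames() of the prediction matrix.

```r
# Hypothetical helper: report how two sets of feature names differ.
compare_feature_names <- function(model_names, newdata_names) {
  list(
    only_in_model   = setdiff(model_names, newdata_names),
    only_in_newdata = setdiff(newdata_names, model_names),
    same_order      = identical(model_names, newdata_names)
  )
}

# Made-up names for illustration only.
model_names   <- c("snsr_val", "snsr_val_lag_5", "week")
newdata_names <- c("snsr_val_lag_4", "snsr_val_lag_5", "week")

result <- compare_feature_names(model_names, newdata_names)
print(result)
```

A name that appears only on the model side points to a column that was used in training but is absent at prediction time, and vice versa.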

My column names in xgboost are as follows:

  [1] "snsr_val"         "snsr_val_lag_4"   "snsr_val_lag_5"   "snsr_val_lag_6"   "snsr_val_lag_7"   "snsr_val_lag_8"  
  [7] "snsr_val_lag_9"   "snsr_val_lag_10"  "snsr_val_lag_11"  "snsr_val_lag_12"  "snsr_val_lag_13"  "snsr_val_lag_14" 
 [13] "snsr_val_lag_15"  "snsr_val_lag_16"  "snsr_val_lag_17"  "snsr_val_lag_18"  "snsr_val_lag_19"  "snsr_val_lag_20" 
 [19] "snsr_val_lag_21"  "snsr_val_lag_22"  "snsr_val_lag_23"  "snsr_val_lag_24"  "snsr_val_lag_25"  "snsr_val_lag_26" 
 [25] "snsr_val_lag_27"  "snsr_val_lag_28"  "snsr_val_lag_29"  "snsr_val_lag_30"  "snsr_val_lag_31"  "snsr_val_lag_32" 
 [31] "snsr_val_lag_33"  "snsr_val_lag_34"  "snsr_val_lag_35"  "snsr_val_lag_36"  "snsr_val_lag_37"  "snsr_val_lag_38" 
 [37] "snsr_val_lag_39"  "snsr_val_lag_40"  "snsr_val_lag_41"  "snsr_val_lag_42"  "snsr_val_lag_43"  "snsr_val_lag_44" 
 [43] "snsr_val_lag_45"  "snsr_val_lag_46"  "snsr_val_lag_47"  "snsr_val_lag_48"  "snsr_val_lag_49"  "snsr_val_lag_50" 
 [49] "snsr_val_lag_51"  "snsr_val_lag_52"  "snsr_val_lag_53"  "snsr_val_lag_54"  "snsr_val_lag_55"  "snsr_val_lag_56" 
 [55] "snsr_val_lag_57"  "snsr_val_lag_58"  "snsr_val_lag_59"  "snsr_val_lag_60"  "snsr_val_lag_61"  "snsr_val_lag_62" 
 [61] "snsr_val_lag_63"  "snsr_val_lag_64"  "snsr_val_lag_65"  "snsr_val_lag_66"  "snsr_val_lag_67"  "snsr_val_lag_68" 
 [67] "snsr_val_lag_69"  "snsr_val_lag_70"  "snsr_val_lag_71"  "snsr_val_lag_72"  "snsr_val_lag_73"  "snsr_val_lag_74" 
 [73] "snsr_val_lag_75"  "snsr_val_lag_76"  "snsr_val_lag_77"  "snsr_val_lag_78"  "snsr_val_lag_79"  "snsr_val_lag_80" 
 [79] "snsr_val_lag_81"  "snsr_val_lag_82"  "snsr_val_lag_83"  "snsr_val_lag_84"  "snsr_val_lag_85"  "snsr_val_lag_86" 
 [85] "snsr_val_lag_87"  "snsr_val_lag_88"  "snsr_val_lag_89"  "snsr_val_lag_90"  "snsr_val_lag_91"  "snsr_val_lag_92" 
 [91] "snsr_val_lag_93"  "snsr_val_lag_94"  "snsr_val_lag_95"  "snsr_val_lag_96"  "snsr_val_lag_97"  "snsr_val_lag_98" 
 [97] "snsr_val_lag_99"  "snsr_val_lag_100" "snsr_val_lag_101" "snsr_val_lag_102" "snsr_val_lag_103" "snsr_val_lag_104"
[103] "db_src"           "index.num"        "year"             "year.iso"         "half"             "quarter"         
[109] "month"            "month.xts"        "day"              "mday"             "qday"             "yday"            
[115] "mweek"            "week"             "week.iso"         "week2"            "week3"            "week4"           
[121] "mday7"            "snsr_val_roll_4"  "snsr_val_roll_12" "snsr_val_roll_24" "snsr_val_roll_36" "snsr_val_roll_48"
[127] "snsr_val_roll_52"

My grouping column "snsr_key" is missing there but I am not removing it in the training function:


# The value of outcome_col can also be set in train_model() with train_model(outcome_col = 1).
model_function_xgboost <- function(data, outcome_col = 2) {

  # xgboost cannot handle missing outcomes data.
  data <- data %>% drop_na()
  data <- data %>% select_if(negate(is.character))
  indices <- 1:nrow(data)

  set.seed(224)
  train_indices <- sample(1:nrow(data), ceiling(nrow(data) * .8), replace = FALSE)
  test_indices <- indices[!(indices %in% train_indices)]

  data_train <- xgboost::xgb.DMatrix(data = as.matrix(data[train_indices, 
                                                           -(outcome_col), drop = FALSE]),
                                     label = as.matrix(data[train_indices, 
                                                            outcome_col, drop = FALSE]))

  data_test <- xgboost::xgb.DMatrix(data = as.matrix(data[test_indices, 
                                                          -(outcome_col), drop = FALSE]),
                                    label = as.matrix(data[test_indices, 
                                                           outcome_col, drop = FALSE]))

  params <- list("objective" = "reg:squarederror")
  watchlist <- list(train = data_train, test = data_test)

  model <- xgboost::xgb.train(data = data_train, params = params, 
                              max.depth = 8, nthread = 5, nrounds = 100,
                              metrics = "rmse", verbose = 1, 
                              early_stopping_rounds = 5, 
                              watchlist = watchlist)

  return(model)
}

I thought that train_model() picked up the groups attribute from the lagged data frame by default, but for some reason this is no longer working.

Do you have any idea what is going wrong? This worked in version 0.8.0.

BR /Edgar

nredell commented 4 years ago

I have an idea. This comes from a breaking change that I made in the latest version. While the input data to create_lagged_df() can have the outcome in any position, the output of create_lagged_df() now moves the outcome to the first column in each dataset. From help("create_lagged_df"):

The column index–an integer–of the target to be forecasted. If outcome_col != 1, the outcome column will be moved to position 1 and outcome_col will be set to 1 internally.
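
The effect of that reordering on a hard-coded column index can be sketched in base R. This is a toy illustration, not forecastML code: the data frame `lagged` and the helper split_features() are hypothetical, and the column layout only mimics what the help text describes (outcome in position 1).

```r
# Toy stand-in for one data frame returned by create_lagged_df():
# per the help text, the outcome sits in column 1.
lagged <- data.frame(
  snsr_val       = c(10, 11, 12),
  snsr_val_lag_4 = c(6, 7, 8),
  snsr_val_lag_5 = c(5, 6, 7)
)

# Hypothetical helper: split column names into features and label by index.
split_features <- function(data, outcome_col) {
  list(
    features = names(data)[-outcome_col],
    label    = names(data)[outcome_col]
  )
}

stale <- split_features(lagged, outcome_col = 2)  # index from the old layout
fixed <- split_features(lagged, outcome_col = 1)  # index for the new layout

stale$label     # "snsr_val_lag_4": a lag is used as the label
stale$features  # "snsr_val" "snsr_val_lag_5": the outcome leaks into the features
fixed$label     # "snsr_val"
fixed$features  # "snsr_val_lag_4" "snsr_val_lag_5"
```

If this is what happened here, changing the default in model_function_xgboost() to outcome_col = 1 would likely bring the training features back in line with the features supplied at prediction time.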

I can think of 1 or 2 more small, potentially breaking changes that will appear for v1.0.0. After that point the API will remain backward compatible, and any future breaking changes will be fully documented on a website.

Thanks for opening the issue.

nredell commented 4 years ago

https://github.com/nredell/forecastML/issues/35