GBM variable 1: Month is not of type numeric, ordered, or factor.

tobigithub commented 7 years ago

With gbm_2.1.1 and R 3.3.1, I get the following error when running benchm-ml/3-boosting/1-gbm.R:


> system.time({
+   md <- gbm(dep_delayed_15min ~ ., data = d_train, distribution = "bernoulli", 
+             n.trees = 1000, 
+             interaction.depth = 16, shrinkage = 0.01, n.minobsinnode = 1,
+             bag.fraction = 0.5)
+ })

Error in gbm.fit(x, y, offset = offset, distribution = distribution, w = w,  : 
  variable 1: Month is not of type numeric, ordered, or factor.
Timing stopped at: 0.02 0 0.01

Tobias

szilard commented 7 years ago

Yes, Month is of character type. Maybe read_csv used to read it as a factor, or gbm used to accept character columns.

To make it work, you can do d_train$Month <- as.factor(d_train$Month), and the same for the other character columns and for d_test.
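A minimal sketch of how to find and convert the offending columns, assuming a small made-up data frame in place of d_train (the column values below are illustrative, not taken from train-1m.csv). It lists the class of every column and then turns all character columns into factors in one pass:

```r
# Toy stand-in for d_train (hypothetical values, for illustration only).
d <- data.frame(
  Month    = c("c-8", "c-4", "c-9"),
  DepTime  = c(1934, 1548, 1422),
  Distance = c(732, 834, 416),
  stringsAsFactors = FALSE
)

# Class of each column; "character" columns are the ones gbm rejects.
sapply(d, class)

# Convert every character column to a factor.
is_chr <- sapply(d, is.character)
d[is_chr] <- lapply(d[is_chr], as.factor)

# Month is now a factor; the numeric columns are unchanged.
sapply(d, class)
```

The same two lines (is_chr and the lapply assignment) should also work on a data frame returned by read_csv, and would need to be applied to d_test as well.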

Btw read this https://github.com/szilard/benchm-ml/issues/35

tobigithub commented 7 years ago

Hi, GBM is usually a low/mid performer, so I don't really know if it's worth the effort to fix that error. GBM Binary classifier performance

Just an observation, I am not a fisherman. It actually reminds me of DLL hell: any R library update can break working R code. It also reminds me of unit testing/Microsoft.

Cheers Tobias

szilard commented 7 years ago

As I said, gbm seems to need factors (now?). You can do:

facCols <- c("UniqueCarrier", "Origin","Dest", "Month", "DayofMonth", "DayOfWeek")
for (k in facCols) {
  dx_train[[k]] <- as.factor(dx_train[[k]])
  dx_test[[k]] <- as.factor(dx_test[[k]])
}

Taken from here: https://github.com/szilard/benchm-ml/blob/master/z-other-tools/8a-Rborist.R (another package that needs that).

tobigithub commented 7 years ago

Hi, it's awfully slow; I doubt anybody could ever run this in sequential mode. The "n.cores" setting does not do much. Not very efficient. Time: 3514.19 seconds on a quad-core CPU (20-fold slower than H2O.gbm) and AUC = 0.7413561 (comparable to H2O.gbm).

Here is the corrected code for gbm_2.1.1, based on benchm-ml/3-boosting/1-gbm.R:


library(readr)
library(ROCR)
library(gbm)

set.seed(123)

d_train <- read_csv("train-1m.csv")
d_test <- read_csv("test.csv")

# Recode the target from "Y"/"N" to 1/0, as required by distribution = "bernoulli"
d_train$dep_delayed_15min <- ifelse(d_train$dep_delayed_15min=="Y",1,0)
d_test$dep_delayed_15min <- ifelse(d_test$dep_delayed_15min=="Y",1,0)

facCols <- c("UniqueCarrier", "Origin","Dest", "Month", "DayofMonth", "DayOfWeek")
numCols <- c("DepTime","Distance")

# gbm rejects character columns, so convert the categorical ones to factors
for (k in facCols) {
  d_train[[k]] <- as.factor(d_train[[k]])
  d_test[[k]] <- as.factor(d_test[[k]])
}

# Make sure the remaining predictors are numeric
for (k in numCols) {
  d_train[[k]] <- as.numeric(d_train[[k]])
  d_test[[k]] <- as.numeric(d_test[[k]])
}

# Train the model and time it
system.time({
  md <- gbm(dep_delayed_15min ~ ., data = d_train, distribution = "bernoulli", 
            n.trees = 1000, 
            interaction.depth = 16, shrinkage = 0.01, n.minobsinnode = 1,
            bag.fraction = 0.5, n.cores = 32)
})

# Predict probabilities on the test set and compute the AUC
phat <- predict(md, newdata = d_test, n.trees = md$n.trees, type = "response")
rocr_pred <- prediction(phat, d_test$dep_delayed_15min)
performance(rocr_pred, "auc")@y.values[[1]]

szilard commented 7 years ago

You still ran it on one core (n.cores is only used for cross-validation). Your result is comparable with the ~3000s in https://github.com/szilard/benchm-ml#boosting-gradient-boosted-treesgradient-boosting-machines (Time (s) A for R, 1M).
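To illustrate where n.cores does come into play, here is a small sketch on made-up toy data (the data frame toy and all parameter values are assumptions for illustration, not the benchmark settings). With cv.folds set, gbm fits the folds in parallel on up to n.cores workers, and gbm.perf then picks the number of trees from the cross-validation error:

```r
library(gbm)

set.seed(123)

# Hypothetical toy data, standing in for the airline data set.
n <- 500
toy <- data.frame(
  y  = rbinom(n, 1, 0.3),
  x1 = rnorm(n),
  x2 = factor(sample(c("a", "b", "c"), n, replace = TRUE))
)

# n.cores only matters here because cv.folds > 1.
md_cv <- gbm(y ~ ., data = toy, distribution = "bernoulli",
             n.trees = 100, interaction.depth = 2, shrinkage = 0.1,
             cv.folds = 3, n.cores = 2)

# Number of trees with the lowest cross-validation error.
best_iter <- gbm.perf(md_cv, method = "cv", plot.it = FALSE)
best_iter
```

Without cv.folds, the same call trains a single model on one core regardless of n.cores.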

H2O was 900s, xgboost 400s - on 32 cores.

If you run multiple models, you can run them in parallel with gbm, whereas with h2o/xgb a single run already saturates all your cores, so you can only run them sequentially. So for that use case gbm is not that bad.
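A rough sketch of that use case, assuming the base parallel package and made-up toy data (the helper fit_one, the data frame toy, and the depth grid are all hypothetical names and values for illustration). Each model is single-threaded, but several are trained at the same time, one per worker:

```r
library(parallel)
library(gbm)

set.seed(123)

# Hypothetical toy data, standing in for the airline data set.
n <- 500
toy <- data.frame(
  y  = rbinom(n, 1, 0.3),
  x1 = rnorm(n),
  x2 = factor(sample(c("a", "b", "c"), n, replace = TRUE))
)

# Hypothetical helper: fit one single-threaded gbm for a given tree depth.
fit_one <- function(depth, data) {
  gbm::gbm(y ~ ., data = data, distribution = "bernoulli",
           n.trees = 50, interaction.depth = depth, shrinkage = 0.1)
}

depths <- c(1, 2, 4)

# mclapply forks worker processes; forking is not available on Windows,
# so fall back to a single worker there.
workers <- if (.Platform$OS.type == "windows") 1 else 2
models <- mclapply(depths, fit_one, data = toy, mc.cores = workers)

length(models)
```

On Windows, parLapply with a cluster from makeCluster would be the usual substitute for mclapply.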

gideonkiplagat commented 2 years ago

I am still a beginner in R and I can't understand the error. Could anyone kindly explain it to me?
