szilard / benchm-ml

A minimal benchmark for scalability, speed and accuracy of commonly used open source implementations (R packages, Python scikit-learn, H2O, xgboost, Spark MLlib etc.) of the top machine learning algorithms for binary classification (random forests, gradient boosted trees, deep neural networks etc.).
MIT License

GBM variable 1: Month is not of type numeric, ordered, or factor. #43

Closed tobigithub closed 7 years ago

tobigithub commented 7 years ago

For gbm_2.1.1 and R 3.3.1 I get the following error for benchm-ml/3-boosting/1-gbm.R


> system.time({
+   md <- gbm(dep_delayed_15min ~ ., data = d_train, distribution = "bernoulli", 
+             n.trees = 1000, 
+             interaction.depth = 16, shrinkage = 0.01, n.minobsinnode = 1,
+             bag.fraction = 0.5)
+ })

Error in gbm.fit(x, y, offset = offset, distribution = distribution, w = w,  : 
  variable 1: Month is not of type numeric, ordered, or factor.
Timing stopped at: 0.02 0 0.01 

Tobias

szilard commented 7 years ago

Yes, Month is of character type. Maybe read_csv used to read it as factor or gbm used to work with characters.

To make it work you can do d_train$Month <- as.factor(d_train$Month), and the same for the other character columns and for d_test.
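
A quick way to see which columns are affected is to check the column classes before fitting. The sketch below uses a small made-up data frame (d_toy, with invented values) in place of d_train, and converts every character column it finds rather than naming them one by one:

```r
# Toy stand-in for d_train; column values are made up for illustration
d_toy <- data.frame(
  Month = c("c-8", "c-4", "c-9"),
  UniqueCarrier = c("AA", "US", "XE"),
  Distance = c(732, 834, 416),
  stringsAsFactors = FALSE
)

# Class of each column; the character ones are what gbm complains about
col_classes <- sapply(d_toy, class)
print(col_classes)

# Convert every character column to a factor, leave the rest untouched
chr_cols <- names(d_toy)[col_classes == "character"]
for (k in chr_cols) {
  d_toy[[k]] <- as.factor(d_toy[[k]])
}

print(sapply(d_toy, class))
```

With the real data, the response dep_delayed_15min is also read as character, so it would need to be recoded to 0/1 before a blanket conversion like this, otherwise it becomes a factor too.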

Btw read this https://github.com/szilard/benchm-ml/issues/35

tobigithub commented 7 years ago

Hi, GBM is usually a low/mid performer, so I don't really know if it's worth the effort to fix that error. [GBM Binary classifier performance]

Just an observation, I am not a fisherman. Actually reminds me of DLL hell, any R library update destroys any working R code, actually reminds me of unit testing/Microsoft.

Cheers Tobias

szilard commented 7 years ago

As I said gbm seems to need (now?) factors, you can do:

facCols <- c("UniqueCarrier", "Origin","Dest", "Month", "DayofMonth", "DayOfWeek")
for (k in facCols) {
  dx_train[[k]] <- as.factor(dx_train[[k]])
  dx_test[[k]] <- as.factor(dx_test[[k]])
}

taken from here https://github.com/szilard/benchm-ml/blob/master/z-other-tools/8a-Rborist.R (another package that needs that).
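
One general R precaution when converting train and test separately like this: as.factor derives the levels from each data frame on its own, so the two level sets can differ. Whether that matters depends on the package and version, and it is not something reported in this thread. A minimal sketch, with made-up toy data frames (train, test) standing in for dx_train and dx_test, that reuses the training levels for the test set:

```r
# Toy stand-ins for dx_train / dx_test; values are made up for illustration
train <- data.frame(Origin = c("ATL", "ORD", "SFO"), stringsAsFactors = FALSE)
test  <- data.frame(Origin = c("SFO", "JFK"), stringsAsFactors = FALSE)

facCols <- c("Origin")
for (k in facCols) {
  train[[k]] <- as.factor(train[[k]])
  # Reuse the training levels; values not seen in training become NA
  test[[k]] <- factor(test[[k]], levels = levels(train[[k]]))
}

print(levels(train$Origin))
print(levels(test$Origin))
print(test$Origin)
```

Here "JFK" does not occur in the training data, so it ends up as NA in test, which at least makes unseen categories visible before predicting.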

tobigithub commented 7 years ago

Hi, it's awfully slow; I doubt anybody could ever run this in sequential mode. The "n.cores" setting does not do much. Not very efficient. Time: 3514.19 seconds on a quad-core CPU (20-fold slower than H2O.gbm) and AUC = 0.7413561 (comparable to H2O.gbm).

See the corrected code for gbm_2.1.1 from benchm-ml/3-boosting/1-gbm.R:


library(readr)
library(ROCR)
library(gbm)

set.seed(123)

d_train <- read_csv("train-1m.csv")
d_test <- read_csv("test.csv")

d_train$dep_delayed_15min <- ifelse(d_train$dep_delayed_15min=="Y",1,0)
d_test$dep_delayed_15min <- ifelse(d_test$dep_delayed_15min=="Y",1,0)

facCols <- c("UniqueCarrier", "Origin","Dest", "Month", "DayofMonth", "DayOfWeek")
numCols <- c("DepTime","Distance")

for (k in facCols) {
  d_train[[k]] <- as.factor(d_train[[k]])
  d_test[[k]] <- as.factor(d_test[[k]])
}

for (k in numCols) {
  d_train[[k]] <- as.numeric(d_train[[k]])
  d_test[[k]] <- as.numeric(d_test[[k]])
}

system.time({
  md <- gbm(dep_delayed_15min ~ ., data = d_train, distribution = "bernoulli", 
            n.trees = 1000, 
            interaction.depth = 16, shrinkage = 0.01, n.minobsinnode = 1,
            bag.fraction = 0.5, n.cores = 32)
})

phat <- predict(md, newdata = d_test, n.trees = md$n.trees, type = "response")
rocr_pred <- prediction(phat, d_test$dep_delayed_15min)
performance(rocr_pred, "auc")@y.values[[1]]

szilard commented 7 years ago

You still ran it on one core (n.cores is only for cross validation). Your result is comparable with https://github.com/szilard/benchm-ml#boosting-gradient-boosted-treesgradient-boosting-machines ~3000s (Time (s) A for R 1M)
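
To illustrate where n.cores does come into play, here is a small sketch with cross-validation turned on. It assumes the gbm package is installed; the data (toy) and the parameter values are made up and much smaller than the benchmark settings:

```r
library(gbm)

set.seed(123)
# Toy data, made up for illustration (not the airline data)
n <- 2000
toy <- data.frame(
  x1 = rnorm(n),
  x2 = factor(sample(c("a", "b", "c"), n, replace = TRUE))
)
toy$y <- rbinom(n, 1, plogis(toy$x1 + (toy$x2 == "a")))

# With cv.folds > 1 the CV folds can be sent to different cores;
# without cv.folds the fit itself stays on one core
md_cv <- gbm(y ~ ., data = toy, distribution = "bernoulli",
             n.trees = 100, interaction.depth = 3, shrinkage = 0.05,
             n.minobsinnode = 10, bag.fraction = 0.5,
             cv.folds = 3, n.cores = 2)

# Number of trees with the lowest cross-validated error
best_iter <- gbm.perf(md_cv, method = "cv", plot.it = FALSE)
print(best_iter)
```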

H2O was 900s, xgboost 400s - on 32 cores.

If you run multiple models, you can run them in parallel with gbm, whereas with h2o/xgb a single run already saturates all your cores, so you would just run them one after another. So for that use case gbm is not that bad.
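
A rough sketch of that use case, assuming the gbm package is installed and using the base parallel package: several single-core gbm fits are spread over worker processes. The toy data, the shrinkage grid and the choice of 2 workers are all made up for illustration:

```r
library(parallel)
library(gbm)

set.seed(123)
# Toy data, made up for illustration (not the airline data)
n <- 2000
toy <- data.frame(
  x1 = rnorm(n),
  x2 = factor(sample(c("a", "b", "c"), n, replace = TRUE))
)
toy$y <- rbinom(n, 1, plogis(toy$x1 + (toy$x2 == "a")))

# Hypothetical grid: one single-core gbm fit per shrinkage value
shrinkages <- c(0.01, 0.05, 0.1)

# Start 2 worker processes and give each the package and the data
cl <- makeCluster(2)
clusterEvalQ(cl, library(gbm))
clusterExport(cl, "toy")

# Each worker fits its own model; the fits run side by side
models <- parLapply(cl, shrinkages, function(s) {
  gbm(y ~ ., data = toy, distribution = "bernoulli",
      n.trees = 100, interaction.depth = 3, shrinkage = s,
      n.minobsinnode = 10, bag.fraction = 0.5)
})
stopCluster(cl)

print(length(models))
```

Each worker holds its own copy of the data, so with the 1M-row training set memory use would grow with the number of workers.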

gideonkiplagat commented 2 years ago

[image] I am still a beginner in R and I can't understand the error. Could anyone kindly explain it to me?