Closed: tobigithub closed this issue 7 years ago.
Yes, `Month` is of character type. Maybe `read_csv` used to read it as factor, or `gbm` used to work with characters. To make it work you can do `d_train$Month <- as.factor(d_train$Month)`, and the same for the other character columns and for `d_test`.

Btw, read this: https://github.com/szilard/benchm-ml/issues/35
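As a small sketch of the fix above (my addition, not code from the thread): instead of converting each column by hand, the character columns can be found programmatically. The data frame `df` and its values are hypothetical stand-ins for `d_train`/`d_test`:

```r
# Toy data frame standing in for d_train / d_test (hypothetical values)
df <- data.frame(Month = c("c-1", "c-2"), DepTime = c(615, 739),
                 stringsAsFactors = FALSE)

# Find every character column and convert it to factor in one pass
charCols <- names(df)[sapply(df, is.character)]
for (k in charCols) df[[k]] <- as.factor(df[[k]])
```

The same loop applied to both train and test frames replaces the hand-written per-column `as.factor` calls.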
Hi, GBM is usually a low/mid performer, so I don't really know if it's worth the effort to fix that error (see the GBM binary classifier performance results).
Just an observation, I am not a fisherman. It actually reminds me of DLL hell: any R library update breaks previously working R code. Also reminds me of unit testing at Microsoft.
Cheers, Tobias
As I said, `gbm` seems to need factors (now?); you can do:

```r
facCols <- c("UniqueCarrier", "Origin", "Dest", "Month", "DayofMonth", "DayOfWeek")
for (k in facCols) {
  dx_train[[k]] <- as.factor(dx_train[[k]])
  dx_test[[k]] <- as.factor(dx_test[[k]])
}
```

taken from here: https://github.com/szilard/benchm-ml/blob/master/z-other-tools/8a-Rborist.R (another package that needs that).
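One related gotcha worth noting (my addition, not from the thread): after converting with `as.factor`, the train and test sets can end up with different level sets, which can break `predict`. A minimal sketch of aligning the test levels to the training levels, with toy frames standing in for `dx_train`/`dx_test`:

```r
# Toy frames standing in for dx_train / dx_test (hypothetical values)
dx_train <- data.frame(Origin = factor(c("SFO", "JFK", "ORD")))
dx_test  <- data.frame(Origin = factor(c("JFK", "SFO")))

# Re-level every factor column in the test set to the training levels;
# test values unseen in training become NA rather than a new level
for (k in names(dx_train)) {
  if (is.factor(dx_train[[k]])) {
    dx_test[[k]] <- factor(dx_test[[k]], levels = levels(dx_train[[k]]))
  }
}
```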
Hi, it's awfully slow; I doubt anybody could ever run this in sequential mode. The `n.cores` setting does not do much, so it is not very efficient. Time: 3514.19 seconds on a quad-core CPU (20-fold slower than H2O GBM), with AUC = 0.7413561 (comparable to H2O GBM).
See the corrected code for gbm_2.1.1, from benchm-ml/3-boosting/1-gbm.R:

```r
library(readr)
library(ROCR)
library(gbm)

set.seed(123)

d_train <- read_csv("train-1m.csv")
d_test <- read_csv("test.csv")

# gbm's bernoulli distribution expects a 0/1 numeric response
d_train$dep_delayed_15min <- ifelse(d_train$dep_delayed_15min == "Y", 1, 0)
d_test$dep_delayed_15min <- ifelse(d_test$dep_delayed_15min == "Y", 1, 0)

# read_csv leaves these as character; gbm needs factors
facCols <- c("UniqueCarrier", "Origin", "Dest", "Month", "DayofMonth", "DayOfWeek")
numCols <- c("DepTime", "Distance")
for (k in facCols) {
  d_train[[k]] <- as.factor(d_train[[k]])
  d_test[[k]] <- as.factor(d_test[[k]])
}
for (k in numCols) {
  d_train[[k]] <- as.numeric(d_train[[k]])
  d_test[[k]] <- as.numeric(d_test[[k]])
}

system.time({
  md <- gbm(dep_delayed_15min ~ ., data = d_train, distribution = "bernoulli",
            n.trees = 1000,
            interaction.depth = 16, shrinkage = 0.01, n.minobsinnode = 1,
            bag.fraction = 0.5, n.cores = 32)
})

phat <- predict(md, newdata = d_test, n.trees = md$n.trees, type = "response")
rocr_pred <- prediction(phat, d_test$dep_delayed_15min)
performance(rocr_pred, "auc")@y.values[[1]]
```
You still ran it on one core (`n.cores` is only for cross-validation). Your result is comparable with https://github.com/szilard/benchm-ml#boosting-gradient-boosted-treesgradient-boosting-machines: ~3000s ("Time (s)" for R on the 1M dataset). H2O was 900s and xgboost 400s, on 32 cores.
If you run multiple models, you can run them in parallel with gbm, while with h2o/xgboost a single run already saturates all your cores, so you could just run them sequentially. For that use case gbm is not that bad.
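To make the "run multiple gbm models in parallel" point concrete, here is a minimal base-R sketch (my own illustration, not code from the repo): `fit_one` is a placeholder where a real `gbm()` call would go, and `parallel::mclapply` forks one worker per hyperparameter setting (fork-based, so Unix only):

```r
library(parallel)

# Placeholder for a real single-threaded fit, e.g.
# gbm(dep_delayed_15min ~ ., data = d_train, shrinkage = shrinkage, ...)
fit_one <- function(shrinkage) {
  Sys.sleep(0.1)  # stand-in for the expensive model fit
  list(shrinkage = shrinkage)
}

grid <- c(0.1, 0.05, 0.01)  # hypothetical shrinkage grid
models <- mclapply(grid, fit_one, mc.cores = min(length(grid), detectCores()))
```

Since each `gbm()` fit uses one core, a three-point grid on a quad-core box finishes in roughly the time of one fit, which is the sense in which gbm "is not that bad" here.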
I'm still a beginner in R and I can't understand the error. Kindly, could anyone explain it to me?
For gbm_2.1.1 and R 3.3.1 I get the following error for benchm-ml/3-boosting/1-gbm.R
Tobias