szilard opened this issue 3 years ago
training: xgboost dense 0.17 sec/tree, agtboost 11.1 sec/tree (~60x slower)
scoring (100K records): xgboost sparse 0.13 sec, xgboost dense 1.26 sec, agtboost 21.3 sec
AUC: untuned xgboost 0.7324224, agtboost 0.7247399
From the project's README:
Any help on the following subjects is especially welcome:
- Utilizing sparsity (possibly Eigen sparsity).
- Parallelization (CPU and/or GPU).
@szilard first of all thank you for testing aGTBoost!
The short reply to this is that aGTBoost is not yet optimized for speed.
The focus of aGTBoost has been to develop an information criterion for gradient tree boosting that should not incur too much overhead, so as to remove the (computational and code) burdens of cross-validation for tuning of hyperparameters. The research has resulted in these two papers:
In practice, `nrounds`, `max_depth` and `gamma` are removed entirely, which admits the computational benefits of aGTBoost. Further research is coming on how to automatically tune regularization of the objective (e.g. `lambda` in xgboost) and stochastic sampling, and perhaps a non-exact greedy algorithm.
As of now, however, aGTBoost should in principle be equivalent to dense deterministic xgboost with `tree_method="exact"`, `nthread=1` and `lambda=0`, with optimal selection of `nrounds`, `max_depth` and `gamma`. See e.g. Tables 3, 5 and 6 of the former paper.
In practice it is not, for large datasets, and these faults are my own (not the theory's). Any failure to reach the speed of the following xgboost configuration is due to shortcomings in my code:
system.time({
md <- xgb.train(data = dxgb_train,
objective = "binary:logistic",
nround = 100, max_depth = 10, eta = 0.01, lambda=0, nthread=1,
tree_method = "exact")
})
For non-large datasets, the computation times should be fairly similar from what I have measured, but with the advantage that aGTBoost does not require any tuning. Still, this is a research project.
Automatic tuning of e.g. `lambda` and stochastic sampling are coming. Hopefully I will also be able to implement sparsity and parallelization, or get some help from the amazing community to do this. Or the theory may be implemented entirely by someone else. :)
I am very happy that you have tested aGTBoost, and would greatly appreciate it if this issue could be kept open, for me to get some more time to implement the above mentioned functionality (or for others to see this and provide assistance), and then test again.
Thank you!
Great work @Blunde1 and thanks for explanations.
Absolutely, I'm happy to keep this issue open so we can post findings and updated results here.
Thanks for clarifying that aGTBoost is so far similar to `tree_method = "exact"`. Therefore, here is a quick run of xgb exact for comparison:
system.time({
md <- xgb.train(data = dxgb_train,
objective = "binary:logistic",
nround = 100, max_depth = 10, eta = 0.1,
tree_method = "exact")
})
# user system elapsed
# 629.899 0.516 81.480
# AUC: 0.7330478
system.time({
md <- xgb.train(data = dxgb_train,
objective = "binary:logistic",
nround = 100, max_depth = 10, eta = 0.1,
nthread = 1,
tree_method = "exact")
})
# user system elapsed
# 327.831 0.089 327.894
Btw, the server I'm testing on is a modest AWS instance with 8 cores, and I'm using a relatively old version of xgboost (0.90.0.2).
So using histogram-based splitting instead of exact will speed up training (even more for larger datasets). Not sure if aGTBoost can be amended to do something similar.
And of course sparsity and parallelization would speed things up as well, see my comments on the already opened issues https://github.com/Blunde1/agtboost/issues/18 and https://github.com/Blunde1/agtboost/issues/24
Looking at the complexity of trees:
aGTBoost:
md$get_num_leaves()
[1] 2 2 2 2 2 2 2 6 9 16 6 2 4 6 4 15 8 2 10 18 8 7 33 6 22 2 6 16 6 28 18 21 21 29
[35] 6 27 3 23 15 24 7 17 18 7 28 15 6 13 4 23 2 13 12 13 2 6 4 19 14 18 31 5 45 3 16 3 13 4
[69] 2 3 2 7 2 2 2 11 2 4 3 2 4 2 8 7 2 2 4 2 3 2 2 2 2 2 3 2 2 2 2 2 2 2
[103] 2 2 4 2 2 2 2 3 2 2 2 2 2 2 2 2 5 2 4 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2
[137] 2 2 2 2 2 2 2 3 4 2 2
xgboost:
md_txt <- xgb.dump(md)
diff(grep("booster",md_txt[grep("leaf|booster",md_txt)]))-1
[1] 400 376 383 385 392 391 378 395 364 382 376 385 372 349 354 350 331 356 329 314 330 311 337 315 291 301
[27] 276 272 290 299 278 263 242 193 324 273 249 170 261 266 196 272 143 256 209 138 169 162 229 177 186 173
[53] 166 177 153 193 127 140 202 108 162 199 157 189 175 160 117 102 203 63 120 52 156 60 84 90 55 33
[79] 51 164 102 50 119 119 165 69 70 46 86 28 135 92 91 105 55 134 32 110 127
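For readability, the one-liner above can be unpacked; note that `diff()` over the positions of k `booster` markers yields k-1 gaps, so it drops the count for the final tree (99 values for 100 trees). A hedged sketch, assuming `md_txt` from `xgb.dump` as above:

```r
# Keep only the "booster" header lines and "leaf" lines of the dump;
# the number of leaf lines between consecutive booster markers is the
# number of leaves in that tree.
lines_kept  <- md_txt[grep("leaf|booster", md_txt)]
booster_pos <- grep("booster", lines_kept)
leaves_per_tree <- diff(booster_pos) - 1   # drops the final tree
# To include the last tree as well, append a sentinel past the end:
leaves_all <- diff(c(booster_pos, length(lines_kept) + 1)) - 1
```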
Looks like the aGTBoost trees are very shallow, yet they achieve relatively good accuracy: AUC 0.724 vs 0.732 for xgboost. Any comments on this @Blunde1? (UPDATE: also see the lightgbm results below; AUC is not bad even for shallow trees)
Full code here:
library(data.table)
library(ROCR)
library(xgboost)
library(agtboost)
library(Matrix)
set.seed(123)
d_train <- fread("https://github.com/szilard/benchm-ml--data/raw/master/train-0.1m.csv")
d_test <- fread("https://github.com/szilard/benchm-ml--data/raw/master/test.csv")
X_train_test <- model.matrix(dep_delayed_15min ~ .-1, data = rbind(d_train, d_test))
n1 <- nrow(d_train)
n2 <- nrow(d_test)
X_train <- X_train_test[1:n1,]
X_test <- X_train_test[(n1+1):(n1+n2),]
y_train <- ifelse(d_train$dep_delayed_15min=='Y',1,0)
## xgboost
dxgb_train <- xgb.DMatrix(data = X_train, label = y_train)
system.time({
md <- xgb.train(data = dxgb_train,
objective = "binary:logistic",
nround = 100, max_depth = 10, eta = 0.1,
tree_method = "hist")
})
phat <- predict(md, newdata = X_test)
rocr_pred <- prediction(phat, d_test$dep_delayed_15min)
cat(performance(rocr_pred, "auc")@y.values[[1]],"\n")
md_txt <- xgb.dump(md)
diff(grep("booster",md_txt[grep("leaf|booster",md_txt)]))-1
## agtboost
system.time({
md <- gbt.train(y = y_train, x = X_train, loss_function = "logloss",
learning_rate = 0.1, nrounds = 10000, verbose = 1)
})
phat <- predict(md, X_test)
rocr_pred <- prediction(phat, d_test$dep_delayed_15min)
cat(performance(rocr_pred, "auc")@y.values[[1]],"\n")
md$get_num_leaves()
For lightgbm:
md_dt[,.(leaves=sum(is.na(split_index))),by=tree_index]$leaves
[1] 325 290 306 301 325 303 318 324 311 305 276 271 292 284 267 239 206 233 245 225 228 224 210 208 176 251
[27] 195 212 202 157 198 181 189 147 128 143 151 153 155 163 146 176 138 126 134 84 115 112 142 96 57 81
[53] 111 136 102 70 67 72 99 105 42 96 50 109 146 89 139 27 62 120 141 50 87 107 113 72 36 111
[79] 116 136 100 107 93 33 84 134 53 102 25 87 56 104 115 49 58 92 112 64 70 40
dlgb_train <- lgb.Dataset(data = X_train, label = ifelse(d_train$dep_delayed_15min=='Y',1,0))
system.time({
md <- lgb.train(data = dlgb_train,
objective = "binary",
nrounds = 100, num_leaves = 10000, max_depth = 10, learning_rate = 0.1,
verbose = 0)
})
phat <- predict(md, data = X_test)
rocr_pred <- prediction(phat, d_test$dep_delayed_15min)
cat(performance(rocr_pred, "auc")@y.values[[1]],"\n")
md_dt <- lgb.model.dt.tree(md)
md_dt[,.(leaves=sum(is.na(split_index))),by=tree_index]$leaves
or as usually used, without `max_depth`:
system.time({
md <- lgb.train(data = dlgb_train,
objective = "binary",
nrounds = 100, num_leaves = 512, learning_rate = 0.1,
verbose = 0)
})
phat <- predict(md, data = X_test)
rocr_pred <- prediction(phat, d_test$dep_delayed_15min)
cat(performance(rocr_pred, "auc")@y.values[[1]],"\n")
md_dt <- lgb.model.dt.tree(md)
md_dt[,.(leaves=sum(is.na(split_index))),by=tree_index]$leaves
md_dt[,.(leaves=sum(is.na(split_index))),by=tree_index]$leaves
[1] 512 512 512 512 512 512 512 512 512 512 512 512 512 512 512 512 512 512 512 512 512 512 512 512 512 512
[27] 512 512 512 512 512 512 512 512 512 512 512 512 512 512 512 512 512 512 512 512 512 512 512 512 512 512
[53] 512 512 512 512 512 512 512 512 512 512 512 512 512 512 512 512 512 512 512 512 512 512 512 512 512 512
[79] 512 512 512 512 512 512 512 512 512 512 512 512 512 512 512 512 512 512 512 512 512 512
Restricting `num_leaves` to pretty low values:
num_leaves | AUC |
---|---|
1000 | 0.7276909 |
100 | 0.7343772 |
10 | 0.7243808 |
4 | 0.7155241 |
2 | 0.7015478 |
So AUC is similar to aGTBoost's when `num_leaves=10`.
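For reference, a table like the one above can be produced with a simple sweep. A hedged sketch, assuming `dlgb_train`, `X_test` and `d_test` from the full code earlier (exact AUC values will depend on the lightgbm version and seed):

```r
library(lightgbm)
library(ROCR)
# Sweep num_leaves and report the test AUC for each setting
for (nl in c(1000, 100, 10, 4, 2)) {
  md <- lgb.train(data = dlgb_train,
                  objective = "binary",
                  nrounds = 100, num_leaves = nl, learning_rate = 0.1,
                  verbose = 0)
  phat <- predict(md, data = X_test)
  rocr_pred <- prediction(phat, d_test$dep_delayed_15min)
  cat("num_leaves =", nl,
      "AUC =", performance(rocr_pred, "auc")@y.values[[1]], "\n")
}
```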
@szilard I will try to comment on the above results in the context of aGTBoost.
First of all, it is worth mentioning that there are two flavours of aGTBoost, which differ in how they employ the information criterion from the two referenced papers.
They can be selected with `algorithm="global_subset"` (the default) or `algorithm="vanilla"`.
The default method gives the shallowest trees, but with almost identical performance to the other. With the vanilla method, you would (probably) have seen that the first trees are rather deep but then become smaller and smaller (as less and less information remains to be learned from the training data), until they are mere tree stumps at the end. The differences are explained in the latter paper; the vanilla method is used in the former.
However, both of these methods employ the mentioned information criterion. It is essentially a (rather tight, and for the important cases usually exact) bound, and its use typically favours "sparse" models over complex ones.
In particular, the implementation of the information criterion contains an independence assumption, which in certain cases (when the design matrix is large and features are highly correlated) results in models that could benefit a little from slightly higher complexity. This seems to be the case for the above tests: the design matrix has 683 feature columns, which are likely highly correlated since they result from one-hot encoding of only 8 features.
Several things on the research side are likely to work against this: optimal stochastic sampling (which will allow higher complexity due to the averaging, or "bootstrap integration", of randomness), optimal L2 regularization, and obviously modelling the dependence effects directly.
Again, an example in code. This is pretty much the optimal setup for the current agtboost implementation:
# Load
library(agtboost)
library(xgboost)
# Dimensions
n <- 1000
m <- 1000
# Generate data
x_tr <- cbind(runif(n,-5,5), matrix(rnorm(n*(m-1)), nrow=n, ncol=m-1))
x_te <- cbind(runif(n,-5,5), matrix(rnorm(n*(m-1)), nrow=n, ncol=m-1))
y_tr <- rnorm(n, x_tr[,1]^2, 2) # Only the first feature is significant; the other 999 are noise
y_te <- rnorm(n, x_te[,1]^2, 2)
# Training
mod <- gbt.train(y_tr, x_tr, learning_rate = 0.01, verbose=10)
# Predict
pred <- predict(mod, x_te)
plot(x_te[,1], y_te, xlab="relevant feature", ylab="response", main="Test data")
points(x_te[,1], pred, col=2)
# Loss
mean((y_te - pred)^2)
# [1] 4.635563
# xgb -- this should really be tuned
dxgb_train <- xgb.DMatrix(data = x_tr, label = y_tr)
md <- xgb.train(data = dxgb_train,
nround = 1000, max_depth = 10, eta = 0.01,
tree_method = "exact")
# xgb pred
pred_xgb <- predict(md, x_te)
points(x_te[,1], pred_xgb, col=3)
legend("bottomleft",c("agtboost","xgb"), col=c(2,3), pch=c(1,1))
# xgb loss
mean((y_te - pred_xgb)^2)
# [1] 4.864288
Of course, most real data has lots of dependence, which implies that an agtboost model might need a little more complexity for optimal predictions (but not much). On the positive side, it is unlikely to overfit, and the underfit will be very small. And, again, more research and corresponding implementations for aGTBoost are coming on exactly this.
It would be very interesting to see what an optimally tuned xgboost model would do in the above example. My bet is that due to stochastic sampling and L2 regularization it might improve slightly on aGTBoost, but that this should be dominated by aGTBoost after the coming releases. Any thoughts on this?
Something I forgot to mention in the previous post, but which is relevant: The scheduled updates for aGTBoost are in the following order (as long as I am the only one working on it):
Papers on the above points will of course also follow...
Thank you @Blunde1 for insights into aGTBoost.
I tried `algorithm="vanilla"` on the airline data, and indeed it shows "the first trees are rather deep, but then becomes smaller and smaller...":
it: 1 | n-leaves: 178 | tr loss: 0.476 | gen loss: 0.477
it: 2 | n-leaves: 104 | tr loss: 0.468 | gen loss: 0.4699
it: 3 | n-leaves: 137 | tr loss: 0.4616 | gen loss: 0.4642
it: 4 | n-leaves: 68 | tr loss: 0.457 | gen loss: 0.4599
it: 5 | n-leaves: 74 | tr loss: 0.4528 | gen loss: 0.4562
it: 6 | n-leaves: 54 | tr loss: 0.4495 | gen loss: 0.4533
it: 7 | n-leaves: 54 | tr loss: 0.4466 | gen loss: 0.4507
it: 8 | n-leaves: 67 | tr loss: 0.4439 | gen loss: 0.4484
.......
it: 23 | n-leaves: 17 | tr loss: 0.4285 | gen loss: 0.4364
it: 24 | n-leaves: 12 | tr loss: 0.4282 | gen loss: 0.4362
it: 25 | n-leaves: 14 | tr loss: 0.4279 | gen loss: 0.436
it: 26 | n-leaves: 28 | tr loss: 0.4274 | gen loss: 0.4357
it: 27 | n-leaves: 8 | tr loss: 0.4272 | gen loss: 0.4356
it: 28 | n-leaves: 12 | tr loss: 0.4269 | gen loss: 0.4354
it: 29 | n-leaves: 2 | tr loss: 0.4269 | gen loss: 0.4354
it: 30 | n-leaves: 6 | tr loss: 0.4268 | gen loss: 0.4354
it: 31 | n-leaves: 9 | tr loss: 0.4266 | gen loss: 0.4353
it: 32 | n-leaves: 3 | tr loss: 0.4265 | gen loss: 0.4352
it: 33 | n-leaves: 14 | tr loss: 0.4263 | gen loss: 0.4351
it: 34 | n-leaves: 7 | tr loss: 0.4262 | gen loss: 0.435
it: 35 | n-leaves: 7 | tr loss: 0.426 | gen loss: 0.435
it: 36 | n-leaves: 2 | tr loss: 0.426 | gen loss: 0.435
it: 37 | n-leaves: 4 | tr loss: 0.426 | gen loss: 0.4349
it: 38 | n-leaves: 2 | tr loss: 0.4259 | gen loss: 0.4349
it: 39 | n-leaves: 2 | tr loss: 0.4259 | gen loss: 0.4349
.......
it: 113 | n-leaves: 2 | tr loss: 0.424 | gen loss: 0.4341
it: 114 | n-leaves: 2 | tr loss: 0.424 | gen loss: 0.4341
it: 115 | n-leaves: 2 | tr loss: 0.424 | gen loss: 0.4341
it: 116 | n-leaves: 2 | tr loss: 0.4239 | gen loss: 0.4341
it: 117 | n-leaves: 2 | tr loss: 0.4239 | gen loss: 0.4341
The code and training time:
system.time({
md <- gbt.train(y = y_train, x = X_train, loss_function = "logloss",
algorithm="vanilla",
learning_rate = 0.1, nrounds = 10000, verbose = 1)
})
phat <- predict(md, X_test)
rocr_pred <- prediction(phat, d_test$dep_delayed_15min)
cat(performance(rocr_pred, "auc")@y.values[[1]],"\n")
md$get_num_leaves()
user system elapsed
1469.704 0.403 1469.800
0.7251307
It's good to have your understanding of the algo, thanks again for input.
Thanks also for the simulation example above. I was looking at that and shared with @Laurae2 ("It would be very interesting to see what an optimally tuned xgboost model would do in the above example.") and he'll answer here himself.
I understand your priorities for the next updates, and it makes perfect sense in the academic setting you are in (obviously you want to do research and finish the PhD etc.) However, if the goal was (say later) for aGTBoost to become used widely by practitioners, then it would need to be fast/efficient (enough) to be able to run competitively with xgboost/lightgbm on datasets of say 100K-1M records (apart from some niche use cases with small data, say ~1000 records e.g. in the medical field or so). I'm not saying this must be the goal, it is perfectly OK to keep it a research project, heck, it's already a very good one.
Dumb "tuning" abusing knowledge of only 1 feature is relevant allows us to choose a proper more maximum depth of 2 instead of an absurd 10, which leads us out of the box to a better result than agtboost.
Algorithm | Squared error | Timing (3.7 GHz, 72 threads) |
---|---|---|
agtboost global_subset | 4.398976 | (Singlethreaded) 102.818 seconds |
agtboost vanilla | 4.375327 | (Singlethreaded) 130.814 seconds |
xgboost untuned | 4.348369 | (Multithreaded) 3.205 seconds |
Timings using a server with Dual Xeon 6154 (fixed 3.7 GHz, 36 physical cores) and 768 GB RAM (2666 MHz)
xgboost has no validation set and is untuned, and thus may perform better.
Note that the original example has no random seed set for reproducibility. I've included the example using seed 1 below.
# Load
library(agtboost)
library(xgboost)
# Dimensions
n <- 1000
m <- 1000
set.seed(1)
# Generate data
x_tr <- cbind(runif(n, -5, 5), matrix(rnorm(n * (m - 1)), nrow = n, ncol = m - 1))
x_te <- cbind(runif(n, -5, 5), matrix(rnorm(n * (m - 1)), nrow = n, ncol = m - 1))
y_tr <- rnorm(n, x_tr[,1]^2, 2) # Only the first feature is significant; the other 999 are noise
y_te <- rnorm(n, x_te[,1]^2, 2)
# Training agtboost normal
# it: 430 | n-leaves: 2 | tr loss: 4.064 | gen loss: 7.901
# user system elapsed
# 102.843 0.000 102.818
system.time({mod <- gbt.train(y_tr, x_tr, learning_rate = 0.01, verbose = 10)})
# Predict
pred <- predict(mod, x_te)
mean((y_te - pred)^2)
# [1] 4.398976
# Training agtboost vanilla
# it: 380 | n-leaves: 2 | tr loss: 4.045 | gen loss: 8.565
# user system elapsed
# 130.827 0.000 130.814
system.time({mod_van <- gbt.train(y_tr, x_tr, learning_rate = 0.01, verbose = 10, algorithm = "vanilla")})
# Predict
pred_van <- predict(mod_van, x_te)
mean((y_te - pred_van)^2)
# [1] 4.375327
# xgb with no validation set
# user system elapsed
# 230.341 0.000 3.205
dxgb_train <- xgb.DMatrix(data = x_tr, label = y_tr)
system.time({md <- xgb.train(data = dxgb_train,
nround = 1000, max_depth = 2, eta = 0.01,
verbose = 1, print_every_n = 10,
tree_method = "exact", nthread = 72)})
# xgb pred
pred_xgb <- predict(md, x_te)
mean((y_te - pred_xgb)^2)
# [1] 4.348369
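To make the xgboost run a bit fairer, one could hold out part of the training data as a validation set and use early stopping. A hedged sketch continuing from the variables above (`watchlist` and `early_stopping_rounds` as in the xgboost R API; results will vary with the split):

```r
# Hold out 20% of the training rows as a validation set for early stopping
idx <- sample(n, 0.8 * n)
dtr <- xgb.DMatrix(data = x_tr[idx, ], label = y_tr[idx])
dva <- xgb.DMatrix(data = x_tr[-idx, ], label = y_tr[-idx])
md_es <- xgb.train(data = dtr,
                   nround = 5000, max_depth = 2, eta = 0.01,
                   watchlist = list(valid = dva),
                   early_stopping_rounds = 50, verbose = 0,
                   tree_method = "exact")
# Evaluate on the untouched test set
pred_es <- predict(md_es, x_te)
mean((y_te - pred_es)^2)
```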
New implementation aGTBoost https://github.com/Blunde1/agtboost