szilard opened this issue 3 years ago
training: xgboost dense 0.17 sec/tree, agtboost 11.1 sec/tree (~60x slower)
scoring (100K records): xgboost sparse 0.13 sec, xgboost dense 1.26 sec, agtboost 21.3 sec
AUC: untuned xgboost 0.7324224, agtboost 0.7247399
From the project's README:
Any help on the following subjects is especially welcome:
- Utilizing sparsity (possibly Eigen sparsity).
- Parallelization (CPU and/or GPU).
@szilard first of all thank you for testing aGTBoost!
The short reply to this is that aGTBoost is not yet optimized for speed.
The focus of aGTBoost has been to develop an information criterion for gradient tree boosting that should not incur too much overhead, so as to remove the (computational and code) burdens of cross-validation for tuning of hyperparameters. The research has resulted in these two papers:
In practice, `nrounds`, `max_depth` and `gamma` are removed entirely, which admits the computational benefits of aGTBoost. Further research is coming on how to automatically tune regularization of the objective (e.g. `lambda` in xgboost) and stochastic sampling, and perhaps a non-exact greedy algorithm.
As of now, however, aGTBoost should in principle be equivalent to dense deterministic xgboost with `tree_method="exact"`, `nthread=1` and `lambda=0`, with optimal selection of `nrounds`, `max_depth` and `gamma`. See e.g. Tables 3, 5 and 6 of the former paper.
In practice it is not, for large datasets, and these faults are my own (not the theory's). Any failure to reach the speed of the following xgboost configuration is due to shortcomings in my code:
system.time({
md <- xgb.train(data = dxgb_train,
objective = "binary:logistic",
nround = 100, max_depth = 10, eta = 0.01, lambda=0, nthread=1,
tree_method = "exact")
})
For non-large datasets, the computation times should be fairly similar from what I have measured, but with the advantage that aGTBoost does not require any tuning. Still, this is a research project.
Automatic tuning of e.g. `lambda` and stochastic sampling are coming. Hopefully I will also be able to implement sparsity and parallelization, or get some help from the amazing community to do this. Or the theory may be implemented entirely by someone else. :)
I am very happy that you have tested aGTBoost, and would greatly appreciate it if this issue could be kept open, for me to get some more time to implement the above mentioned functionality (or for others to see this and provide assistance), and then test again.
Thank you!
Great work @Blunde1 and thanks for explanations.
Absolutely, I'm happy to keep this issue open so we can post findings and updated results here.
Thanks for clarifying that aGTBoost is so far similar to `tree_method = "exact"`. Therefore, here is a quick run of xgb exact for comparison:
system.time({
md <- xgb.train(data = dxgb_train,
objective = "binary:logistic",
nround = 100, max_depth = 10, eta = 0.1,
tree_method = "exact")
})
# user system elapsed
# 629.899 0.516 81.480
# AUC: 0.7330478
system.time({
md <- xgb.train(data = dxgb_train,
objective = "binary:logistic",
nround = 100, max_depth = 10, eta = 0.1,
nthread = 1,
tree_method = "exact")
})
# user system elapsed
# 327.831 0.089 327.894
Btw, the server I'm testing on is a modest AWS instance with 8 cores, and I'm using a relatively old version of xgboost (0.90.0.2).
So using histogram-based splitting instead of exact will speed up training (even more for larger datasets). Not sure if aGTBoost can be amended to do something similar.
And of course sparsity and parallelization would speed things up as well, see my comments on the already opened issues https://github.com/Blunde1/agtboost/issues/18 and https://github.com/Blunde1/agtboost/issues/24
Looking at the complexity of trees:
aGTBoost:
md$get_num_leaves()
[1] 2 2 2 2 2 2 2 6 9 16 6 2 4 6 4 15 8 2 10 18 8 7 33 6 22 2 6 16 6 28 18 21 21 29
[35] 6 27 3 23 15 24 7 17 18 7 28 15 6 13 4 23 2 13 12 13 2 6 4 19 14 18 31 5 45 3 16 3 13 4
[69] 2 3 2 7 2 2 2 11 2 4 3 2 4 2 8 7 2 2 4 2 3 2 2 2 2 2 3 2 2 2 2 2 2 2
[103] 2 2 4 2 2 2 2 3 2 2 2 2 2 2 2 2 5 2 4 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2
[137] 2 2 2 2 2 2 2 3 4 2 2
xgboost:
md_txt <- xgb.dump(md)
diff(grep("booster",md_txt[grep("leaf|booster",md_txt)]))-1
[1] 400 376 383 385 392 391 378 395 364 382 376 385 372 349 354 350 331 356 329 314 330 311 337 315 291 301
[27] 276 272 290 299 278 263 242 193 324 273 249 170 261 266 196 272 143 256 209 138 169 162 229 177 186 173
[53] 166 177 153 193 127 140 202 108 162 199 157 189 175 160 117 102 203 63 120 52 156 60 84 90 55 33
[79] 51 164 102 50 119 119 165 69 70 46 86 28 135 92 91 105 55 134 32 110 127
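For readability, the one-liner above can be unpacked; note that `diff()` over the positions of k `booster` markers yields k-1 gaps, so it drops the count for the final tree (99 values for 100 trees). A hedged sketch, assuming `md_txt` from `xgb.dump` as above:

```r
# Keep only the "booster" header lines and "leaf" lines of the dump;
# the number of leaf lines between consecutive booster markers is the
# number of leaves in that tree.
lines_kept  <- md_txt[grep("leaf|booster", md_txt)]
booster_pos <- grep("booster", lines_kept)
leaves_per_tree <- diff(booster_pos) - 1   # drops the final tree
# To include the last tree as well, append a sentinel past the end:
leaves_all <- diff(c(booster_pos, length(lines_kept) + 1)) - 1
```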
Looks like the aGTBoost trees are very shallow, yet they achieve relatively good accuracy: AUC 0.724 vs 0.732 for xgboost. Any comments on this @Blunde1? (UPDATE: also see the lightgbm results below; AUC is not bad even for shallow trees)
Full code here:
library(data.table)
library(ROCR)
library(xgboost)
library(agtboost)
library(Matrix)
set.seed(123)
d_train <- fread("https://github.com/szilard/benchm-ml--data/raw/master/train-0.1m.csv")
d_test <- fread("https://github.com/szilard/benchm-ml--data/raw/master/test.csv")
X_train_test <- model.matrix(dep_delayed_15min ~ .-1, data = rbind(d_train, d_test))
n1 <- nrow(d_train)
n2 <- nrow(d_test)
X_train <- X_train_test[1:n1,]
X_test <- X_train_test[(n1+1):(n1+n2),]
y_train <- ifelse(d_train$dep_delayed_15min=='Y',1,0)
## xgboost
dxgb_train <- xgb.DMatrix(data = X_train, label = y_train)
system.time({
md <- xgb.train(data = dxgb_train,
objective = "binary:logistic",
nround = 100, max_depth = 10, eta = 0.1,
tree_method = "hist")
})
phat <- predict(md, newdata = X_test)
rocr_pred <- prediction(phat, d_test$dep_delayed_15min)
cat(performance(rocr_pred, "auc")@y.values[[1]],"\n")
md_txt <- xgb.dump(md)
diff(grep("booster",md_txt[grep("leaf|booster",md_txt)]))-1
## agtboost
system.time({
md <- gbt.train(y = y_train, x = X_train, loss_function = "logloss",
learning_rate = 0.1, nrounds = 10000, verbose = 1)
})
phat <- predict(md, X_test)
rocr_pred <- prediction(phat, d_test$dep_delayed_15min)
cat(performance(rocr_pred, "auc")@y.values[[1]],"\n")
md$get_num_leaves()
For lightgbm:
md_dt[,.(leaves=sum(is.na(split_index))),by=tree_index]$leaves
[1] 325 290 306 301 325 303 318 324 311 305 276 271 292 284 267 239 206 233 245 225 228 224 210 208 176 251
[27] 195 212 202 157 198 181 189 147 128 143 151 153 155 163 146 176 138 126 134 84 115 112 142 96 57 81
[53] 111 136 102 70 67 72 99 105 42 96 50 109 146 89 139 27 62 120 141 50 87 107 113 72 36 111
[79] 116 136 100 107 93 33 84 134 53 102 25 87 56 104 115 49 58 92 112 64 70 40
dlgb_train <- lgb.Dataset(data = X_train, label = ifelse(d_train$dep_delayed_15min=='Y',1,0))
system.time({
md <- lgb.train(data = dlgb_train,
objective = "binary",
nrounds = 100, num_leaves = 10000, max_depth = 10, learning_rate = 0.1,
verbose = 0)
})
phat <- predict(md, data = X_test)
rocr_pred <- prediction(phat, d_test$dep_delayed_15min)
cat(performance(rocr_pred, "auc")@y.values[[1]],"\n")
md_dt <- lgb.model.dt.tree(md)
md_dt[,.(leaves=sum(is.na(split_index))),by=tree_index]$leaves
or as usually used, without `max_depth`:
system.time({
md <- lgb.train(data = dlgb_train,
objective = "binary",
nrounds = 100, num_leaves = 512, learning_rate = 0.1,
verbose = 0)
})
phat <- predict(md, data = X_test)
rocr_pred <- prediction(phat, d_test$dep_delayed_15min)
cat(performance(rocr_pred, "auc")@y.values[[1]],"\n")
md_dt <- lgb.model.dt.tree(md)
md_dt[,.(leaves=sum(is.na(split_index))),by=tree_index]$leaves
md_dt[,.(leaves=sum(is.na(split_index))),by=tree_index]$leaves
[1] 512 512 512 512 512 512 512 512 512 512 512 512 512 512 512 512 512 512 512 512 512 512 512 512 512 512
[27] 512 512 512 512 512 512 512 512 512 512 512 512 512 512 512 512 512 512 512 512 512 512 512 512 512 512
[53] 512 512 512 512 512 512 512 512 512 512 512 512 512 512 512 512 512 512 512 512 512 512 512 512 512 512
[79] 512 512 512 512 512 512 512 512 512 512 512 512 512 512 512 512 512 512 512 512 512 512
Restricting `num_leaves` to pretty low values:
num_leaves | AUC |
---|---|
1000 | 0.7276909 |
100 | 0.7343772 |
10 | 0.7243808 |
4 | 0.7155241 |
2 | 0.7015478 |
So AUC is similar to aGTBoost's when `num_leaves=10`.
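For reference, a table like the one above can be produced with a simple sweep. A hedged sketch, assuming `dlgb_train`, `X_test` and `d_test` from the full code earlier (exact AUC values will depend on the lightgbm version and seed):

```r
library(lightgbm)
library(ROCR)
# Sweep num_leaves and report the test AUC for each setting
for (nl in c(1000, 100, 10, 4, 2)) {
  md <- lgb.train(data = dlgb_train,
                  objective = "binary",
                  nrounds = 100, num_leaves = nl, learning_rate = 0.1,
                  verbose = 0)
  phat <- predict(md, data = X_test)
  rocr_pred <- prediction(phat, d_test$dep_delayed_15min)
  cat("num_leaves =", nl,
      "AUC =", performance(rocr_pred, "auc")@y.values[[1]], "\n")
}
```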
@szilard I will try to comment on the above results in the context of aGTBoost.
First of all, it is worth mentioning that there are two flavours of aGTBoost, which differ in how they employ the information criterion from the two referenced papers.
They can be selected with `algorithm="global_subset"` (the default) or `algorithm="vanilla"`.
The default method gives the shallowest trees, but with almost identical performance to the other. With the vanilla method, you would (probably) have seen that the first trees are rather deep but then become smaller and smaller (as less and less information remains to be learned from the training data), until they are mere tree stumps at the end. The differences are explained in the latter paper; the vanilla method is used in the former.
However, both of these methods employ the mentioned information criterion. It is essentially a (rather tight, and for the important cases usually exact) bound, and its use typically favours "sparse" models over complex ones.
In particular, the implementation of the information criterion contains an independence assumption, which in certain cases (when the design matrix is large and features are highly correlated) results in models that could benefit a little from slightly higher complexity. This seems to be the case for the above tests: the design matrix has 683 feature columns, which are likely highly correlated since they result from one-hot encoding of only 8 features.
Several things on the research side are likely to work against this: optimal stochastic sampling (which will allow higher complexity due to the averaging, or "bootstrap integration", of randomness), optimal L2 regularization, and obviously modelling the dependence effects directly.
Again, an example in code. This is pretty much the optimal setup for the current agtboost implementation:
# Load
library(agtboost)
library(xgboost)
# Dimensions
n <- 1000
m <- 1000
# Generate data
x_tr <- cbind(runif(n,-5,5), matrix(rnorm(n*(m-1)), nrow=n, ncol=m-1))
x_te <- cbind(runif(n,-5,5), matrix(rnorm(n*(m-1)), nrow=n, ncol=m-1))
y_tr <- rnorm(n, x_tr[,1]^2, 2) # Only the first feature is significant; the other 999 are noise
y_te <- rnorm(n, x_te[,1]^2, 2)
# Training
mod <- gbt.train(y_tr, x_tr, learning_rate = 0.01, verbose=10)
# Predict
pred <- predict(mod, x_te)
plot(x_te[,1], y_te, xlab="relevant feature", ylab="response", main="Test data")
points(x_te[,1], pred, col=2)
# Loss
mean((y_te - pred)^2)
# [1] 4.635563
# xgb -- this should really be tuned
dxgb_train <- xgb.DMatrix(data = x_tr, label = y_tr)
md <- xgb.train(data = dxgb_train,
nround = 1000, max_depth = 10, eta = 0.01,
tree_method = "exact")
# xgb pred
pred_xgb <- predict(md, x_te)
points(x_te[,1], pred_xgb, col=3)
legend("bottomleft",c("agtboost","xgb"), col=c(2,3), pch=c(1,1))
# xgb loss
mean((y_te - pred_xgb)^2)
# [1] 4.864288
Of course, most real data has lots of dependence, which implies that an agtboost model might need a little more complexity for optimal predictions (but not much). On the positive side, it is unlikely to overfit, and the underfit will be very small. And, again, more research and corresponding implementations for aGTBoost are coming on exactly this.
It would be very interesting to see what an optimally tuned xgboost model would do in the above example. My bet is that due to stochastic sampling and L2 regularization it might improve slightly on aGTBoost, but that this should be dominated by aGTBoost after the coming releases. Any thoughts on this?
Something I forgot to mention in the previous post, but which is relevant: The scheduled updates for aGTBoost are in the following order (as long as I am the only one working on it):
Papers on the above points will of course also follow...
Thank you @Blunde1 for insights into aGTBoost.
I tried `algorithm="vanilla"` on the airline data, and indeed it shows "the first trees are rather deep, but then becomes smaller and smaller...":
it: 1 | n-leaves: 178 | tr loss: 0.476 | gen loss: 0.477
it: 2 | n-leaves: 104 | tr loss: 0.468 | gen loss: 0.4699
it: 3 | n-leaves: 137 | tr loss: 0.4616 | gen loss: 0.4642
it: 4 | n-leaves: 68 | tr loss: 0.457 | gen loss: 0.4599
it: 5 | n-leaves: 74 | tr loss: 0.4528 | gen loss: 0.4562
it: 6 | n-leaves: 54 | tr loss: 0.4495 | gen loss: 0.4533
it: 7 | n-leaves: 54 | tr loss: 0.4466 | gen loss: 0.4507
it: 8 | n-leaves: 67 | tr loss: 0.4439 | gen loss: 0.4484
.......
it: 23 | n-leaves: 17 | tr loss: 0.4285 | gen loss: 0.4364
it: 24 | n-leaves: 12 | tr loss: 0.4282 | gen loss: 0.4362
it: 25 | n-leaves: 14 | tr loss: 0.4279 | gen loss: 0.436
it: 26 | n-leaves: 28 | tr loss: 0.4274 | gen loss: 0.4357
it: 27 | n-leaves: 8 | tr loss: 0.4272 | gen loss: 0.4356
it: 28 | n-leaves: 12 | tr loss: 0.4269 | gen loss: 0.4354
it: 29 | n-leaves: 2 | tr loss: 0.4269 | gen loss: 0.4354
it: 30 | n-leaves: 6 | tr loss: 0.4268 | gen loss: 0.4354
it: 31 | n-leaves: 9 | tr loss: 0.4266 | gen loss: 0.4353
it: 32 | n-leaves: 3 | tr loss: 0.4265 | gen loss: 0.4352
it: 33 | n-leaves: 14 | tr loss: 0.4263 | gen loss: 0.4351
it: 34 | n-leaves: 7 | tr loss: 0.4262 | gen loss: 0.435
it: 35 | n-leaves: 7 | tr loss: 0.426 | gen loss: 0.435
it: 36 | n-leaves: 2 | tr loss: 0.426 | gen loss: 0.435
it: 37 | n-leaves: 4 | tr loss: 0.426 | gen loss: 0.4349
it: 38 | n-leaves: 2 | tr loss: 0.4259 | gen loss: 0.4349
it: 39 | n-leaves: 2 | tr loss: 0.4259 | gen loss: 0.4349
.......
it: 113 | n-leaves: 2 | tr loss: 0.424 | gen loss: 0.4341
it: 114 | n-leaves: 2 | tr loss: 0.424 | gen loss: 0.4341
it: 115 | n-leaves: 2 | tr loss: 0.424 | gen loss: 0.4341
it: 116 | n-leaves: 2 | tr loss: 0.4239 | gen loss: 0.4341
it: 117 | n-leaves: 2 | tr loss: 0.4239 | gen loss: 0.4341
The code and training time:
system.time({
md <- gbt.train(y = y_train, x = X_train, loss_function = "logloss",
algorithm="vanilla",
learning_rate = 0.1, nrounds = 10000, verbose = 1)
})
phat <- predict(md, X_test)
rocr_pred <- prediction(phat, d_test$dep_delayed_15min)
cat(performance(rocr_pred, "auc")@y.values[[1]],"\n")
md$get_num_leaves()
user system elapsed
1469.704 0.403 1469.800
0.7251307
It's good to have your understanding of the algo, thanks again for input.
Thanks also for the simulation example above. I was looking at that and shared with @Laurae2 ("It would be very interesting to see what an optimally tuned xgboost model would do in the above example.") and he'll answer here himself.
I understand your priorities for the next updates, and it makes perfect sense in the academic setting you are in (obviously you want to do research and finish the PhD etc.) However, if the goal was (say later) for aGTBoost to become used widely by practitioners, then it would need to be fast/efficient (enough) to be able to run competitively with xgboost/lightgbm on datasets of say 100K-1M records (apart from some niche use cases with small data, say ~1000 records e.g. in the medical field or so). I'm not saying this must be the goal, it is perfectly OK to keep it a research project, heck, it's already a very good one.
Dumb "tuning" abusing knowledge of only 1 feature is relevant allows us to choose a proper more maximum depth of 2 instead of an absurd 10, which leads us out of the box to a better result than agtboost.
Algorithm | Squared error | Timing (3.7 GHz, 72 threads) |
---|---|---|
agtboost global_subset | 4.398976 | (Singlethreaded) 102.818 seconds |
agtboost vanilla | 4.375327 | (Singlethreaded) 130.814 seconds |
xgboost untuned | 4.348369 | (Multithreaded) 3.205 seconds |
Timings using a server with Dual Xeon 6154 (fixed 3.7 GHz, 36 physical cores) and 768 GB RAM (2666 MHz)
xgboost has no validation set and is untuned, and thus may perform better.
Note that the original example has no random seed set for reproducibility. I've included the example using seed 1 below.
# Load
library(agtboost)
library(xgboost)
# Dimensions
n <- 1000
m <- 1000
set.seed(1)
# Generate data
x_tr <- cbind(runif(n, -5, 5), matrix(rnorm(n * (m - 1)), nrow = n, ncol = m - 1))
x_te <- cbind(runif(n, -5, 5), matrix(rnorm(n * (m - 1)), nrow = n, ncol = m - 1))
y_tr <- rnorm(n, x_tr[,1]^2, 2) # Only the first feature is significant; the other 999 are noise
y_te <- rnorm(n, x_te[,1]^2, 2)
# Training agtboost normal
# it: 430 | n-leaves: 2 | tr loss: 4.064 | gen loss: 7.901
# user system elapsed
# 102.843 0.000 102.818
system.time({mod <- gbt.train(y_tr, x_tr, learning_rate = 0.01, verbose = 10)})
# Predict
pred <- predict(mod, x_te)
mean((y_te - pred)^2)
# [1] 4.398976
# Training agtboost vanilla
# it: 380 | n-leaves: 2 | tr loss: 4.045 | gen loss: 8.565
# user system elapsed
# 130.827 0.000 130.814
system.time({mod_van <- gbt.train(y_tr, x_tr, learning_rate = 0.01, verbose = 10, algorithm = "vanilla")})
# Predict
pred_van <- predict(mod_van, x_te)
mean((y_te - pred_van)^2)
# [1] 4.375327
# xgb with no validation set
# user system elapsed
# 230.341 0.000 3.205
dxgb_train <- xgb.DMatrix(data = x_tr, label = y_tr)
system.time({md <- xgb.train(data = dxgb_train,
nround = 1000, max_depth = 2, eta = 0.01,
verbose = 1, print_every_n = 10,
tree_method = "exact", nthread = 72)})
# xgb pred
pred_xgb <- predict(md, x_te)
mean((y_te - pred_xgb)^2)
# [1] 4.348369
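To make the xgboost run a bit fairer, one could hold out part of the training data as a validation set and use early stopping. A hedged sketch continuing from the variables above (`watchlist` and `early_stopping_rounds` as in the xgboost R API; results will vary with the split):

```r
# Hold out 20% of the training rows as a validation set for early stopping
idx <- sample(n, 0.8 * n)
dtr <- xgb.DMatrix(data = x_tr[idx, ], label = y_tr[idx])
dva <- xgb.DMatrix(data = x_tr[-idx, ], label = y_tr[-idx])
md_es <- xgb.train(data = dtr,
                   nround = 5000, max_depth = 2, eta = 0.01,
                   watchlist = list(valid = dva),
                   early_stopping_rounds = 50, verbose = 0,
                   tree_method = "exact")
# Evaluate on the untouched test set
pred_es <- predict(md_es, x_te)
mean((y_te - pred_es)^2)
```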
New implementation aGTBoost https://github.com/Blunde1/agtboost