szilard opened this issue 6 years ago (status: Open)
This code runs (using catboost R package commit 7e4ba38 on AWS EC2 r4.8xlarge):
              training_time   AUC
1M records:   94 sec          0.7429
10M records:  1100 sec        0.7498
So it's slower than h2o/xgboost/lightgbm.
@catboost Can you guys tune the above catboost code to run significantly faster?
Is this the same speed benchmark as the front page for xgboost or lightgbm (R packages)?
Yes, should be comparable to the results from the main README in this repo.
@szilard Actually, we did tune the code to run much faster; we should now be faster than xgboost and on par with LightGBM. We are working on more speedups now.
@szilard We have also implemented GPU training. We compared on the Epsilon dataset: it's 2 times faster than LightGBM and 20 times faster than XGBoost. It would be nice to add catboost to the GPU benchmarks.
Thanks @annaveronika . Yeah, I talked to @sab (Sergey Brazhnik) at the NIPS conference in December, I should definitely run the benchmarks with the latest catboost version (and with the GPU version as well).
The CPU version is still 10x slower than lightgbm:

                               training_time   AUC
previous (1M records):         94 sec          0.7429
current version (1M records):  64 sec          0.7406
@annaveronika @sab The catboost code I'm running is this: https://github.com/szilard/GBM-perf/blob/9465eea3faf843e6133605c8e6341940da919c78/wip-testing/catboost/run.R
Anything wrong with that?
If you are running the latest version built from the code on github, then it is correct.
But if I understand correctly, you are running benchmarks on the airlines dataset - this is actually a dataset with 6 or 8 categorical features, so it's fair to use them as categorical. If you want one-hot encoding, you can set the one_hot_max_size option in catboost to some large number. Or just pass them as categorical.
LightGBM does a specific optimisation for this: it packs the binary features produced by one-hot encoding into one histogram, so it effectively works with the 8 original features, not with all the binary ones. We do the same when you use the one_hot_max_size option.
On regular datasets that are not one-hot encoded, for example on Epsilon, this difference is eliminated. We will also do this optimisation later.
But in any case, it's better even for catboost's quality to use categorical features as-is, not one-hot encoded.
Adding one_hot_max_size = 1000 gives:
Error in catboost.train(learn_pool = dx_train, test_pool = NULL, params = params) :
catboost/libs/options/cat_feature_options.h:164: Error in one_hot_max_size: maximum value of one-hot-encoding is 255
Then with one_hot_max_size = 250:
Error in catboost.train(learn_pool = dx_train, test_pool = NULL, params = params) :
util/generic/hash.h:1654: Key not found in hashtable: -2141854053
Timing stopped at: 448.5 32.24 48.02
So it errors out, but the elapsed time is still over 48 sec.
Yes, I forgot about the large one-hot-max-size - we didn't add it because, for quality, it is better to use statistics for categorical features with many values. Training with categorical features is slower, because catboost generates many feature combinations. We optimise for quality first of all, so we allow deep combinations. If you want speed, set max_ctr_complexity=1 and one_hot_max_size=255; if you want quality, use the defaults. We'll fix the bug and I'll get back to you.
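The speed-vs-quality trade-off described here can be summarized as two parameter sets (a sketch only: the option names are the CatBoost parameters mentioned in this thread, combined with the benchmark's iterations=100, depth=10, learning_rate=0.1; the dicts are not the benchmark's actual script):

```python
# Base settings used throughout this benchmark thread.
BASE = {"iterations": 100, "depth": 10, "learning_rate": 0.1}

# Quality-oriented: CatBoost defaults, which allow deep categorical
# feature combinations (slower, better AUC).
quality_params = dict(BASE)

# Speed-oriented: disable feature combinations and one-hot encode any
# categorical feature with up to 255 distinct values (255 is the hard
# limit mentioned in the error message above).
speed_params = {**BASE, "max_ctr_complexity": 1, "one_hot_max_size": 255}

print(sorted(set(speed_params) - set(quality_params)))
# ['max_ctr_complexity', 'one_hot_max_size']
```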
In the meantime, I would suggest also trying a dataset with many features.
We will also allow for one-hot-max-size > 255
@annaveronika I made a very simple test of the lightgbm regressor vs the catboost regressor, without any categorical variables:
import lightgbm as lgb
from catboost import CatBoostRegressor
from sklearn.datasets import make_regression
import timeit

def wrapper(func, *args, **kwargs):
    def wrapped():
        return func(*args, **kwargs)
    return wrapped

X, y = make_regression(n_samples=10000, n_features=500, random_state=0)
gbm1 = lgb.LGBMRegressor(objective='regression', n_estimators=100)
gbm2 = CatBoostRegressor(loss_function='RMSE', n_estimators=100, verbose=False)

print('lightgbm time: ', timeit.timeit(wrapper(gbm1.fit, X, y=y), number=1))
print('catboost time: ', timeit.timeit(wrapper(gbm2.fit, X, y=y), number=1))
The results are:
lightgbm time: 3.8449372718470847
catboost time: 20.269001481390248
I upgraded to latest versions of both packages today:
catboost.__version__
Out[3]: '0.6.1'
lgb.__version__
Out[4]: '2.1.0'
Any idea what's up? Should the speeds be on par for this toy example?
You need to set thread_count for both of them to the same number, for example 16, so that training is parallelized with the same number of threads. Also, LightGBM builds different trees by default, so a fair comparison needs to take that into account: to build more or less the same trees you need to set num_leaves=64 in LightGBM.
But there will still be a difference, and the reason is that we do expensive preprocessing of the data before starting the iterations, proportional to how many distinct values the data contains. If you generate the data at random, all values will be distinct, so preprocessing will be long; for real data it is usually shorter. Also, in real scenarios with a thousand iterations this preprocessing doesn't play a role - here it is more than half of the total time.
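The distinct-values point can be illustrated without CatBoost itself: border selection has to consider every distinct feature value, and randomly generated floats are essentially all distinct, while real-world (or rounded) data typically has far fewer. A minimal pure-Python illustration:

```python
import random

# Synthetic continuous data, like make_regression produces: ~all values
# are distinct, so value-based preprocessing has maximal work to do.
random.seed(0)
n = 10_000
continuous = [random.random() for _ in range(n)]

# Rounding to 2 decimals mimics low-cardinality real-world measurements.
rounded = [round(x, 2) for x in continuous]

print(len(set(continuous)))  # ~10000 distinct values -> slow border search
print(len(set(rounded)))     # at most 101 distinct values -> fast
```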
We have other preprocessing schemes, you can set feature_border_type='Median' or 'GreedyLogSum', they are much faster.
So your script with these parameters:
gbm1 = lgb.LGBMRegressor(objective='regression', n_estimators=100, num_leaves=64)
gbm2 = CatBoostRegressor(loss_function='RMSE', n_estimators=100, verbose=False, thread_count=16, feature_border_type='GreedyLogSum')
will give results:
('lightgbm time: ', 5.4255828857421875)
('catboost time: ', 4.542175054550171)
espetrov@park:~/svn/trunk/arcadia$ python github_catboost_regr.py
('lightgbm time: ', 4.835729122161865)
('catboost time: ', 4.658789873123169)
espetrov@park:~/svn/trunk/arcadia$ python github_catboost_regr.py
('lightgbm time: ', 5.37877082824707)
('catboost time: ', 4.683574914932251)
espetrov@park:~/svn/trunk/arcadia$ python github_catboost_regr.py
('lightgbm time: ', 6.206507205963135)
('catboost time: ', 4.6429479122161865)
espetrov@park:~/svn/trunk/arcadia$ python github_catboost_regr.py
('lightgbm time: ', 9.472325086593628)
('catboost time: ', 4.676542043685913)
This shows that catboost is a little faster and stable in speed, while LightGBM's time varies across runs.
Actually, LightGBM performs best with 1 thread per core, and it uses 32 threads by default here, so the comparison needs to be adjusted - if we run it without hyperthreading, we get results like these:
espetrov@park:~/svn/trunk/arcadia$ OMP_NUM_THREADS=16 python github_catboost_regr.py
('lightgbm time: ', 3.2300100326538086)
('catboost time: ', 4.494148015975952)
espetrov@park:~/svn/trunk/arcadia$ OMP_NUM_THREADS=16 python github_catboost_regr.py
('lightgbm time: ', 3.648721933364868)
('catboost time: ', 4.5552661418914795)
espetrov@park:~/svn/trunk/arcadia$ OMP_NUM_THREADS=16 python github_catboost_regr.py
('lightgbm time: ', 3.2850260734558105)
('catboost time: ', 4.801781177520752)
espetrov@park:~/svn/trunk/arcadia$ OMP_NUM_THREADS=16 python github_catboost_regr.py
('lightgbm time: ', 3.6587960720062256)
('catboost time: ', 4.60023307800293)
espetrov@park:~/svn/trunk/arcadia$ OMP_NUM_THREADS=16 python github_catboost_regr.py
('lightgbm time: ', 3.0055198669433594)
('catboost time: ', 4.592664957046509)
espetrov@park:~/svn/trunk/arcadia$ OMP_NUM_THREADS=16 python github_catboost_regr.py
('lightgbm time: ', 3.0650908946990967)
('catboost time: ', 4.515332937240601)
And for large datasets we get the same speed (again 16 threads, 64 leaves, no hyperthreading, now with 100k docs):
espetrov@park:~/svn/trunk/arcadia$ python github_catboost_regr.py
('lightgbm time: ', 12.613971948623657)
('catboost time: ', 13.314411163330078)
espetrov@park:~/svn/trunk/arcadia$ python github_catboost_regr.py
('lightgbm time: ', 12.66583800315857)
('catboost time: ', 12.891108989715576)
Back to the airline data (1M): with max_ctr_complexity=1 it runs in 15 sec with AUC=0.7347226.
With max_ctr_complexity=1 and one_hot_max_size=250 it errors out (the same bug, I guess):
Error in catboost.train(learn_pool = dx_train, test_pool = NULL, params = params) :
util/generic/hash.h:1654: Key not found in hashtable: -2141854053
@szilard could you try upgrading the version? This bug should have been fixed today in version 0.6.1.1.
One more thing about the airline data. It is a very special dataset since it has a small number of features. At that size the bottleneck of the algorithm changes: it is usually the selection of the tree structure, but if you have fewer than 10 features, the bottleneck is the calculation of the resulting leaf values.
When calculating leaf values we do several gradient steps inside one tree, which makes this particular step longer; this is usually not noticeable. To compare implementation speed you can set leaf_estimation_iterations=1, but for quality purposes I would recommend the defaults.
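Following this advice, an apples-to-apples speed configuration for a few-features dataset like airline would add leaf_estimation_iterations=1 to the parameters already used in this thread (a sketch, not the benchmark's actual script):

```python
# Speed-comparison parameters for a dataset with few features: one
# gradient step per tree instead of CatBoost's default of several, so
# the leaf-value estimation cost is comparable to other GBM libraries.
speed_comparison_params = {
    "iterations": 100,
    "depth": 10,
    "learning_rate": 0.1,
    "leaf_estimation_iterations": 1,  # one gradient step per tree
}
print(speed_comparison_params["leaf_estimation_iterations"])  # 1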
OK, now with one_hot_max_size=255 it runs in 49 sec with AUC=0.7388133.
With max_ctr_complexity=1 and one_hot_max_size=255 it runs in 13 sec with AUC=0.733911.
While having more datasets is number 1 on my wishlist for a more complete benchmark (see e.g. my KDD talk), I'm still doing this in my spare time, so it's probably not going to happen any time soon.
Just to keep in mind the other runs: without one-hot encoding it runs in 65 sec with AUC=0.7424685; without one-hot encoding but with max_ctr_complexity=1 it runs in 14 sec with AUC=0.7338973.
I'm going to try the GPU version as well. As far as I can see, the R package does not have GPU support, right? (In that case I guess I'll have to use the Python API.)
Summary:
airline dataset, 1M records
from R on r4.8xlarge (32 cores)
iterations = 100, depth = 10, learning_rate = 0.1

                                              runtime   AUC
defaults                                      65 sec    0.7426548
one_hot_max_size=255                          48 sec    0.7376878
max_ctr_complexity=1                          15 sec    0.7345624
max_ctr_complexity=1, one_hot_max_size=255    13 sec    0.7336248
leaf_estimation_iterations=1                  65 sec    0.742523
@annaveronika Thank you for your answers - they are very helpful. There is a lot of in-depth knowledge in them. May I suggest a blog post or doc or example titled something like "How to make catboost as fast as possible" wherein you cover what you did in this thread? I understand there will likely be a hit in performance, but some practitioners may be OK with that.
on GPU:
p3.2xlarge 1 GPU Tesla V100
Ubuntu 16.04
CUDA 8.0
from Python with task_type = "GPU"
it trains for 5 sec, but then .fit() hangs for a while (transferring data back to the CPU?) and the wall time is 35 sec.
Also the accuracy is pretty bad: AUC=0.68341.
It's again a bug (both the accuracy and the wait after training). The code of the fix is already on github, but not yet on pypi; it will be there in about two days, together with a beta version of multi-machine GPU training. You could try building from source using the instructions here https://tech.yandex.com/catboost/doc/dg/concepts/python-installation-docpage/ or wait for the fix on pypi.
One more thing about the speed: the current version uses feature parallelization, which will not give optimal speedups for 8 features. A document-parallel version will come soon. For datasets with many features the current GPU training gives up to a 46x speedup over CPU, depending on the dataset size.
Thanks for the update, I'll wait 2 days and try again.
Results now (0.6.2):
96: learn: 0.3879724063 total: 5.45s remaining: 169ms
97: learn: 0.3878293437 total: 5.5s remaining: 112ms
98: learn: 0.3877669062 total: 5.55s remaining: 56.1ms
99: learn: 0.3876353125 total: 5.61s remaining: 0us
CPU times: user 40.5 s, sys: 5.46 s, total: 45.9 s
Wall time: 21 s
Out[15]: <catboost.core._CatBoostBase at 0x7fa96d514190>
In [17]: metrics.roc_auc_score(y_test, y_pred)
Out[17]: 0.7417231363364158
The GPU/CPU usage by gpustat/mpstat:
[0] Tesla V100-SXM2-16GB | 40'C, 0 % | 0 / 16160 MB |
[0] Tesla V100-SXM2-16GB | 40'C, 4 % | 436 / 16160 MB | root(426M)
[0] Tesla V100-SXM2-16GB | 40'C, 3 % | 15374 / 16160 MB | root(15364M)
[0] Tesla V100-SXM2-16GB | 40'C, 3 % | 15374 / 16160 MB | root(15364M)
[0] Tesla V100-SXM2-16GB | 40'C, 3 % | 15374 / 16160 MB | root(15364M)
[0] Tesla V100-SXM2-16GB | 42'C, 86 % | 15374 / 16160 MB | root(15364M)
[0] Tesla V100-SXM2-16GB | 42'C, 83 % | 15374 / 16160 MB | root(15364M)
[0] Tesla V100-SXM2-16GB | 42'C, 84 % | 15374 / 16160 MB | root(15364M)
[0] Tesla V100-SXM2-16GB | 43'C, 85 % | 15374 / 16160 MB | root(15364M)
[0] Tesla V100-SXM2-16GB | 43'C, 81 % | 15374 / 16160 MB | root(15364M)
[0] Tesla V100-SXM2-16GB | 41'C, 0 % | 438 / 16160 MB | root(428M)
[0] Tesla V100-SXM2-16GB | 41'C, 0 % | 438 / 16160 MB | root(428M)
[0] Tesla V100-SXM2-16GB | 41'C, 0 % | 438 / 16160 MB | root(428M)
[0] Tesla V100-SXM2-16GB | 40'C, 0 % | 438 / 16160 MB | root(428M)
[0] Tesla V100-SXM2-16GB | 40'C, 0 % | 438 / 16160 MB | root(428M)
[0] Tesla V100-SXM2-16GB | 40'C, 0 % | 438 / 16160 MB | root(428M)
[0] Tesla V100-SXM2-16GB | 40'C, 0 % | 438 / 16160 MB | root(428M)
[0] Tesla V100-SXM2-16GB | 40'C, 0 % | 438 / 16160 MB | root(428M)
[0] Tesla V100-SXM2-16GB | 40'C, 0 % | 438 / 16160 MB | root(428M)
[0] Tesla V100-SXM2-16GB | 40'C, 0 % | 438 / 16160 MB | root(428M)
[0] Tesla V100-SXM2-16GB | 40'C, 0 % | 438 / 16160 MB | root(428M)
11:18:03 PM all 11.67 0.00 12.55 0.25 0.00 0.00 0.00 0.00 0.00 75.53
11:18:04 PM all 12.78 0.00 0.25 0.00 0.00 0.00 0.00 0.00 0.00 86.97
11:18:05 PM all 17.17 0.00 10.78 0.13 0.00 0.00 0.13 0.00 0.00 71.80
11:18:06 PM all 11.40 0.00 12.91 0.00 0.00 0.00 0.00 0.00 0.00 75.69
11:18:07 PM all 10.68 0.00 8.54 0.00 0.00 0.00 0.00 0.00 0.00 80.78
11:18:08 PM all 16.92 0.00 9.02 0.00 0.00 0.00 0.00 0.00 0.00 74.06
11:18:09 PM all 17.38 0.00 8.62 0.00 0.00 0.00 0.00 0.00 0.00 74.00
11:18:10 PM all 17.40 0.00 8.01 0.00 0.00 0.00 0.00 0.00 0.00 74.59
11:18:11 PM all 10.79 0.00 3.81 0.00 0.00 0.00 0.25 0.00 0.00 85.15
11:18:12 PM all 9.69 0.00 3.83 0.00 0.00 0.00 0.00 0.00 0.00 86.48
11:18:13 PM all 10.56 0.00 3.18 0.00 0.00 0.00 0.13 0.00 0.00 86.13
11:18:14 PM all 11.22 0.00 3.19 0.00 0.00 0.00 0.13 0.00 0.00 85.46
11:18:15 PM all 10.06 0.00 4.33 0.00 0.00 0.00 0.25 0.00 0.00 85.35
11:18:16 PM all 27.39 0.00 5.61 0.00 0.00 0.00 0.00 0.00 0.00 67.01
11:18:17 PM all 92.38 0.00 7.38 0.00 0.00 0.00 0.00 0.00 0.00 0.25
11:18:18 PM all 99.62 0.00 0.38 0.00 0.00 0.00 0.00 0.00 0.00 0.00
11:18:19 PM all 78.35 0.00 1.75 0.00 0.00 0.00 0.00 0.00 0.00 19.90
11:18:20 PM all 8.75 0.00 5.50 0.00 0.00 0.00 0.00 0.00 0.00 85.75
11:18:21 PM all 13.75 0.00 0.25 0.00 0.00 0.00 0.00 0.00 0.00 86.00
11:18:22 PM all 13.64 0.00 0.38 0.00 0.00 0.00 0.00 0.00 0.00 85.98
11:18:23 PM all 13.50 0.00 0.50 0.00 0.00 0.00 0.12 0.00 0.00 85.88
11:18:24 PM all 13.88 0.00 0.12 0.00 0.00 0.00 0.00 0.00 0.00 86.00
11:18:25 PM all 25.53 0.00 0.50 0.13 0.00 0.00 0.00 0.00 0.00 73.84
11:18:26 PM all 0.00 0.00 0.12 0.00 0.00 0.00 0.00 0.00 0.00 99.88
So while training on the GPU takes only 5 sec, the total training time is 20 sec. There are about 5 sec before GPU training and 10 sec after it when some computation happens on the CPU. I wonder what that is - could you elaborate (and maybe it can be cut/optimized)?
The preprocessing consists of data binarization, calculation of part of the statistics for the categorical features, and loading everything onto the GPU.
The postprocessing consists of calculating all selected statistics on the categorical features and loading everything back to the CPU.
These parts will be sped up, but for training 100 iterations on a V100 GPU they will not be the bottleneck anyway. In real life you don't train for 100 iterations, so I don't think we should specifically optimize for that.
Also, could you check that you are running 16 threads?
@sergeyf We are planning to provide this guide. Here is the issue: https://github.com/catboost/catboost/issues/253
Thank you!
Thanks @annaveronika for the explanation of the processing before and after the GPU computation. I added thread_count = multiprocessing.cpu_count() to "force" multithreading, but the wall time is the same 20 sec (and monitoring CPU utilization e.g. with htop shows the same pattern as before; sometimes only 1 CPU core is utilized).
Not sure what you mean by "in real life you don't train for 100 iterations". For many datasets in practice you may overfit after a few hundred iterations, so one should do early stopping. Also, I would assume 200 iterations take roughly 2x the time of 100 iterations (at least with the other tools on CPU), so it shouldn't really matter how many iterations we benchmark with (as long as we do it across the board).
catboost GPU 200 iterations: GPU time 11sec (roughly 2x), wall time 32sec (so the CPU part increased only from 15sec to 21sec, not 2x).
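The early stopping mentioned above can be sketched generically, independent of any one library (a hypothetical helper, not the CatBoost/LightGBM API): keep training until the validation metric has not improved for a fixed number of consecutive rounds.

```python
# Generic early-stopping scan: return the 1-based round with the best
# (highest) validation score, stopping `patience` rounds after the last
# improvement. This is the logic GBM libraries implement internally.
def best_round(val_scores, patience=3):
    best, best_i, since = float("-inf"), 0, 0
    for i, s in enumerate(val_scores, start=1):
        if s > best:
            best, best_i, since = s, i, 0  # new best: reset the counter
        else:
            since += 1
            if since >= patience:          # no improvement for too long
                break
    return best_i

# AUC improves, plateaus, then degrades: training should stop at round 4.
print(best_round([0.70, 0.72, 0.74, 0.745, 0.744, 0.743, 0.742]))  # 4
```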
In comparison, xgboost has no pre- and post-GPU computation lag:
[0] Tesla V100-SXM2-16GB | 41'C, 50 % | 602 / 16160 MB | ubuntu(592M)
[0] Tesla V100-SXM2-16GB | 41'C, 48 % | 602 / 16160 MB | ubuntu(592M)
[0] Tesla V100-SXM2-16GB | 41'C, 49 % | 602 / 16160 MB | ubuntu(592M)
[0] Tesla V100-SXM2-16GB | 41'C, 47 % | 602 / 16160 MB | ubuntu(592M)
[0] Tesla V100-SXM2-16GB | 42'C, 47 % | 602 / 16160 MB | ubuntu(592M)
[0] Tesla V100-SXM2-16GB | 42'C, 47 % | 602 / 16160 MB | ubuntu(592M)
[0] Tesla V100-SXM2-16GB | 42'C, 45 % | 602 / 16160 MB | ubuntu(592M)
[0] Tesla V100-SXM2-16GB | 42'C, 48 % | 602 / 16160 MB | ubuntu(592M)
[0] Tesla V100-SXM2-16GB | 42'C, 48 % | 602 / 16160 MB | ubuntu(592M)
[0] Tesla V100-SXM2-16GB | 42'C, 48 % | 602 / 16160 MB | ubuntu(592M)
[0] Tesla V100-SXM2-16GB | 42'C, 48 % | 602 / 16160 MB | ubuntu(592M)
[0] Tesla V100-SXM2-16GB | 42'C, 45 % | 604 / 16160 MB | ubuntu(594M)
[0] Tesla V100-SXM2-16GB | 42'C, 47 % | 604 / 16160 MB | ubuntu(594M)
03:00:58 AM all 18.63 0.00 1.61 0.00 0.00 0.00 0.00 0.00 0.00 79.75
03:00:59 AM all 19.10 0.00 1.13 0.00 0.00 0.00 0.00 0.00 0.00 79.77
03:01:00 AM all 19.20 0.00 1.63 0.00 0.00 0.00 0.00 0.00 0.00 79.17
03:01:01 AM all 18.84 0.00 2.01 0.00 0.00 0.00 0.00 0.00 0.00 79.15
03:01:02 AM all 20.70 0.00 1.87 0.00 0.00 0.00 0.00 0.00 0.00 77.43
03:01:03 AM all 21.83 0.00 1.38 0.00 0.00 0.00 0.00 0.00 0.00 76.79
03:01:04 AM all 23.82 0.00 1.37 0.00 0.00 0.00 0.00 0.00 0.00 74.81
03:01:05 AM all 23.28 0.00 1.75 0.00 0.00 0.00 0.00 0.00 0.00 74.97
03:01:06 AM all 22.54 0.00 2.05 0.00 0.00 0.00 0.00 0.00 0.00 75.42
03:01:07 AM all 26.30 0.00 1.73 0.00 0.00 0.00 0.00 0.00 0.00 71.98
03:01:08 AM all 25.97 0.00 1.87 0.00 0.00 0.00 0.00 0.00 0.00 72.16
03:01:09 AM all 23.23 0.00 1.14 0.00 0.00 0.00 0.00 0.00 0.00 75.63
03:01:09 AM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
03:01:10 AM all 25.19 0.00 1.63 0.00 0.00 0.00 0.00 0.00 0.00 73.18
03:01:11 AM all 23.02 0.00 1.89 0.00 0.00 0.00 0.00 0.00 0.00 75.09
03:01:12 AM all 1.38 0.00 0.25 0.00 0.00 0.00 0.00 0.00 0.00 98.38
Similarly for lightgbm GPU, though the lightgbm implementation seems less efficient (it's only using ~5% of the GPU and it's slower).
I mean that 100 iterations is almost never optimal. There is a tradeoff between the learning rate and the number of iterations: the lower the learning rate, the more iterations you need, and the better the quality - until, at some learning rate, it converges to the best quality. To make sure you get good quality we set the learning rate to 0.03 by default, which in many cases is good enough.
And for large datasets you need more iterations than for small ones at a fixed learning rate - usually thousands of iterations for the best quality. And you need the GPU most of all for large datasets, because on them CPU training is really slow. For small datasets there is not a huge difference between GPU and CPU training.
I actually agree with what you are saying above.
The setup trees=100, learn_rate=0.1, depth=10 is not to be considered optimal; sure, often a smaller learning rate is better, and for larger datasets you get better accuracy with more trees.
The goal of my little benchmark is to compare speed (and also to check that accuracy is not really bad, which would be a sign of a bug, e.g. the one it helped you find a few days ago).
With the CPU versions the speed has been pretty much linear in the number of iterations (and dataset size); also, deeper trees are slower, etc. So I could use trees=1000, but it would just be roughly 10x the time. I could change the other params, but most likely all the tools would move in similar ways.
However, I did all the above with the mindset of 2-3 years ago when there were no GPU versions (that's when I started the other "main" benchm-ml GitHub repo). I see now that some of those premises about runtime scaling do not hold on the GPU (e.g. vs dataset size or number of trees), so I might experiment a bit with changing the params in the next few days (for all tools).
Btw, are you guys planning to make the GPU version available from R any time soon as well?
We definitely will, here is the issue: https://github.com/catboost/catboost/issues/255 But I cannot tell you the timing now; we need to finish several other tasks first - open-sourcing distributed CPU training and more ranking modes.
@szilard
JFYI: attached is a plot of CatBoost GPU vs CPU (dual-socket Intel Xeon E2560v2) speed comparisons for different sample counts, on Tesla K40, GTX1080Ti and V100 (the plot was built from samples of our internal dataset with approximately 700 numerical features; the K40 is ≈6 times faster than the CPU, but that's not easy to see on the plot because of the V100, which is ≈45 times faster). The benchmark was run with the -x32 option; for the default -x128 the results are slightly worse.
Also, I would assume that 200 iterations takes more or less 2x 100 iterations (at least with the other tools on CPU), so it shouldn't really matter how many 100 iterations we benchmark (if we do that across the board).
It's not true. For histogram-based decision-tree algorithms, learning the full ensemble is in general not a linear function. CatBoost (as well as LightGBM) uses at most half of the data to compute the necessary statistics for splits after the first leaf is built. For some datasets, splits in later trees are highly imbalanced, and in that case those trees are learned faster than the first, balanced ones.
Such small benchmarks would be almost correct (in terms of speed) for oblivious trees (because they are symmetric, which gives more balanced trees), but could be very misleading for GBMs with leaf-wise trees (like LightGBM). I have seen several examples where LightGBM learned small, simple trees in the first iterations and started building very deep, imbalanced trees afterwards.
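The "at most half of the data" point above comes from the standard histogram-subtraction trick used by histogram-based GBMs: build the histogram for the smaller child directly, then obtain the larger child's histogram as parent minus smaller, with no second pass over the data. A minimal sketch (the function and the toy bin data are illustrative, not library code):

```python
from collections import Counter

def split_histograms(parent_bins, smaller_child_bins):
    """Given the parent's bin values and the bin values of the smaller
    child, derive the larger child's histogram by subtraction instead
    of scanning its rows."""
    parent = Counter(parent_bins)
    smaller = Counter(smaller_child_bins)
    larger = parent - smaller  # no pass over the larger child's data
    return smaller, larger

parent = [0, 0, 1, 2, 2, 2, 3]  # binned feature values at the parent node
small = [0, 2]                  # rows routed to the smaller child
small_h, large_h = split_histograms(parent, small)
print(dict(large_h))  # {0: 1, 1: 1, 2: 2, 3: 1}
```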
@Noxoomo For CPU lightgbm, training time and AUC vs the number of trees (with the data and code from this repo):
n_t tm auc tmr
1 100 5.351 0.7660324 5.351000
2 300 14.618 0.7721249 4.872667
3 1000 40.960 0.7735019 4.096000
4 3000 119.199 0.7724849 3.973300
so runtime is not dramatically far from linear in the number of trees (the last trees are not even 2x faster than the first few)
The code added to the code in this repo:
d_res <- data.frame()
for (n_t in c(100, 300, 1000, 3000)) {
  tm <- system.time({
    md <- lgb.train(data = dlgb_train,
                    objective = "binary",
                    nrounds = n_t, num_leaves = 512, learning_rate = 0.1,
                    verbose = 0)
  })[[3]]
  phat <- predict(md, data = X_test)
  rocr_pred <- prediction(phat, d_test$dep_delayed_15min)
  auc <- performance(rocr_pred, "auc")@y.values[[1]]
  d_res <- rbind(d_res, data.frame(n_t, tm, auc, tmr = tm / n_t * 100))
}
d_res
CPU xgboost with tree_method = "hist"
n_t tm auc tmr
1 100 17.820 0.7494959 17.8200
2 300 39.372 0.7562533 13.1240
3 1000 122.849 0.7654567 12.2849
4 3000 323.340 0.7715660 10.7780
GPU xgboost
n_t tm auc tmr
1 100 7.857 0.7482401 7.857
2 300 16.03 0.7562974 5.343333
3 1000 40.812 0.7668228 4.0812
4 3000 114.407 0.7710445 3.813567
New boosting lib from yandex:
https://github.com/catboost/catboost