Closed Zahlii closed 3 years ago
Some further ones, this time including categorical-only features
(Plots attached: Classifier, Regression)
What's the issue here? The plots look fine to me. Some notes:
- You could plot the trees via `lightgbm.plot_tree` to make sure they are not somehow degenerate.
- The parallelization in lleaves is kept pretty simple and just implemented in Python, whereas LightGBM calls pthreads directly from C++ afaik. This means the parallelization overhead of lleaves is larger, hence the break-even comes somewhat late.

@siboehm no real issue here; I just wanted to share the findings I had based on the benchmark. To me, the important take-away is that for most inference payloads WE are seeing (usually 1-100 samples at a time), lleaves provides a performance gain, although only with parallelization disabled. Since the break-even can vary wildly, I think it may be important for high-performance settings to smartly toggle the parallelization on/off depending on the number of samples to be predicted at once.
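A minimal sketch of such a toggle (the ~1k-sample break-even is just the value observed in this benchmark and should really be measured per machine and model; `predict(data, n_jobs=...)` matches how lleaves is called elsewhere in this thread):

```python
import os

# Hypothetical break-even point, taken from the benchmark above;
# in practice this should be measured per machine/model.
BREAK_EVEN_SAMPLES = 1000

def choose_n_jobs(n_samples: int, break_even: int = BREAK_EVEN_SAMPLES) -> int:
    """Single-threaded below the break-even batch size, all cores above it."""
    if n_samples < break_even:
        return 1
    return os.cpu_count() or 1

def predict_smart(model, data):
    # Wrapper around a compiled lleaves model: pick the thread count
    # based on the batch size before predicting.
    return model.predict(data, n_jobs=choose_n_jobs(len(data)))
```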
That's true! Thanks for sharing your benchmark results. I thought there was some performance issue you were bringing up, but even after squinting hard at the plots I couldn't see anything out of the ordinary :D So I'm happy lleaves is working well for you!
Regarding the parallelization: the number of threads currently defaults to `os.cpu_count()`. On a CPU with Hyperthreads this will be 2x the number of physical cores. Alternatively lleaves could default to something like `os.cpu_count() / 2`, which probably has much less overhead for only a slight dip in performance.

If it's ok with you feel free to close the issue, but do keep me in the loop if you find any other outliers / observations :) I'm interested in how people are using lleaves and whether it makes more sense to develop the library in the easy-to-use or the highest-possible-performance direction.
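For illustration, a hedged sketch of that alternative default (assuming `os.cpu_count()` reports logical cores, i.e. 2x the physical cores on a hyperthreaded CPU):

```python
import os

def default_n_threads() -> int:
    # os.cpu_count() counts logical cores; halving it approximates the
    # number of physical cores on a hyperthreaded CPU.
    # Guard against None (cpu_count can fail) and against 0.
    logical = os.cpu_count() or 1
    return max(1, logical // 2)
```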
Hi, currently evaluating this as a potential performance enhancement on our MLOps / Inference stack.
Thought I'd give some numbers here (based on a MacBook Pro 2019).
Test set-up as follows:
a) Generate artificial data: X = 1E6 x 200 float64; Y = X.sum() for regression, Y = X.sum() > 100 for the binary classifier.
b) For n_feat in [...]: fit a model on 1000 samples and n_feat features; compile the model.
c) For batchsize in [...]: predict a randomly sampled batch of the data items 10 times, using (1) LGBM.predict(), (2) lleaves.predict(), (3) lleaves.predict(n_jobs=1); measure the TOTAL time taken.
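Step a) can be sketched like this (a scaled-down version; the uniform distribution and the use of numpy are my assumptions, and the classification cut is set to half the expected row sum to mirror the `> 100` threshold for 200 features):

```python
import numpy as np

rng = np.random.default_rng(42)

# Scaled down from 1E6 x 200 for illustration.
n_samples, n_features = 1_000, 20
X = rng.random((n_samples, n_features))   # float64 by default

y_reg = X.sum(axis=1)                     # regression target: row-wise sum
# For 200 uniform features the expected row sum is 100, so the original
# "> 100" cut splits the data roughly in half; mirror that here.
y_cls = X.sum(axis=1) > n_features / 2    # binary classification target
```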
For regression results are:
Independent of the number of features, the break-even between parallel lleaves and n_jobs=1 seems to be around 1k samples at once. Using this logic, we would get better performance than LGBM at any number of samples.
For classification:
Here, too, the break-even is around 1k samples.
For classification with HIGHLY IMBALANCED data (1/50 positive), the break-even only occurs at around 10k samples. Any ideas on why this is the case?
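For reference, one way to construct such an imbalanced target (the quantile-threshold approach is my guess at reproducing the 1/50 positive rate, not necessarily how it was done in the benchmark above):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((10_000, 20))
scores = X.sum(axis=1)

# Threshold at the 98th percentile so ~1/50 of the labels are positive.
thresh = np.quantile(scores, 1 - 1 / 50)
y = scores > thresh
```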