microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/
MIT License

[Discussion] efficiency improvements #2791

Open · guolinke opened this issue 4 years ago

guolinke commented 4 years ago

This issue is a call for efficiency improvement ideas for LightGBM.

If you have any ideas, you can discuss them with us here, and open the corresponding issues/pull requests if needed.

StrikerRUS commented 4 years ago

Linking #2934 and #2935.

MaxHalford commented 4 years ago

I have a suggestion, but I'm not sure it's in scope. It concerns improving the prediction speed of trees. This in turn could improve the overall training time, because gradient boosting requires asking the weak models to predict the gradients.

I'm not familiar with the tree implementation in LightGBM. The intuitive layout, where each node holds two pointers to its left and right children, doesn't seem to be the best format for evaluating a tree. Basically, there seem to be two alternative approaches:

  1. Compile the tree to a sequence of ifs and elses. AFAIK this is what treelite allows you to do.
  2. Convert a tree into a heap, which means the nodes are contiguous in memory. Each branch contains two integers that indicate where to go next in the heap. This is quite well explained in this Wikipedia article section. (See the sketch below.)
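
To make option 2 concrete, here is a minimal sketch of evaluating a tree stored as flat arrays with explicit child indices (all names here are illustrative, not LightGBM's actual data structures):

def predict_flat(x, feature, threshold, left, right, value, is_leaf):
    # Start at the root and follow explicit child indices within flat
    # arrays until a leaf is reached. Contiguous storage is more
    # cache-friendly than chasing per-node pointers.
    i = 0
    while not is_leaf[i]:
        i = left[i] if x[feature[i]] < threshold[i] else right[i]
    return value[i]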

I thought about suggesting this after reading this great blog post by Andrew Tulloch, which includes some in-depth benchmarks. I don't know if this is something LightGBM already integrates. Also, I'm not well-versed in C++, so I can't help on that side. However, these solutions seem rather simple to implement.

julioasotodv commented 4 years ago

Apart from the possibility of using CUDA instead of OpenCL for GPU trees (even though it would reduce compatibility), xgboost has made a lot of progress using Dask for distributed training (in fact, I believe their other experimental distributed training approaches are now deprecated).

I believe someone started a project for smoother integration between LightGBM and Dask, but I am unable to find it right now...

As for GPUs, I think the folks at RAPIDS (NVIDIA) are keen to develop more CUDA implementations of different ML algorithms (just as they have done with xgboost, where they rewrote the library's whole GPU tree learner). Maybe contacting them would be a good idea.

jameslamb commented 4 years ago

dask-lightgbm is here: https://github.com/dask/dask-lightgbm

It's not something we officially maintain or contribute to yet, but I agree it could be a useful thing for us to help with.

guolinke commented 3 years ago

@MaxHalford Thanks very much. Actually, LightGBM can already convert a tree model to C++ code. However, the trees in LightGBM are not balanced trees, so it is not trivial to apply the optimizations from your post. But this is indeed a good potential direction. BTW, we plan to implement symmetric trees (as in CatBoost) in the future (by @shiyu1994), which will be much faster at prediction time and more robust on small datasets.
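
For those unfamiliar with symmetric (oblivious) trees: every node at a given depth shares the same split, so prediction reduces to building a leaf index from one comparison per level. A rough illustrative sketch (names are made up, not CatBoost's or LightGBM's actual code):

def predict_symmetric(x, level_feature, level_threshold, leaf_values):
    # One shared (feature, threshold) pair per tree level; each comparison
    # contributes one bit of the leaf index, so evaluation never branches
    # on the tree structure at all.
    idx = 0
    for f, t in zip(level_feature, level_threshold):
        idx = (idx << 1) | (x[f] >= t)
    return leaf_values[idx]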

MaxHalford commented 3 years ago

@guolinke I think the trick still works. You just need to maintain two arrays that indicate the left and right child of each node. That way you can "walk" to the next node with something like:

node = 0  # start at the root
while node >= 0:  # assuming, say, that a negative index marks a leaf
    if x[features[node]] < thresholds[node]:
        node = left[node]
    else:
        node = right[node]

guolinke commented 3 years ago

Thank you @julioasotodv. Actually, the CUDA implementations in existing tools already integrate most of the "histogram tricks" from LightGBM, so simply re-implementing them would be reinventing the wheel. We plan to revisit the existing implementations and estimate the potential improvements. If there is still room, we will develop a new GPU version of LightGBM.

guolinke commented 3 years ago

@MaxHalford The current prediction logic already works in a very similar fashion; see https://github.com/microsoft/LightGBM/blob/9b2637354a84e0da813f8e2b22608a87eee01e4c/include/LightGBM/tree.h#L572-L584 .

MaxHalford commented 3 years ago

Cheers @guolinke. I don't believe there's much improvement to bring from that side then.

How about the method discussed in this scikit-learn issue? Is it already implemented?

guolinke commented 3 years ago

@MaxHalford I quickly looked through the PR. I think it relates to the Python-side histogram implementation; our C++ implementation is already optimized for both speed and memory.

And we have an LRU histogram pool to further reduce memory cost.
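
Roughly, the pooling idea looks like this (a toy sketch only, not our actual C++ implementation):

from collections import OrderedDict

class LRUHistogramPool:
    # Cap the number of in-memory histograms; evict the least recently
    # used one when the cap is reached, rebuilding on a later miss.
    def __init__(self, capacity):
        self.capacity = capacity
        self.pool = OrderedDict()  # node_id -> histogram buffer

    def get(self, node_id, build_fn):
        if node_id in self.pool:
            self.pool.move_to_end(node_id)  # mark as recently used
            return self.pool[node_id]
        if len(self.pool) >= self.capacity:
            self.pool.popitem(last=False)   # evict least recently used
        hist = build_fn(node_id)            # (re)build histogram on miss
        self.pool[node_id] = hist
        return hist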

ogrisel commented 3 years ago

For the record, LightGBM is currently 1.5x to 2x faster than the master branch of scikit-learn in sequential mode. The difference shrinks with many threads because of the diminishing returns of parallelism beyond 8 to 16 CPU threads, though that depends on the shape of the dataset, the number of nodes, and so on. I haven't yet investigated why LightGBM is so much faster in sequential mode. It wasn't the case 6 months ago, if I remember correctly.

guolinke commented 3 years ago

@ogrisel is sequential mode num_thread=1? Did you test LightGBM 3.0.0? We made several speed optimizations in the 3.0.0 release. In particular, LightGBM now also implements a row-wise histogram algorithm (previously it was column-wise only), which may be faster with fewer threads or fewer features. Given the data and parameters, LightGBM benchmarks both the col-wise and row-wise approaches before training and automatically uses the faster one.
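
Roughly, the difference between the two loop orders is as follows (an illustrative sketch, not the actual C++ implementation, which also accumulates hessians and counts):

import numpy as np

def hist_col_wise(binned, grad, n_bins):
    # binned: (n_rows, n_features) matrix of per-feature bin indices
    n_rows, n_features = binned.shape
    hist = np.zeros((n_features, n_bins))
    for f in range(n_features):      # outer loop over features (columns)
        for i in range(n_rows):
            hist[f, binned[i, f]] += grad[i]
    return hist

def hist_row_wise(binned, grad, n_bins):
    n_rows, n_features = binned.shape
    hist = np.zeros((n_features, n_bins))
    for i in range(n_rows):          # outer loop over rows: each row's
        g = grad[i]                  # gradient is read once, and all of its
        for f in range(n_features):  # feature bins are visited together
            hist[f, binned[i, f]] += g
    return hist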

ogrisel commented 3 years ago

@ogrisel is sequential mode num_thread=1?

I just set the OMP_NUM_THREADS=1 environment variable when running my benchmark script. I suppose this is equivalent.

Did you test LightGBM 3.0.0?

Yes.

We made several speed optimizations in the 3.0.0 release. In particular, LightGBM now also implements a row-wise histogram algorithm (previously it was column-wise only), which may be faster with fewer threads or fewer features.

Ah, thanks, that's probably it then. Indeed, the Higgs boson benchmark I used doesn't have that many features.

jameslamb commented 3 years ago

I've just added a new efficiency-related feature: #3369

If anyone following along here has experience with Apache Arrow, please add any context you think could help on that issue, or let me know if I've made any mistakes.

jameslamb commented 3 years ago

I've started a discussion at https://github.com/dask/community/issues/104 about the possibility of bringing dask-lightgbm into this project, similar to the direction xgboost has gone.

@StrikerRUS @guolinke I've been working with Dask a lot and would be willing to take responsibility for maintenance work on this. I agree with the decisions reached in those two xgboost links above, that having a native integration with Dask that we maintain would lead to a better user experience.

StrikerRUS commented 3 years ago

@jameslamb Awesome proposal! Unfortunately, I'm not familiar with dask, but my intuition was that dask is about scaling training up to multiple machines. Something like "for small data use your own laptop, for big data use dask on several machines, for huge data use Spark on several clusters". Please correct me if I'm wrong. Anyway, my main concern is that by adopting the dask wrapper we will have to ensure its quality with unit tests on our side. But we have no resources to run distributed tests. Our current MPI test just ensures that the MPI version can be built; all tests are executed on one machine.

jameslamb commented 3 years ago

Thanks!

but my intuition was that dask is about scaling training up to multiple machines. Something like "for small data use your own laptop, for big data use dask on several machines, for huge data use Spark on several clusters". Please correct me if I'm wrong

Dask allows for horizontal scaling across multiple machines, but it's also an API for expressing Python work that can be parallelized (even if that parallelization all happens on one machine).

So, for example, people often use Dask DataFrames to work with larger-than-memory data frames on a single machine. You write code that closely follows the pandas API, and Dask takes care of moving data back and forth between disk and memory without you needing to think about it.
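
For example, a minimal sketch of that pandas-like workflow (the file name and columns are made up):

import dask.dataframe as dd

# Read CSVs that may be larger than memory; Dask splits them into
# partitions and only materializes the pieces it needs.
df = dd.read_csv("transactions-*.csv")

# This builds a lazy task graph following the pandas API...
result = df.groupby("customer_id")["amount"].mean()

# ...and compute() executes it, streaming partitions through memory.
print(result.compute())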

So if we follow that proposal, you'd be able to use a Dask-ified version of lightgbm models to train on Dask collections (Dask DataFrame, Dask Array), which would be beneficial even if not scaling out to multiple machines.

by adopting the dask wrapper we will have to ensure its quality with unit tests on our side. But we have no resources to run distributed tests. Our current MPI test just ensures that the MPI version can be built; all tests are executed on one machine.

I don't think we'd have to do that for this to be successful. The dask-lightgbm project already doesn't, and approximates distributed testing with docker-compose.

One of the main design principles of Dask is that it scales from your laptop to a distributed cluster with minimal code changes... that means we can test against a dask.distributed.LocalCluster and have some confidence that it will work on a true multi-node cluster.
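
A minimal sketch of what that single-machine test setup could look like (the training step is a placeholder):

from dask.distributed import Client, LocalCluster

# Spin up a "cluster" of local worker processes on one machine.
cluster = LocalCluster(n_workers=2, threads_per_worker=1)
client = Client(cluster)

# ... run the same Dask-based training code here that would run
# on a real multi-node cluster ...

client.close()
cluster.close()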

Issues that are specific to the multi-node setting might be raised by users and have to be addressed here with manual testing, but that is already the current state of LightGBM-on-Dask with dask-lightgbm, so I don't think it's a reason not to move forward with that proposal.

StrikerRUS commented 3 years ago

@jameslamb Thanks a lot for the detailed response!

... that means we can test against a dask.distributed.LocalCluster and have some confidence that it will work on a true multi-node cluster.

OK, this is great!

jameslamb commented 3 years ago

Thanks! @guolinke what do you think?

If we move forward with this, I'll ask the dask-lightgbm maintainers to make a PR here with the main components, and then I can generate some issues for what else would need to be done for testing, docs, etc.

guolinke commented 3 years ago

awesome! sounds good to me

ppaleja commented 3 years ago

Hi, I hope this is the right place to post this. I was reading through the original LightGBM paper, and it mentions a greedy graph-coloring scheme used for (I think) one of LightGBM's main optimizations, feature bundling. I was wondering why a greedy method was chosen instead of a more targeted approach: the paper mentions that bundling works well in the sparse case, and I believe there are coloring algorithms that specifically target sparse graphs and have stronger provable guarantees than a greedy method. Sorry if this has already been asked, but has there been any work on using a different algorithm for the bundling process? I feel this might be worth looking into, but I wasn't able to find where the bundling happens in the code (if it exists currently), to see if I could 'plug in' another algorithm and check whether it gives a noticeable improvement.
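
For reference, here is my rough understanding of the greedy bundling idea from the paper, as a sketch (names and the conflict budget are illustrative, not the actual C++ implementation):

def greedy_bundle(features, conflict_count, max_conflict):
    # Greedily assign each feature to the first bundle where the total
    # number of conflicts (rows where two bundled features are nonzero
    # together) stays within the budget; otherwise start a new bundle.
    bundles, conflicts = [], []
    for f in features:  # ideally ordered by degree in the conflict graph
        for i, bundle in enumerate(bundles):
            c = sum(conflict_count(f, g) for g in bundle)
            if conflicts[i] + c <= max_conflict:
                bundle.append(f)
                conflicts[i] += c
                break
        else:
            bundles.append([f])
            conflicts.append(0)
    return bundles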

lorentzenchr commented 4 months ago

We made several speed optimizations in the 3.0.0 release. In particular, LightGBM now also implements a row-wise histogram algorithm (previously it was column-wise only), which may be faster with fewer threads or fewer features.

Is it possible to lay out the algorithm for this row-wise histogram computation (pointing to literature, adding code comments, writing a blog post, ...)? I find it very difficult to follow the logic by reading the code.