microsoft / SynapseML

Simple and Distributed Machine Learning
http://aka.ms/spark
MIT License

LightGBM and dynamic allocation #319

Open tchemineau opened 6 years ago

tchemineau commented 6 years ago

Hello,

We are currently trying to use MMLSpark and LightGBM in our Spark cluster. We usually use the default dynamic allocation of executors when starting jobs, but LightGBM does not seem to work with this setting (the main thread that waits for executor connections only replies to the executors created during the initialisation of the main Spark context).

Is there a way to make LightGBM work with dynamic executor allocation?

Thanks!

EDIT: we are using the current master branch

imatiach-msft commented 6 years ago

@tchemineau LightGBM currently does not support adding new executors/machines dynamically after NetworkInit. Adding @guolinke to see if this is a feature that might be added to the underlying library in the future. Otherwise, if you could start all executors before the NetworkInit in the dynamically allocated cluster, you could work around the issue, but that doesn't sound like a real solution.
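The executors-up-front workaround mentioned above can be expressed as Spark configuration: disable dynamic allocation and pin a fixed executor count, so that every worker exists before LightGBM's NetworkInit runs. A minimal sketch (the executor counts, resource sizes, and job file name are illustrative, not from this thread):

```shell
# Sketch: submit with dynamic allocation disabled so all executors
# are up before LightGBM's NetworkInit (values are illustrative).
spark-submit \
  --conf spark.dynamicAllocation.enabled=false \
  --conf spark.executor.instances=8 \
  --conf spark.executor.cores=4 \
  --conf spark.executor.memory=8g \
  my_lightgbm_job.py
```

On YARN this trades away elasticity for a stable worker topology, which is exactly the limitation being discussed in this issue.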

imatiach-msft commented 6 years ago

@tchemineau @guolinke This is the network init API in LightGBM: https://github.com/Microsoft/LightGBM/blob/master/include/LightGBM/c_api.h#L749 We are currently calling it from here: https://github.com/Azure/mmlspark/blob/02642ce4fd1fc447d5e25e30e6a64d0b845da4f1/src/lightgbm/src/main/scala/TrainUtils.scala#L201 I don't believe there is any way to add or remove nodes dynamically currently. I've contacted @guolinke to find out if this is planned in the future.

imatiach-msft commented 6 years ago

@tchemineau I spoke to @guolinke, and he wrote that it is possible to do this as follows (it will involve refactoring some of the current code): "Technologically, this can be achieved by using init_score.

When new nodes arrive, we can do the following to achieve that:

  1. Stop the training on the current nodes, and get the model file trained so far
  2. Init the new nodes
  3. Send the model file from step 1 to the new nodes
  4. Continue training from that model file. (Refer to the continued-training logic in the python-package: https://github.com/Microsoft/LightGBM/blob/master/python-package/lightgbm/engine.py#L112-L124 )

BTW, this logic can also be used for fault tolerance: when some nodes go down, we can restore the training from that moment."

tchemineau commented 6 years ago

Thank you all for your answers and for digging into this question.

For the moment, we have switched to starting a set of static executors in our cluster just for those LightGBM jobs. But even with that, we have some issues (I won't discuss them here; that's not the subject).

It would definitely be a big improvement to MMLSpark/LightGBM to be able to run jobs and let YARN allocate resources based on usage. This would work very well in an AWS EMR deployment, for example, with autoscaling in place.

How do you plan to move forward on this new feature?

imatiach-msft commented 6 years ago

"How do you plan to move forward on this new feature?"

@tchemineau this seems like a big change; I think we still need to stabilize the current implementation first. It sounds like @troszok is still not seeing the same performance as LightGBM run in R, which I need to investigate, and there is another issue where the learner seems to get stuck when using quantile regression, which I need to debug. I'm also currently working on another project (deep learning with hierarchical attention networks) and will be out in July, so maybe sometime in August I'll have a lot more time to implement this. If you have cycles to tackle it, though, I would be glad to help out.

dciborow commented 5 years ago

I believe this may also come up on Databricks, and that LightGBM does not work with autoscaling clusters. (heard a rumor...)

ezerhoun commented 5 years ago

@imatiach-msft I saw that there is a pull request for it; do you have an ETA for it?