microsoft / SynapseML

Simple and Distributed Machine Learning
http://aka.ms/spark
MIT License

Add LightGBM learners to MMLSpark #173

Closed (imatiach-msft closed this 6 years ago)

imatiach-msft commented 7 years ago

Add LightGBM learners to MMLSpark as Spark Estimators and Transformers. We can generate the Java wrappers for LightGBM's native library through SWIG.
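For context, a minimal sketch of what using such an Estimator could look like, assuming the eventual `LightGBMClassifier` class and conventional Spark ML setter names (the package and parameter names here are illustrative, not final):

```scala
// Hypothetical usage of a LightGBM Estimator in a Spark ML workflow.
// LightGBMClassifier and its setters are assumed names for illustration.
import com.microsoft.ml.spark.LightGBMClassifier
import org.apache.spark.ml.feature.VectorAssembler

val assembler = new VectorAssembler()
  .setInputCols(Array("f1", "f2", "f3"))
  .setOutputCol("features")

val lgbm = new LightGBMClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setNumIterations(100)
  .setLearningRate(0.1)

// trainDF / testDF: DataFrames with the raw feature columns and a label column.
val model = lgbm.fit(assembler.transform(trainDF))
val scored = model.transform(assembler.transform(testDF))
```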

ekaterina-sereda-rf commented 6 years ago

And one more thing: do you have any insights about memory usage? For example, if I have 10 GB of training data with 100 features, how much memory do I need to finish training? (I don't know if I formulated this correctly; I understand there are many factors involved, but any examples you can share would really help me.)

imatiach-msft commented 6 years ago

@ekaterina-sereda-rf yes, memory usage is definitely an issue: the data needs to be copied into LightGBM's native code in order to train the distributed learner, so we essentially create a second copy of the dataset on each partition. I would expect a 10 GB dataset to use about 10 GB of memory for Spark and another 10 GB for LightGBM, so approximately 20 GB in total. Are you seeing any memory issues beyond the 2x cost?
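As a rough illustration of that 2x estimate (a back-of-the-envelope sketch only; actual overhead depends on caching, serialization format, and LightGBM's internal representation):

```scala
// Back-of-the-envelope memory estimate for the 2x copy described above,
// assuming dense double-precision features.
val rows = 10000000L      // e.g. ~10M rows
val features = 100
val bytesPerValue = 8L    // dense doubles
val sparkCopyGB = rows * features * bytesPerValue / math.pow(1024, 3)
val totalGB = 2 * sparkCopyGB // one copy in Spark + one native copy in LightGBM
println(f"~$sparkCopyGB%.1f GB in Spark, ~$totalGB%.1f GB total")
```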

imatiach-msft commented 6 years ago

@ekaterina-sereda-rf yes, currently I am parallelizing by the number of executors. Performance could likely be improved by parallelizing by the number of cores; in my testing I used a cluster with one core per executor. I discussed this with @mhamilton723, and it sounds like one executor can process multiple partitions at once across all of its threads. If that is the case, then during network init we could register each thread instead of each executor, which would give a significant speedup when there are multiple cores per executor. I am also considering two possible changes: either push rows into LightGBM in parallel, which may require changes in the LightGBM code itself, or keep a singleton per executor, push native data to it from all threads in one mapPartitions pass, and then run a separate mapPartitions pass to build the LightGBM dataset from that native data. The second approach might be more robust, but I'm not sure the performance would be better, since it requires two mapPartitions calls and considerably more complex code.
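A conceptual sketch of the current per-executor approach (the `lightgbmlib.LGBM_NetworkInit` / `LGBM_NetworkFree` SWIG bindings mirror LightGBM's C API; everything else here, including the host list and port, is illustrative):

```scala
// Conceptual sketch: coalesce to one partition per executor, then do
// network init once per partition inside mapPartitions. With multiple
// cores per executor, this leaves all but one core idle per executor,
// which is the bottleneck discussed above.
import com.microsoft.ml.lightgbm.lightgbmlib

val numExecutors = 3
// host:port list of all workers, gathered on the driver beforehand
val machines = "10.0.0.1:12400,10.0.0.2:12400,10.0.0.3:12400"

val result = df.rdd.coalesce(numExecutors).mapPartitions { rows =>
  lightgbmlib.LGBM_NetworkInit(machines, 12400, 120, numExecutors)
  try {
    // ... copy rows into a native LightGBM dataset and train ...
    Iterator.single(0)
  } finally {
    lightgbmlib.LGBM_NetworkFree()
  }
}
result.count() // force evaluation
```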

deliverator23 commented 6 years ago

Is v0.12 going to be released to Maven Central? I can't see it there yet.

katerinagl commented 6 years ago

Hi :) So, I ran training on your code as-is (without my changes). I have no data prep; the data is ready with features and loaded from Parquet, which is the only step on my side. I have 1 executor with 400 GB of RAM and 64 cores, and for the last 5+ hours I have seen only this line in the logs: INFO LightGBMRegressor: LightGBM worker listening on: 12450. So as far as I can tell, the iterations have not even started. As we discussed above, only one core is occupied, and more than 200 GB of RAM is in use (regarding the earlier question about memory); my dataset is about 20 GB. Maybe I am doing something wrong? Is there a way to see more logs, perhaps on the C++ side?

katerinagl commented 6 years ago

@iainmillar23 I use this resolver, and v0.12 was loaded: resolvers += "MMLSpark Repo" at "https://mmlspark.azureedge.net/maven"
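For reference, the full sbt setup would look roughly like this (the coordinates below match the Azure/mmlspark listing on Maven Central; adjust the Scala version suffix to your build):

```scala
// build.sbt: MMLSpark resolver plus the v0.12 dependency (Scala 2.11 build).
resolvers += "MMLSpark Repo" at "https://mmlspark.azureedge.net/maven"
libraryDependencies += "Azure" % "mmlspark_2.11" % "0.12"
```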

imatiach-msft commented 6 years ago

@katerinagl @iainmillar23 it should eventually be in Maven Central like the other versions here: https://mvnrepository.com/artifact/Azure/mmlspark @elibarzilay I believe all Spark packages get added to Maven Central, is that correct? It may take an extra day or so for it to show up, since we just released a day ago.

imatiach-msft commented 6 years ago

@katerinagl interesting, I'm guessing there must be a bug. I would need to reproduce the issue somehow to understand what is going on. If you use a smaller dataset, do you see the same behavior? Our sample notebooks in the Docker image run on a single executor, so LightGBM would fail in those if that were the issue, and I do have some tests (the fuzzing tests) that run on a single executor: https://github.com/Azure/mmlspark/blob/master/src/lightgbm/src/test/scala/VerifyLightGBMClassifier.scala#L111 Does the code work if you run with Spark ML's GBTClassifier? Does using 64 executors, one per core, mitigate the issue?
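For the GBTClassifier sanity check, a minimal sketch on the same assembled data (trainDF is assumed to be a DataFrame with "features" and "label" columns):

```scala
// Baseline with Spark ML's built-in GBTClassifier: if this trains fine on
// the same data and cluster, the hang is likely specific to LightGBM.
import org.apache.spark.ml.classification.GBTClassifier

val gbt = new GBTClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setMaxIter(20)

val gbtModel = gbt.fit(trainDF)
```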

imatiach-msft commented 6 years ago

@katerinagl @iainmillar23 the package is in Maven Central now; you can see v0.12 there: https://mvnrepository.com/artifact/Azure/mmlspark It seems it just takes a couple of days.

imatiach-msft commented 6 years ago

@katerinagl @peay @ezerhoun I tried running LightGBM on an Azure Databricks cluster configured with 3 worker nodes, 3 executors, and 8 cores per executor. I was hoping to reproduce the network init problem by running multiple cores per executor (as opposed to the HDI case, where there was 1 core per executor), but LightGBM completed. However, I found it was 2x slower than GBTClassifier in this configuration (160 seconds for GBT vs. 311 seconds for LightGBM); I think it only uses one of the cores. I believe the issue @katerinagl ran into when running just one executor is related. I would recommend using as many executors as cores if you use LightGBM. I might be able to fully utilize all cores per executor by spawning socket connections on each thread and opening multiple ports per executor.
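To follow the "as many executors as cores" recommendation, the relevant settings look roughly like this (a sketch; exact values depend on node sizes and the cluster manager):

```scala
// Request one core per executor so each executor maps to one LightGBM
// worker; size executor memory so cores * memory fits on each node.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("lightgbm-training")
  .config("spark.executor.cores", "1")
  .config("spark.executor.memory", "6g") // illustrative value
  .getOrCreate()
```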

katerinagl commented 6 years ago

Hi :) So, when I run multiple executors on one machine, I run into a problem with ports, so I need to change that and bind as many addresses as there are executors. That works. But when I split one machine into several executors, I run into memory problems. So for me the big question is still: what is the difference, from an algorithm/calculation perspective, between running several cores on one executor and running several executors? I succeeded with both variants after my changes, but in the first case I had bad accuracy: on a small dataset all predictions were 0.0 (while the same setup with the same data and 1 core made at least some predictions). With the second case (several executors) I have not completed testing yet; I used 4 executors and it produced some results, but I don't know about 4 cores. I hope there is not too much of a mess here. In general, it would help if you could provide the datasets you use for testing (I tried to find them in the repo but did not see them).

imatiach-msft commented 6 years ago

@katerinagl I've put the Azure Databricks and HDInsight notebooks for the Higgs dataset in my branch here (note that I will not be checking these in): https://github.com/imatiach-msft/mmlspark/tree/ilmat/higgs-notebooks/notebooks/higgs You can get the Higgs CSV dataset here: https://github.com/guolinke/boosting_tree_benchmarks/tree/master/data Specifically this link: https://archive.ics.uci.edu/ml/datasets/HIGGS
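Loading the Higgs CSV into Spark for these notebooks looks roughly like this (a sketch; the dataset has the class label in the first column followed by 28 numeric features, and the path is illustrative):

```scala
// Read HIGGS.csv (label first, then 28 numeric features) and assemble the
// feature columns into a single vector column for training.
import org.apache.spark.ml.feature.VectorAssembler

val raw = spark.read
  .option("inferSchema", "true")
  .csv("/data/HIGGS.csv") // illustrative path
  .toDF("label" +: (1 to 28).map(i => s"f$i"): _*)

val higgs = new VectorAssembler()
  .setInputCols((1 to 28).map(i => s"f$i").toArray)
  .setOutputCol("features")
  .transform(raw)
```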

My TODOs right now are:

1. Investigate how to optimize the multi-core-per-executor scenario. My intuition is that I need to open a socket per thread instead of per executor, as I am doing right now when calling network init; I'm not sure whether each call inside mapPartitions runs per executor or per thread, but it sounds like it is per thread. Another thing I might try is keeping a global singleton per executor to push the data to (which would remove the need to coalesce the data), and then doing the training afterwards.
2. Find a way to repro the network_init failures some people are seeing. The only way I know to repro it is to run multiple LightGBM learners in succession, since the ports take some time to free. There is an easy mitigation, which is to give different base port values in the params when sweeping over the learner (see the sketch below). Eventually it would be good to implement a way to search for free ports on the executors to reduce the chance of failure during network init.
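The base-port mitigation from item 2 would look something like this when sweeping (a sketch; setDefaultListenPort is an assumed setter name, so check the learner's actual params):

```scala
// Stagger the base listen port across successive LightGBM runs so a sweep
// does not collide with ports that are still being freed.
val models = Seq(0.05, 0.1, 0.2).zipWithIndex.map { case (lr, i) =>
  new LightGBMClassifier()
    .setLearningRate(lr)
    .setDefaultListenPort(12400 + i * 100) // assumed setter; leave a gap per run
    .fit(trainDF)
}
```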

elibarzilay commented 6 years ago

Yes, I think that Spark packages will put the jar file into Maven Central, but probably not the LightGBM thing?

imatiach-msft commented 6 years ago

@elibarzilay everything is already in Maven Central: both the LightGBM SWIG Java wrapper (which has been there for a while) and the new mmlspark v0.12 release. That issue is now resolved.

imatiach-msft commented 6 years ago

@katerinagl this thread has gotten quite long, and I'm now noticing some performance issues on GitHub. I've created a separate thread to track the performance issues with LightGBM on clusters configured with multiple cores per executor: https://github.com/Azure/mmlspark/issues/292 I would also like to eventually get a repro of the network init failures some people are having, but I haven't been able to reproduce them yet.

mhamilton723 commented 6 years ago

Closing this, as the learners are already merged and the remaining issues are tracked in other threads.