microsoft / SynapseML

Simple and Distributed Machine Learning
http://aka.ms/spark
MIT License

Add LightGBM learners to MMLSpark #173

Closed imatiach-msft closed 6 years ago

imatiach-msft commented 7 years ago

Add LightGBM learners to MMLSpark as Spark Estimators and Transformers. We can generate Java wrappers through SWIG.

David082 commented 6 years ago

Hello,

Where can I find the package com.microsoft.ml.lightgbm?

Thank you for this great project!

imatiach-msft commented 6 years ago

Hi @David082, we will be uploading it to Maven Central as soon as we are done. It is currently in a private repository. Thank you, Ilya

imatiach-msft commented 6 years ago

@David082 I have checked in the code to generate the SWIG wrappers into the LightGBM repo. To build, you just need to run:

mkdir build
cd build
cmake -DUSE_SWIG=ON ..
make -j4

This will generate a jar file containing the LightGBM c_api wrapped by SWIG. You will need to load the native .so files, as in the lightgbm branch, before calling into the Java API. I am currently working on the LightGBM estimator in the lightgbm branch of mmlspark, where I am doing performance testing and optimization.

imatiach-msft commented 6 years ago

@David082 LightGBM is now available on Maven Central: https://repo.maven.apache.org/maven2/com/microsoft/ml/lightgbm/lightgbmlib/ You can import it with sbt via:

"com.microsoft.ml.lightgbm" % "lightgbmlib" % "2.0.120"

David082 commented 6 years ago

@imatiach-msft Thank you!

ezerhoun commented 6 years ago

Hello,

I saw that a LightGBMClassifier has been added. Do you plan to also add a LightGBMRegressor?

Thank you very much

imatiach-msft commented 6 years ago

Yes, I will add a separate LightGBMRegressor in the near future; I think it should be trivial. I'm also going to extend the current classifier to support multiclass in addition to binary labels, which should be easy to do as well. Adding the ranker may take some planning and thought, though, in order to make it fit into the Spark ecosystem.

ezerhoun commented 6 years ago

@imatiach-msft Thank you !

troszok commented 6 years ago

@imatiach-msft great job with the classifier, thank you very much! We are waiting now for the regressor as well.

imatiach-msft commented 6 years ago

@troszok Thanks! The classifier has been checked in and I have created a PR to add the regressor here: https://github.com/Azure/mmlspark/pull/249

imatiach-msft commented 6 years ago

@troszok the LightGBM regressor has been merged into master. We would love to hear any feedback you might have on the interface and the parameters we currently expose. In my testing, the LightGBMClassifier learner was 10-30% faster in execution time than Spark's GBTClassifier and had a 15% better AUC on the Higgs dataset.

ezerhoun commented 6 years ago

@imatiach-msft Thank you for adding LightGBM regressor !

I am trying to train a LightGBM quantile regressor. However, I don't see how to pass the quantile parameter (the alpha parameter in LightGBM). Do you have an example of how to do it?

imatiach-msft commented 6 years ago

@ezerhoun thank you for the great feedback! I've created a PR which adds the alpha parameter for LightGBMRegressor and a pyspark notebook example showing how to use LightGBM with quantile regression: https://github.com/Azure/mmlspark/pull/254 It's currently in code review. If you need more params and can't wait for us to add them to the learner, you can also put param=value pairs inside the "application" parameter, e.g.:

df = ...  # your training DataFrame
reg = LightGBMRegressor(application='quantile alpha=0.1')
model = reg.fit(df)

The string must be in the format that LightGBM accepts.
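
Scoring then follows the usual Spark ML pattern; a minimal sketch continuing the example above (the "prediction" column name is an assumption based on Spark ML defaults):

preds = model.transform(df)        # adds a prediction column to the DataFrame
preds.select("prediction").show(5)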

ezerhoun commented 6 years ago

@imatiach-msft Thank you for your quick reply and PR! Moreover, I have a couple of other questions:

imatiach-msft commented 6 years ago

@ezerhoun Is the scenario to train in Spark but score in a Java environment? In that case, I think it is possible.

1. For now, you can use the maven coordinates that are produced through the build, but you will need to add our resolver/repository (https://mmlspark.azureedge.net/maven). As soon as we have a new release, you will be able to use coordinates from Maven Central, as explained on the main page.
2. Yes, it should be possible. When you save the model in Spark, one of the folders will contain a text file representing the model, which you can then parse using LightGBM's native C++ API, LightGBM's load-model in Python or R, or the load model in the Java jar that is on Maven.
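
As a rough sketch of that flow in pyspark (assuming the fitted model supports Spark ML's standard save(); the exact name of the text model file inside the saved folder is an assumption here, so inspect the folder to find it):

from mmlspark import LightGBMRegressor
import lightgbm as lgb

# Train in Spark and persist the fitted model.
model = LightGBMRegressor().fit(train_df)   # train_df: your Spark DataFrame
model.save("/models/lgbm")

# Elsewhere (plain Python, no Spark): load the extracted text model with
# LightGBM's native Python API, one of the options mentioned above.
booster = lgb.Booster(model_file="/models/lgbm/<model-text-file>")  # hypothetical path
preds = booster.predict(rows)               # rows: 2-D array of feature values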

imatiach-msft commented 6 years ago

@ezerhoun for now, you can use the following options when running Spark:

--packages com.microsoft.ml.spark:mmlspark_2.11:0.11.dev31+1.gf23ab7852 --repositories https://mmlspark.azureedge.net/maven

Please let me know if you have questions or comments.
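
If you launch Spark programmatically rather than via command-line flags, the same coordinates can be supplied through standard Spark config keys; a minimal sketch (the config keys are standard Spark, the coordinates are the ones above):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .config("spark.jars.packages",
            "com.microsoft.ml.spark:mmlspark_2.11:0.11.dev31+1.gf23ab7852")
    .config("spark.jars.repositories", "https://mmlspark.azureedge.net/maven")
    .getOrCreate())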

ezerhoun commented 6 years ago

@imatiach-msft Thank you for your replies. Indeed, the idea would be to train in Spark and test in Java. I don't know if this is the right place to ask, but I have the following error:

Py4JJavaError: An error occurred while calling o673.fit.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in stage 47.1 failed 4 times, most recent failure: Lost task 4.3 in stage 47.1 (TID 3569, 10.233.104.114, executor 33): java.lang.Exception: Network init call failed in LightGBM with error: Machine list file doesn't contain the local machine

Should I do as in the LightGBM Parallel Learning Guide and create an mlist.txt?

Thanks !

imatiach-msft commented 6 years ago

@ezerhoun I believe I have figured out the issue; there was a bug in my new code. I've created a new build and will send you the new package when it is finished.

imatiach-msft commented 6 years ago

@ezerhoun the fixed package is: --packages com.microsoft.ml.spark:mmlspark_2.11:0.11.dev31+1.g4ad3176c9 --repositories https://mmlspark.azureedge.net/maven

troszok commented 6 years ago

@imatiach-msft great news!!! I forgot to check the progress here. We will try to do a POC with the new model and let you know!

ezerhoun commented 6 years ago

@imatiach-msft I tried the new package and I still have the same problem.

imatiach-msft commented 6 years ago

@ezerhoun could you give me information about your cluster setup? I have validated the following scenarios:

1. Local tests (running under local[*])
2. Running on HDI with yarn and # partitions < # executors (this was failing before the latest PR)
3. Running on HDI with yarn and # partitions >= # executors (this worked before, but my PR broke it yesterday, and I fixed it with the latest package I sent above)

I am wondering if there is some cluster setup that does not work with the current code.

imatiach-msft commented 6 years ago

@ezerhoun it might be faster to discuss over Skype. You can email mmlspark-support@microsoft.com and I can send you a link to a Skype meeting.

mhamilton723 commented 6 years ago

@imatiach-msft should this be closed?

deliverator23 commented 6 years ago

Any idea when the mmlspark v0.12 that includes the new LightGBMClassifier will be released to maven central?

elibarzilay commented 6 years ago

@iainmillar23 Hopefully by Monday (ish).

imatiach-msft commented 6 years ago

@mhamilton723 I think we can leave this open for now, maybe for at least 2 release cycles, since I would still like to add multiclass classification and possibly the ranker

ekaterina-sereda-rf commented 6 years ago

Hi!

--packages com.microsoft.ml.spark:mmlspark_2.11:0.11.dev31+1.g4ad3176c9 --repositories https://mmlspark.azureedge.net/maven.

I tried to use these packages but was not able to download the dependency. Did something change?

elibarzilay commented 6 years ago

We're going to release 0.12 very soon; it'll have that and more. (It got delayed by some technical issues.)

ekaterina-sereda-rf commented 6 years ago

Thank you, waiting for that!

ekaterina-sereda-rf commented 6 years ago

Hi! I have one more question. I found the development version that you posted earlier, but now I've run into a problem: I can't pass all the parameters that I need (max_bin, bagging_fraction, bagging_freq, bagging_seed, feature_fraction, max_depth, min_sum_hessian_in_leaf). I tried to change your code and added them to the parent parameter list in the trait LightGBMParams, so I can see the parameters being set now, but I don't see any effect on training. Are you going to support them? Do you have any quick advice on what else needs to change in the lib so these parameters are taken into account?

imatiach-msft commented 6 years ago

@ekaterina-sereda-rf could you please share your file changes? Those parameters should be supported. You could submit a PR if you like, or I can add those parameters, whichever you prefer. Also, to update on the release: we found some issues in the 0.12 release which are causing some delays, sorry about that.

Ragavenderan commented 6 years ago

Can we train this in Azure Batch?

peay commented 6 years ago

@imatiach-msft I gave the bindings a try, but I ran into some issues when two Spark executors got scheduled on the same worker in a standalone Spark cluster because https://github.com/Microsoft/LightGBM/blob/7d3206e0a43cb7e65f846338a055c298dac55b90/src/network/linkers_socket.cpp#L126 seems to use a fixed port. The first executor did bind to that port, but then the second couldn't. Any advice on how to handle these situations?

imatiach-msft commented 6 years ago

@peay this should supposedly not be an issue if you are using the LightGBMClassifier or LightGBMRegressor -- from here we get the "executor:port" pairs for all executors in the current Spark context and pass them to the network init method in LightGBM:

https://github.com/Azure/mmlspark/blob/master/src/lightgbm/src/main/scala/LightGBMUtils.scala#L57

so supposedly if you have two executors on the same worker, the node list should contain the same executor IP address twice with two different port numbers (the port value is 12400 + executor id). If you are seeing issues, then that must mean there is a bug in this logic. Is there a way I could get a repro of your setup? I noticed another user above encountered a similar issue but unfortunately was unable to give me a repro.
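
As a toy illustration (not library code) of the scheme just described -- each entry in the node list is ip:port with a per-executor port offset:

# Two executors on 10.0.0.5 get distinct ports even though they share an IP.
executors = [("10.0.0.5", 0), ("10.0.0.5", 1), ("10.0.0.6", 2)]
nodes = ",".join("{}:{}".format(ip, 12400 + eid) for ip, eid in executors)
# nodes == "10.0.0.5:12400,10.0.0.5:12401,10.0.0.6:12402"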

imatiach-msft commented 6 years ago

"I gave the bindings a try"

@peay sorry, maybe I misunderstood: are you specifically using the Scala learners LightGBMClassifier and LightGBMRegressor, or the Java bindings from the LightGBM maven artifact directly? https://github.com/Azure/mmlspark/blob/master/src/lightgbm/src/main/scala/LightGBMClassifier.scala https://github.com/Azure/mmlspark/blob/master/src/lightgbm/src/main/scala/LightGBMRegressor.scala

ekaterina-sereda-rf commented 6 years ago

@imatiach-msft sorry for the long absence. Here are my changes: https://gist.github.com/ekaterina-sereda-rf/b7b718da8a14f755b2dfcea62cf54123 The question is: is it enough to pass them to the engine? (Of course I set them in my code after it.) I also have a question about performance: do you have any statistics on how long the model should train on different amounts of data, with different params, and so on? P.S. I wanted to do a pull request, but I saw that I would have to sign something first, so this way seemed easier. But if you help me get started I can do it with pleasure. Thank you!

imatiach-msft commented 6 years ago

@ekaterina-sereda-rf this looks great! Note that you will need to pass those params to the underlying LightGBM C++ API here (similar to the other parameters): https://github.com/Azure/mmlspark/blob/master/src/lightgbm/src/main/scala/TrainParams.scala#L15 If you like, I can create a pull request with your changes, but if you are interested in doing this you can also contribute: just fork the mmlspark repo, make the code changes, and send out a PR. Please let me know what you prefer. Thanks!

imatiach-msft commented 6 years ago

@ekaterina-sereda-rf one nit-pick: could you remove the "_" in the parameter names to be consistent with the other parameters? Also, regarding your question about performance: I've only tested this on the Higgs dataset, as recommended by @guolinke, and the results showed that LightGBM was 10-30% faster than GBTClassifier and had a 15% better AUC. I'm sure we will need to do more performance testing in the future, but if you see any issues on any datasets please let me know.

katerinagl commented 6 years ago

Thank you for your answers! Yes, I had not added them to TrainParams before, so I think that will help. I created PR https://github.com/Azure/mmlspark/pull/282 ; please have a look and let me know if I need to fix something. And about performance: I'll let you know when we have fresh results.

peay commented 6 years ago

Ah, I see -- I was a bit too quick to assume executor collisions then: the fact that executors were running on the same node may have been a coincidence.

I've investigated a bit more, and here is a more accurate description of what I ran into:

In all cases, the error is

Py4JJavaError: An error occurred while calling o5186.fit.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 8 in stage 213.1 failed 4 times, most recent failure: Lost task 8.3 in stage 213.1 (TID 11285, 10.233.78.219, executor 0): java.lang.Exception: Network init call failed in LightGBM with error: Binding port 12400 failed
    at com.microsoft.ml.spark.LightGBMUtils$.validate(LightGBMUtils.scala:16)
    at com.microsoft.ml.spark.TrainUtils$.trainLightGBM(TrainUtils.scala:141)
    at com.microsoft.ml.spark.LightGBMRegressor$$anonfun$1.apply(LightGBMRegressor.scala:70)
    at com.microsoft.ml.spark.LightGBMRegressor$$anonfun$1.apply(LightGBMRegressor.scala:70)
    at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$6.apply(objects.scala:186)
    at org.apache.spark.sql.execution.MapPartitionsExec$$anonfun$6.apply(objects.scala:183)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

The failing executor varies -- executor 0 here, but sometimes others. I'm running about 10 executors most of the time. Pinning it down to only one executor, I am not able to reproduce, but that may also be by chance.

I will try to investigate more this week, and get to the logs of the executors.

This is a Spark standalone cluster, running Spark 2.1.1 on-premise on top of kubernetes.

imatiach-msft commented 6 years ago

@peay one possibility is that the port number, 12400, is actually in use. Currently there is no logic to find open ports on the workers; each worker just opens port ("defaultListenerPort" + executor id). The parameter is defined here: https://github.com/Azure/mmlspark/blob/master/src/lightgbm/src/main/scala/LightGBMParams.scala#L19 and the logic to calculate the port number is here: https://github.com/Azure/mmlspark/blob/master/src/lightgbm/src/main/scala/LightGBMUtils.scala#L91 One workaround may be to set the default listener port parameter to a slightly higher value, e.g. 12420, and see if that makes any difference.

Also, the way I get the executors is via the blockManager, if the number of executors <= num partitions, after doing a coalesce on the dataset to the number of executors: https://github.com/Azure/mmlspark/blob/master/src/lightgbm/src/main/scala/LightGBMUtils.scala#L59 This is then followed by a mapPartitions to do the actual training. Otherwise, if num executors > num partitions, I do a mapPartitions twice: the first time to get the executor ids and the second time to do the training logic: https://github.com/Azure/mmlspark/blob/master/src/lightgbm/src/main/scala/LightGBMUtils.scala#L128 This only works if the same executors are run both times, which in my testing was always the case.

How many partitions do you have in your dataset? If it is >= # executors, could you also try to repartition to (# executors - 1) partitions and see if the training works? I think it may be more reliable to do the mapPartitions twice, but unfortunately there is an execution-time performance cost.
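
A minimal sketch of that workaround in pyspark, assuming the Python wrapper exposes the defaultListenerPort parameter defined in LightGBMParams.scala above:

from mmlspark import LightGBMClassifier

# Shift the base port away from 12400 in case something on the worker
# already holds it; each executor then binds (12420 + its executor id).
model = LightGBMClassifier(defaultListenerPort=12420).fit(train_df)  # train_df: your Spark DataFrame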

imatiach-msft commented 6 years ago

@iainmillar23 @ekaterina-sereda-rf v0.12 has been released for spark 2.2.1 if you want to try it out

imatiach-msft commented 6 years ago

@peay I'm going to try to reproduce the issue, would you be able to give me your yaml file too? I will try to create the cluster on AKS (I've been using HDInsight so far for testing).

deliverator23 commented 6 years ago

Does v0.12 require Spark 2.2.1, or will it work with Spark 2.1.0?

imatiach-msft commented 6 years ago

@iainmillar23 It should work with 2.1.0; although our source code uses 2.2.1, they are backwards compatible. However, once we move to 2.3.0 it won't be, because OneHotEncoder was changed to OneHotEncoderEstimator.

imatiach-msft commented 6 years ago

@iainmillar23 @ekaterina-sereda-rf sorry, it looks like we haven't yet uploaded to Spark Packages; we will be doing that later today, so some of the installation instructions might not work yet. For example, this does not work yet:

spark-shell --packages Azure:mmlspark:0.12

imatiach-msft commented 6 years ago

@peay I tried the latest 0.12 on HDInsight with LightGBMClassifier on the Higgs dataset and it worked fine; I'm still trying to create the AKS cluster. One useful log to look for in the YARN UI might be the list of nodes and ports:

18/04/17 16:03:48 INFO LightGBMClassifier: Nodes used for LightGBM: 10.0.0.10:12454,10.0.0.11:12458,10.0.0.13:12453,10.0.0.14:12451,10.0.0.14:12459,10.0.0.15:12455,10.0.0.16:12452,10.0.0.16:12460,10.0.0.17:12457,10.0.0.18:12456

This might help diagnose the issue. You should see 10 entries if you are using 10 executors.

imatiach-msft commented 6 years ago

@iainmillar23 @ekaterina-sereda-rf the spark package has been uploaded, so the release is complete, in case you want to try out LightGBMClassifier/Regressor

ekaterina-sereda-rf commented 6 years ago

@imatiach-msft Thank you very much :) I already tried it today, and I have new questions.

  1. In general, it seems we also saw Network init call failed in LightGBM with error: Binding port 12400 failed at com.microsoft.ml.spark.LightGBMUtils$.validate(LightGBMUtils.scala:16) in some cases when we used more than one machine, but I can't describe the conditions now (I will if I see it again).
  2. Here and here, it seems you parallelize everything by the number of executors but not by the number of cores -- is there a reason for this in the internal computation? I changed it manually to the number of cores we use, and it works much, much faster; we are now testing accuracy. Anyway, thank you very much for your great work.