microsoft / SynapseML

Simple and Distributed Machine Learning
http://aka.ms/spark

why is the training speed of lightgbm on spark so slow compared with the local version? #751

Open janelu9 opened 4 years ago

janelu9 commented 4 years ago

How to improve the speed of lightgbm on spark?

bobbychen2000 commented 4 years ago

+1 on this question.

My training data has 600k rows and fewer than 100 columns. I tried running on Spark using the mmlspark package v1.0 with 1 driver node, 1 executor node, and 8 cores, and the Spark version takes significantly longer to run through 1 boosting iteration compared to the local LightGBM package. Performance per boosting iteration is much worse if I use multiple executor nodes.

Would love to learn the reason for this, and hopefully get some pointers to get around the issue.

bobbychen2000 commented 4 years ago

@imatiach-msft Do you mind helping explain this?

Also wondering if there's a way to use the Spark interface but trigger local training, if it's indeed slow (because of, say, network communication)? I've tried forcing the number of partitions to 1 and limiting numCoresPerExec, but that doesn't seem to help.

louis925 commented 4 years ago

You can try using pandas_udf() with the Python version of LightGBM to train. It is much faster, and you can still parallelize if you are doing hyperparameter tuning.
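
A minimal sketch of that idea, using the grouped-map flavor of pandas UDFs (groupBy().applyInPandas in Spark 3): cross-join the training data with a hypothetical parameter grid so each Spark task trains one local LightGBM model. Column names, the grid, and the input path are placeholders, not taken from this thread.

```python
# Sketch only: train one local (non-distributed) LightGBM model per
# hyperparameter value, in parallel across Spark tasks.
import pandas as pd
import lightgbm as lgb
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, LongType, DoubleType

spark = SparkSession.builder.getOrCreate()

result_schema = StructType([
    StructField("num_leaves", LongType()),
    StructField("train_mae", DoubleType()),
])

def train_one_config(pdf: pd.DataFrame) -> pd.DataFrame:
    # Each group holds the full training data plus one candidate num_leaves value.
    num_leaves = int(pdf["num_leaves"].iloc[0])
    features = pdf.drop(columns=["label", "num_leaves"])
    booster = lgb.train(
        {"objective": "regression", "metric": "mae",
         "num_leaves": num_leaves, "verbosity": -1},
        lgb.Dataset(features, label=pdf["label"]),
        num_boost_round=100,
    )
    # Training-set MAE, for illustration only.
    mae = float(abs(pdf["label"] - booster.predict(features)).mean())
    return pd.DataFrame({"num_leaves": [num_leaves], "train_mae": [mae]})

train_df = spark.read.parquet("train.parquet")  # placeholder input path
param_grid = spark.createDataFrame([(31,), (63,), (127,)], ["num_leaves"])

results = (train_df.crossJoin(param_grid)       # replicate the data once per grid value
           .groupBy("num_leaves")
           .applyInPandas(train_one_config, schema=result_schema))
results.show()
```

This keeps each model on a single machine, avoiding the distributed network setup cost discussed later in this thread, at the price of replicating the data once per grid point.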

siyuzhiyue commented 3 years ago

+1 on this question.


I have the same problem!!!

imatiach-msft commented 3 years ago

@siyuzhiyue have you tried setting numTasks = number of machines, so that one task runs on each machine? What is your cluster configuration, and what are your lightgbm parameters? I've found that for some large LightGBM parameter values, reducing numTasks to equal the total number of machines helps reduce memory load and sometimes improves performance. Not sure if it will help in your case though.
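
For illustration, a minimal sketch of that suggestion; the worker count, column names, and estimator here are assumptions, not taken from this thread:

```python
# Sketch only: set numTasks to the number of worker machines so that exactly
# one LightGBM task runs on each machine.
from synapse.ml.lightgbm import LightGBMClassifier

num_machines = 4  # hypothetical number of worker nodes in the cluster

lgbm = LightGBMClassifier(
    labelCol="label",
    featuresCol="features",
    numTasks=num_machines,  # one LightGBM network participant per machine
)
model = lgbm.fit(train_df)  # train_df: DataFrame with an assembled "features" vector column
```

Fewer, larger tasks means fewer participants in LightGBM's network setup, at the cost of less parallelism within each machine's data partition.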

zhangpanfeng commented 2 years ago

hi @imatiach-msft, my case is similar: with one executor the training time is 5 minutes, with two executors it is longer than 5 minutes but very close, and with four executors it is about 15 minutes.

My environment is: Spark 3.2.0, SynapseML 0.9.4, instance type m5.9xlarge (18 cores, 36 threads, 72 GB memory).

My Spark config is:

```
--driver-memory 4g --executor-cores 34 --num-executors {executor_num} --executor-memory 50g
```

The LightGBM config is:

```python
from synapse.ml.lightgbm import LightGBMRegressor

model = LightGBMRegressor(objective='quantile',
                          labelCol='sold_rolling',
                          alpha=0.2,
                          learningRate=0.3,
                          metric='quantile',
                          featureFraction=0.5,
                          numLeaves=210-1,
                          minDataInLeaf=26-1,
                          maxDepth=-1,
                          maxBin=500,
                          numIterations=1000,
                          boostFromAverage=True,
                          verbosity=2,
                          earlyStoppingRound=100,
                          useSingleDatasetMode=False,  # I have tried setting this to True, but performance is worse
                          # chunkSize=5000,
                          numThreads=17,  # real CPU cores are 18, hyper-threading gives 36
                          numTasks={executor_num})
```

I also see some logs, maybe they are helpful for you:

```
4 Nodes: [LightGBM] [Info] Finished linking network in 737.569538 seconds
2 Nodes: [LightGBM] [Info] Finished linking network in 109.627021 seconds
```

Maybe it is network communication time; I am not sure.

zhangpanfeng commented 2 years ago

I have tried SynapseML 0.9.5; for my case it is similar to, or even worse than, 0.9.4.

imatiach-msft commented 2 years ago

@zhangpanfeng how large is your dataset? Indeed, for very small datasets the network communication overhead will actually make training slower; this is common knowledge. If you have a small dataset, it doesn't make sense to use Spark or distributed compute, since the communication overhead will usually just make things slower.

imatiach-msft commented 2 years ago

@zhangpanfeng also, your driver memory is really low; I would recommend a higher value, based on:

--driver-memory 4g --executor-cores 34 --num-executors {executor_num} --executor-memory 50g

Also, this setting is a bit strange. You have 18 cores and 72 GB of memory per machine, yet you are using 34 cores overall (which is just 2 machines). I think you can increase executor cores and decrease memory to run more tasks on your cluster. Does each task use 1 core? That would speed things up a lot, since you would use a lot more cores.
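
As a minimal sketch of one way to read that suggestion (all numbers are assumptions for an 18-core, 72 GB worker; driver memory itself would normally be raised via --driver-memory on spark-submit):

```python
# Hypothetical illustration only: one task per core on an 18-core / 72 GB worker.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.executor.cores", "18")    # all physical cores per executor
         .config("spark.task.cpus", "1")          # one core per task (the Spark default)
         .config("spark.executor.memory", "50g")  # leave headroom for LightGBM's native memory
         .getOrCreate())
```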

zhangpanfeng commented 2 years ago

hi @imatiach-msft, thank you. But as you know, the value of --executor-cores is vCores, not real CPU cores, while numThreads in SynapseML refers to real CPU cores. How can I make sure they match each other, and how should I set numTasks in SynapseML?

> Does each task use 1 core? That would speed things up a lot since you would use a lot more cores.

The LightGBM official website recommends setting numThreads to the number of real CPU cores minus 1 in a distributed environment. So does "each task uses 1 core" mean I should run 17 tasks per executor and set numThreads=1, to make sure each task uses 1 core? Or, if that 1 core is 1 vCore and I run 34 tasks per executor, how should I set numThreads?

timmhuang commented 1 year ago

Checking in to see if anyone has made any new discoveries. I recently started experimenting with this on Databricks, but I see only up to 20% of the CPUs in use at once, even after tweaking numThreads and repartitioning the training data...

shuDaoNan9 commented 1 year ago

In local mode, training pushes the local machine's CPU to 100% utilization, so it finishes quickly. With distributed training, if you log in to each slave node and check CPU utilization, you will find it is very low (Spark 3 improves this somewhat); you can't just look at the parallelism shown in the Spark history server. The finer you split the data, the higher the communication overhead.

victor-cattani-beegol commented 4 months ago

My dataset has about 400 million rows with 20 features (10 continuous and 10 categorical), and I'm also facing performance problems fitting the model (it is taking very long to train). Do you have any advice on that? Could it have something to do with the validation set (it has about 1.5 million rows)?

I'm using a single-node cluster on Databricks with about 140 GB of RAM and 20 cores.

LightGBM Parameters:

model_cluster_0 = LightGBMRegressor(metric = 'mae', earlyStoppingRound=1, labelCol='target', validationIndicatorCol='validation_col', useSingleDatasetMode=True, numThreads=20).setParams(**dic_params_reg_model_0).fit(train_0)

@imatiach-msft could you help with this issue? Thanks!