szilard / GBM-multicore

GBM multicore scaling: h2o, xgboost and lightgbm on multicore and multi-socket systems

Root cause (NUMA), try MPI solutions #1

Open guolinke opened 7 years ago

guolinke commented 7 years ago

In the plot I can only see the axis units for the xgboost panel.

I think the main problem with NUMA is the memory access, and LightGBM / XGBoost are not NUMA-aware.

For multi-socket systems, I think the better solution is to run the parallel (distributed) version of LightGBM/XGBoost.

szilard commented 7 years ago

Yeah, for some reason ggplot2 does not print the scales on the other panels (but the scales are different).

Yeah, the main reason is the lack of NUMA awareness.

Maybe I should try the parallel version. You mean running multiple instances on the same server, each confined to a different socket, e.g. with taskset, right? I wonder if that makes things faster...

guolinke commented 7 years ago

Yeah, I think using MPI can solve this: https://stackoverflow.com/questions/17604867/processor-socket-affinity-in-openmpi .

szilard commented 7 years ago

TODO: @ogrisel suggested https://github.com/dask/dask-xgboost

ogrisel commented 7 years ago

While NUMA itself can make things worse, it might also just be that fitting gradient boosted decision trees with the fast histogram binning trick is not arithmetically intensive enough to benefit from more than 16 physical cores on today's CPUs: RAM bandwidth may be the limit beyond 16 workers on that machine, even if each worker only addresses chunks of data lying on its optimal NUMA node.
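To make that concrete, here is a rough back-of-envelope sketch of the memory traffic involved (all numbers are illustrative assumptions, not measurements from this benchmark):

```python
# Rough back-of-envelope for histogram-based GBM: how much data one
# histogram-building pass has to stream from RAM vs. how little arithmetic
# is done per byte. All numbers are illustrative assumptions.
n_rows = 10_000_000
n_features = 100
bytes_per_bin = 1            # binned feature value (uint8)
bytes_per_grad_hess = 4 + 4  # float32 gradient + hessian per row

bytes_per_pass = n_rows * (n_features * bytes_per_bin + bytes_per_grad_hess)
ram_bandwidth = 70e9         # ~70 GB/s per socket, assumed

print(f"data streamed per histogram pass: {bytes_per_pass / 1e9:.1f} GB")
print(f"bandwidth-only lower bound per pass: "
      f"{bytes_per_pass / ram_bandwidth * 1e3:.0f} ms")
# Only a few additions happen per byte read, so the cores end up waiting on
# memory well before all of them are busy: low arithmetic intensity.
```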

ogrisel commented 7 years ago

@szilard note that dask-xgboost is just a convenient way to start and access xgboost distributed on a cluster assuming you already have a dask distributed cluster up and running. Alternatively xgboost can also be used in distributed mode with Spark or via the CLI interface on a Hadoop cluster on AWS for instance.
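For reference, a minimal sketch of what that looks like (the scheduler address is a placeholder and the dask-xgboost call is quoted from memory, so treat it as an illustration rather than the exact API):

```python
import dask.array as da
import dask_xgboost as dxgb
from dask.distributed import Client

# Connect to an already running dask.distributed cluster
# (placeholder scheduler address).
client = Client("tcp://scheduler-address:8786")

# Synthetic data, chunked so it is spread across the workers.
X = da.random.random((10_000_000, 100), chunks=(1_000_000, 100))
y = (da.random.random(10_000_000, chunks=1_000_000) > 0.5).astype(int)

params = {"objective": "binary:logistic", "max_depth": 6, "eta": 0.1}

# dask-xgboost starts xgboost workers on the dask cluster and feeds them
# the locally held chunks of X and y; it returns a regular Booster.
bst = dxgb.train(client, params, X, y, num_boost_round=100)
```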

szilard commented 7 years ago

@ogrisel Yeah, I agree, GBM is memory-bandwidth limited, but the bandwidth limit of the hardware usually increases with the number of threads (sub-linearly, of course). To see the above effect, we could run GBM on the same NUMA node with an increasing number of threads. The results in this repo already show that using hyperthreaded cores gives worse performance, so one should scale the number of threads only up to the number of "real" cores on 1 socket (NUMA node). Unfortunately, that maximum is at most 16 for all AWS instances. I actually did runs with 4, 8 and 16 threads and you can see sub-linear speed-up, but we don't know what would happen with 32 threads (on 32 "real" cores on 1 socket). A quick Google search shows there are commodity processors with 24 cores, but I have no access to such: https://www.intel.com/content/www/us/en/products/processors/xeon/e7-processors.html
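The runs look roughly like this (a sketch rather than the actual benchmark scripts in this repo: synthetic data, a Linux-only affinity call instead of taskset, and the assumption that cores 0-15 are the physical cores of socket 0):

```python
import os
import time

import lightgbm as lgb
import numpy as np

# Synthetic stand-in for the 10M-row benchmark data.
X = np.random.rand(10_000_000, 10)
y = np.random.randint(0, 2, size=X.shape[0])

for n_threads in (4, 8, 16):
    # Pin the process (and LightGBM's OpenMP threads) to the first
    # n_threads physical cores, the programmatic equivalent of
    # `taskset -c 0-{n_threads-1}`. Linux only.
    os.sched_setaffinity(0, set(range(n_threads)))
    dtrain = lgb.Dataset(X, label=y)
    params = {"objective": "binary", "num_threads": n_threads}
    t0 = time.time()
    lgb.train(params, dtrain, num_boost_round=100)
    print(f"{n_threads} threads: {time.time() - t0:.1f}s")
```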

szilard commented 7 years ago

The results on 4,8,16 cores:

r4_8x:lightgbm:4:0-3:0.9443
r4_8x:lightgbm:8:0-7:0.6447
r4_8x:lightgbm:16:0-15:0.5293

All results here.

ogrisel commented 7 years ago

The LightGBM parallel experiment reports a near-linear speed-up up to 16x16 threads when running in a distributed setup:

https://github.com/Microsoft/LightGBM/wiki/Experiments#parallel-experiment

So indeed RAM bandwidth is probably the limit in your case.

szilard commented 7 years ago

Yeah, that's on a gazillion records :) Most people usually work with 100K-1M-10M records (I think :) ). I used 10M records, and at that size you might not even get any speedup; one needs to try (a while ago I ran RF on 1 node vs 5 nodes with h2o and there was no speedup).

It would also be interesting to try MPI over processes that run on the same server but are confined to different NUMA nodes (as mentioned above). Even if NUMA sucks, it should still be faster than going over a network (assuming MPI uses shared memory for messaging).
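A quick way to sanity-check that would be a toy allreduce of histogram-sized buffers between two ranks bound to different sockets, something like the sketch below (the buffer size and the OpenMPI launch line are assumptions):

```python
# Assumed launch (OpenMPI): mpirun -np 2 --bind-to socket python mpi_histo_toy.py
# Each rank lands on a different socket; messages between them should go
# through shared memory rather than the network stack.
import time

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Roughly histogram-sized buffer: e.g. 100 features x 255 bins x
# (gradient, hessian) doubles -- the exact size is just an assumption.
local_hist = np.random.rand(100 * 255 * 2)
total_hist = np.empty_like(local_hist)

comm.Barrier()
t0 = time.time()
for _ in range(1000):
    # An allreduce per tree level is roughly what data-parallel GBM does.
    comm.Allreduce(local_hist, total_hist, op=MPI.SUM)
elapsed = time.time() - t0

if rank == 0:
    mb = local_hist.nbytes / 1e6
    print(f"1000 allreduces of {mb:.1f} MB buffers took {elapsed:.2f}s")
```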

ogrisel commented 7 years ago

"it should still be faster than a network."

The data would not move over the network while training: each worker typically has access to a local copy of the data. But the aggregate RAM bandwidth of the cluster can be much larger than what you get on a single machine, hence the close-to-linear scalability.

szilard commented 7 years ago

Well, there is a cost to moving the histograms over the network. Of course LightGBM/XGBoost might do it differently, but here is how it looked on h2o 2 years ago: https://groups.google.com/forum/#!msg/h2ostream/bnyhPyxftX8/iU2qi14d9KAJ

Update: Sorry, the forum thread starts a bit higher up, please scroll up after going to the above link.

Laurae2 commented 7 years ago

A 10M-row dataset is too small for that many cores; you quickly get diminishing returns: https://public.tableau.com/views/GradientBoostedTreesMegaBenchmark2/TabularTime?:showVizHome=no

szilard commented 7 years ago

Yeah, I was actually thinking I should repeat all that with 10x the dataset (maybe also 100x). But still, I think most people's use cases are datasets <10M records.

Laurae2 commented 7 years ago

@guolinke Do you know how to use LightGBM MPI locally in R?

@szilard yeah, 10M is a lot. Most cases are around 1K to 1M.

guolinke commented 7 years ago

@Laurae2 I think parallel training is only supported in the CLI version for now.

But it could be supported if we exposed the c_api for network initialization: https://github.com/Microsoft/LightGBM/blob/master/src/application/application.cpp#L191