guolinke opened this issue 7 years ago
Yeah, for some reason ggplot2 does not print the other scales (but the scales are different).
Yeah, the main reason is non-NUMA awareness.
Maybe I should try the parallel version. You run multiple instances on the same server confined to different sockets, e.g. with taskset, right? I wonder if this makes things faster...
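On Linux, the pinning that `taskset` does from the shell can also be done from inside the process via `os.sched_setaffinity`. A minimal sketch; the assumption that CPUs 0-3 belong to the first socket is for illustration only (check the real mapping with `lscpu`):

```python
import os

# CPUs assumed to belong to the first socket -- illustrative only;
# verify the real socket-to-CPU mapping with `lscpu` or /proc/cpuinfo.
socket0_cpus = {0, 1, 2, 3}

# Restrict this process (pid 0 = self) to those CPUs, much like
# `taskset -c 0-3 <cmd>` would for a newly launched process.
available = os.sched_getaffinity(0)
os.sched_setaffinity(0, socket0_cpus & available or available)

print(sorted(os.sched_getaffinity(0)))
```

Launching one such pinned process per socket, each training on its own copy of the data, is the "multiple confined instances" setup described above.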
Yeah, I think using MPI can solve this: https://stackoverflow.com/questions/17604867/processor-socket-affinity-in-openmpi .
TODO: @ogrisel suggested https://github.com/dask/dask-xgboost
While NUMA itself can make things worse, it might also simply be that fitting gradient boosted decision trees with the fast histogram binning trick is not arithmetically intensive enough to benefit from more than 16 physical cores on today's CPUs: RAM bandwidth may be the limit beyond 16 workers on that machine, even if each worker addresses only chunks of data lying on its optimal NUMA node.
@szilard note that dask-xgboost is just a convenient way to start and access xgboost distributed on a cluster assuming you already have a dask distributed cluster up and running. Alternatively xgboost can also be used in distributed mode with Spark or via the CLI interface on a Hadoop cluster on AWS for instance.
@ogrisel Yeah, I agree, GBM is memory-bandwidth limited, but the hardware bandwidth limit usually increases (sub-linearly, of course) with the number of threads. To see the above effect, we could run GBM on the same NUMA node with an increasing number of threads. The results in this repo already show that using hyperthreaded cores makes performance worse, so one would need to scale the thread count up to the max number of "real" cores on 1 socket (NUMA node). Unfortunately, that max is at most 16 for all AWS instances. I actually did runs with 4, 8, and 16 threads and you can see sub-linear speed-up, but we don't know what would happen with 32 threads (on 32 "real" cores on 1 socket). A quick Google search shows there are commodity processors with 24 cores, but I have no access to such hardware: https://www.intel.com/content/www/us/en/products/processors/xeon/e7-processors.html
The results on 4, 8, 16 cores:

```
r4_8x:lightgbm:4:0-3:0.9443
r4_8x:lightgbm:8:0-7:0.6447
r4_8x:lightgbm:16:0-15:0.5293
```
All results here.
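Reading the last field of each result line as wall-clock training time (my interpretation, not stated in the log format), the sub-linear scaling works out as follows:

```python
# Timings from the runs above (threads -> time); interpreting the last
# field of each result line as wall-clock time is an assumption here.
times = {4: 0.9443, 8: 0.6447, 16: 0.5293}

base_threads = 4
base_time = times[base_threads]

for threads, t in sorted(times.items()):
    speedup = base_time / t             # relative to the 4-thread run
    ideal = threads / base_threads      # what linear scaling would give
    efficiency = speedup / ideal
    print(f"{threads:>2} threads: speedup {speedup:.2f}x "
          f"(ideal {ideal:.0f}x, efficiency {efficiency:.0%})")
```

Going from 4 to 16 threads buys well under the ideal 4x, which is consistent with a bandwidth-bound workload.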
The LightGBM experiment reports a near-linear speed-up with up to 16x16 threads when running in a distributed setup:
https://github.com/Microsoft/LightGBM/wiki/Experiments#parallel-experiment
So indeed RAM bandwidth is probably the limit in your case.
Yeah, that's on a gazillion records :) Most people usually work with 100K-1M-10M records (I think :) ). I used 10M records, and at that size you might not even get any speedup; one needs to try (a while ago I compared 1-node RF vs 5 nodes with h2o and there was no speedup).
It would also be interesting to try MPI over processes that run on the same server but confined to different NUMA nodes (as mentioned above). Even if NUMA sucks, it should still be faster than a network (assuming MPI would use shared memory for messaging).
> it should still be faster than a network.
The data would not move over the network while training: each worker typically has access to a local copy of the data. But the aggregate RAM bandwidth of the cluster can be much larger than what you get on a single machine, hence the close-to-linear scalability.
Well, there is a cost to move the histograms over the network. Of course lightgbm/xgboost might do it differently, but 2 yrs ago on h2o: https://groups.google.com/forum/#!msg/h2ostream/bnyhPyxftX8/iU2qi14d9KAJ
Update: Sorry, the forum thread starts a bit higher up, please scroll up after going to the above link.
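A rough back-of-the-envelope for that histogram traffic; every number below is an illustrative assumption, not a measurement from lightgbm/xgboost:

```python
# Sketch of how much histogram data a worker might exchange.
# All parameters are illustrative assumptions.
n_features = 100      # columns in the training data
n_bins = 255          # histogram bins per feature (a common default)
bytes_per_bin = 12    # e.g. gradient sum + hessian sum + count

per_node = n_features * n_bins * bytes_per_bin   # one tree node's histograms
n_nodes = 255                                    # nodes per tree (assumed)
n_trees = 500

per_tree_mb = per_node * n_nodes / 1e6
total_gb = per_tree_mb * n_trees / 1e3
print(f"~{per_node / 1e3:.0f} KB per node, ~{per_tree_mb:.0f} MB per tree, "
      f"~{total_gb:.1f} GB over {n_trees} trees")
```

Under these made-up numbers the aggregate traffic reaches tens of GB over a full training run, which is why implementations try to compress or reduce histogram exchanges rather than ship them naively.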
A 10M-row dataset is too small for that many cores; returns diminish quickly: https://public.tableau.com/views/GradientBoostedTreesMegaBenchmark2/TabularTime?:showVizHome=no
Yeah, I was actually thinking I should repeat all that with 10x the dataset (maybe also 100x). But still, I think most people's use cases are datasets <10M records.
@guolinke Do you know how to use LightGBM MPI locally in R?
@szilard yeah, 10M is a lot. Most cases are around 1K to 1M.
@Laurae2 I think the parallel version is only supported in the CLI version for now.
But it could be supported if we expose the c_api for network initialization: https://github.com/Microsoft/LightGBM/blob/master/src/application/application.cpp#L191
I can only see the unit of xgboost.
I think the main problem with NUMA is memory access, and lightgbm / xgboost are not NUMA-aware.
For multi-socket systems, I think the better solution is running the parallel version of lightgbm/xgboost.
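For reference, a data-parallel run with the CLI version is configured roughly like this (parameter names as in the LightGBM docs; the data file, port, and machine list are placeholders, and older versions spell the last option `machine_list_file`):

```
# train.conf -- one copy per machine, sketch only
task = train
data = train.data
tree_learner = data                  # data-parallel mode (also: feature, voting)
num_machines = 2
local_listen_port = 12400
machine_list_filename = mlist.txt    # lines of "ip port", one per machine
```

Each machine then runs `lightgbm config=train.conf` against its local shard of the data.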