momijiame opened this issue 1 year ago
@momijiame Thanks for the very in-depth benchmarking. In LightGBM, we don't differentiate between different types of CPU cores, such as the performance and efficiency cores on Apple Silicon; every thread is treated equally. Could that be a reason why not all cores on Apple Silicon are fully utilized?
@shiyu1994 I do not know what is causing this issue. Although it is not an exact comparison, the Intel Core i7-12700 CPU (8 performance cores + 4 efficient cores), which uses the same kind of heterogeneous architecture, is able to fully utilize its CPU cores under Linux. Sorry if I am missing the point, but I think comparing the source code with XGBoost's might give some clues, since both are GBDT frameworks that use OpenMP.
Description
I have read some tickets and understand that LightGBM is not optimized for Apple Silicon. But I decided to report on my findings in the hope that this report might be of use to users and development teams. Apologies if this is a known issue.
I found some interesting behavior with the Apple M2 Pro SoC on macOS. By default, the CPU load average during training is not high on that SoC, and it does not seem to achieve the best performance. More specifically, I also have an Apple M1 SoC machine and assumed that with the increase in core count (8 → 12 cores) there would be an increase in performance, but there was not.
After some trials, I found that specifying
OMP_WAIT_POLICY=active
will improve performance (about 1.6x faster) on the Apple M2 Pro SoC. However, it does not work well on the Apple M1 SoC (i.e. training time is not reduced).

Reproducible example
I have prepared the following code for benchmarking. The following code uses scikit-learn to generate binary classification pseudo-data to benchmark LightGBM.
Here are the benchmark results for each environment.
Apple M2 Pro (4 Efficient Cores + 8 Performance Cores) w/ macOS
The environment is as follows.
First, let's look at the case of executing the code without specifying anything:
The benchmark result is 387 sec.
The CPU load history at this time is the following. Performance cores (cores 5-12) are used in waves at about 50-60%, but efficient cores (cores 1-4) are scarcely used.
Next, with OMP_WAIT_POLICY=active: the training time was reduced from 387 sec to 235 sec (about 1.6x faster).
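For reference, the variable has to be in the environment before the OpenMP runtime initializes, i.e. before lightgbm is first imported in the process. A minimal way to do that from Python (passing it on the command line works equally well; `benchmark.py` below is a placeholder name for the script):

```python
import os

# OMP_WAIT_POLICY is read once, when the OpenMP runtime starts up, so set
# it before the first `import lightgbm` in the process (or pass it on the
# command line instead: OMP_WAIT_POLICY=active python benchmark.py).
os.environ["OMP_WAIT_POLICY"] = "active"  # spin at barriers instead of sleeping
```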
When executed this way, the CPU cores go into a spin loop. The CPU load history at this time is the following.
FYI, if an additional
OMP_NUM_THREADS=11
is specified, the time is reduced a little more. (NOTE: 11 is the number of cores of the Apple M2 Pro SoC minus 1.)

Apple M1 (4 Efficient Cores + 4 Performance Cores) w/ macOS
Next, I compared with Apple M1 SoC. The environment is as follows.
Run without specifying anything:
The result is 428 sec. Despite the number of performance cores being half that of the Apple M2 Pro, the training time is not significantly different (428 sec vs 387 sec).
The CPU load history at this time is the following. It shows roughly the same trend as the Apple M2 Pro.
And
OMP_WAIT_POLICY=active
does not work well on the Apple M1: training time is longer than when it is not specified, even though the CPU cores do properly go into a spin loop.
Intel Core i7-8700B (6 Cores with SMT) w/ macOS
Since I can also use an Intel Mac, I compared it for reference.
Training time is 420 sec, close to the Apple M1. Given typical CPU benchmark scores, this result feels off.
The CPU load history at this time is the following. All cores seem to be fully utilized by default.
Environment info
LightGBM version or commit hash: 3.3.5
Command(s) you used to install LightGBM
Additional Comments
I was curious whether this behavior was limited to LightGBM, so I ran the same type of benchmark with XGBoost. The environment was the Apple M2 Pro SoC with macOS.
I did not think it would be meaningful to compare training times directly, so I observed the CPU load history instead.
The CPU load history at this time is the following.
Even XGBoost does not seem to be able to utilize all CPU cores. Performance cores show high-frequency waves somewhat similar to LightGBM. However, efficient cores are properly utilized.
Please let me know if there is anything I should investigate further. I apologize if I have fundamentally misunderstood something.