microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/
MIT License
16.74k stars 3.84k forks

Performance issue on Apple M2 Pro w/ macOS #5775

Open momijiame opened 1 year ago

momijiame commented 1 year ago

Description

I have read some tickets and understand that LightGBM is not optimized for Apple Silicon. But I decided to report on my findings in the hope that this report might be of use to users and development teams. Apologies if this is a known issue.

I found some interesting behavior with the Apple M2 Pro SoC on macOS. By default, the CPU load during training is not high on that SoC, and training does not seem to achieve the best possible performance. More specifically, I also have an Apple M1 machine and assumed that the increase in core count (8 -> 12 cores) would bring a corresponding increase in performance, but it did not.

After some trials, I found that specifying OMP_WAIT_POLICY=active improves performance (about 1.6x faster) on the Apple M2 Pro SoC. However, it does not work well on the Apple M1 SoC (training time is not reduced).
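
For reproducibility, the variable can also be exported from inside the script rather than the shell. A minimal sketch (OMP_WAIT_POLICY is a standard OpenMP environment variable; the key point is that it must be set before the first `import lightgbm`, because the OpenMP runtime reads it only once at initialization):

```python
import os

# OMP_WAIT_POLICY is read once, when the OpenMP runtime initializes,
# so it must be in the environment before lightgbm is first imported.
os.environ["OMP_WAIT_POLICY"] = "active"

# import lightgbm as lgb  # must come after the assignment above
```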

Reproducible example

I prepared the following code for benchmarking. It uses scikit-learn to generate pseudo-data for a binary classification task and then benchmarks LightGBM on it.

import contextlib
import logging
import time

import lightgbm as lgb
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold

LOGGER = logging.getLogger(__name__)

@contextlib.contextmanager
def stopwatch():
    start = time.time()
    LOGGER.info("start: LightGBM")

    yield

    end = time.time()
    LOGGER.info("end: LightGBM")

    elapsed = end - start
    LOGGER.info("elapsed time: %.2f sec", elapsed)

def main():
    log_fmt = (
        "%(process)d %(threadName)s %(name)s %(levelname)s %(message)s"
    )
    logging.basicConfig(level=logging.INFO, format=log_fmt)

    args = {
        "n_samples": 100_000,
        "n_features": 100,
        "n_informative": 20,
        "n_redundant": 0,
        "n_repeated": 0,
        "class_sep": 0.65,
        "n_classes": 2,
        "random_state": 42,
        "shuffle": False,
    }
    x, y = make_classification(**args)

    folds = StratifiedKFold(
        n_splits=5,
        shuffle=True,
        random_state=42,
    )

    lgb_train = lgb.Dataset(x, y)
    lgb_params = {
        "objective": "binary",
        "metric": "binary_logloss",
        "seed": 42,
        "deterministic": True,
        "verbose": -1,
    }

    callbacks = [
        lgb.log_evaluation(
            period=1_000,
            show_stdv=True,
        ),
    ]
    with stopwatch():
        lgb.cv(
            params=lgb_params,
            train_set=lgb_train,
            num_boost_round=10_000,
            callbacks=callbacks,
            folds=folds,
        )

if __name__ == "__main__":
    main()

Here are the benchmark results for each environment.

Apple M2 Pro (4 Efficient Cores + 8 Performance Cores) w/ macOS

The environment is as follows.

$ sysctl -a | grep brand_string
machdep.cpu.brand_string: Apple M2 Pro
$ uname -srm
Darwin 22.3.0 arm64
$ sw_vers                      
ProductName:        macOS
ProductVersion:     13.2.1
BuildVersion:       22D68

First, let's look at the result of running the code without setting any environment variables:

$ python benchmark.py 
4432 MainThread __main__ INFO start
[1000]  cv_agg's binary_logloss: 0.123578 + 0.00186002
[2000]  cv_agg's binary_logloss: 0.107131 + 0.00233912
[3000]  cv_agg's binary_logloss: 0.110841 + 0.00293728
[4000]  cv_agg's binary_logloss: 0.122046 + 0.00382834
[5000]  cv_agg's binary_logloss: 0.13687 + 0.00474103
[6000]  cv_agg's binary_logloss: 0.152749 + 0.00592566
[7000]  cv_agg's binary_logloss: 0.167346 + 0.00691806
[8000]  cv_agg's binary_logloss: 0.173766 + 0.00690977
[9000]  cv_agg's binary_logloss: 0.17714 + 0.00691034
[10000] cv_agg's binary_logloss: 0.179344 + 0.00697653
4432 MainThread __main__ INFO end
4432 MainThread __main__ INFO elapsed time: 387.20 sec

The benchmark result is 387 sec.

The CPU load history at this time is shown below. The performance cores (cores 5-12) are used in waves at about 50-60%, while the efficiency cores (cores 1-4) are scarcely used.

M2-Pro-Default

Next, with OMP_WAIT_POLICY=active:

$ OMP_WAIT_POLICY=active python benchmark.py 
79174 MainThread __main__ INFO start: LightGBM
[1000]  cv_agg's binary_logloss: 0.123578 + 0.00186002
[2000]  cv_agg's binary_logloss: 0.107131 + 0.00233912
[3000]  cv_agg's binary_logloss: 0.110841 + 0.00293728
[4000]  cv_agg's binary_logloss: 0.122046 + 0.00382834
[5000]  cv_agg's binary_logloss: 0.13687 + 0.00474103
[6000]  cv_agg's binary_logloss: 0.152749 + 0.00592566
[7000]  cv_agg's binary_logloss: 0.167346 + 0.00691806
[8000]  cv_agg's binary_logloss: 0.173766 + 0.00690977
[9000]  cv_agg's binary_logloss: 0.17714 + 0.00691034
[10000] cv_agg's binary_logloss: 0.179344 + 0.00697653
79174 MainThread __main__ INFO end: LightGBM
79174 MainThread __main__ INFO elapsed time: 235.09 sec

The time for learning has been reduced from 387 sec to 235 sec (about 1.6x faster).

When executed this way, the CPU cores enter a spin loop. The CPU load history at this time is shown below.

M2-Pro-OMP_WAIT_POLICY=active

FYI, if OMP_NUM_THREADS=11 is additionally specified, the time is reduced a little further (NOTE: 11 is the core count of the Apple M2 Pro SoC minus 1):

$ OMP_WAIT_POLICY=active OMP_NUM_THREADS=11 python benchmark.py
79284 MainThread __main__ INFO start: LightGBM
[1000]  cv_agg's binary_logloss: 0.123578 + 0.00186002
[2000]  cv_agg's binary_logloss: 0.107131 + 0.00233912
[3000]  cv_agg's binary_logloss: 0.110841 + 0.00293728
[4000]  cv_agg's binary_logloss: 0.122046 + 0.00382834
[5000]  cv_agg's binary_logloss: 0.13687 + 0.00474103
[6000]  cv_agg's binary_logloss: 0.152749 + 0.00592566
[7000]  cv_agg's binary_logloss: 0.167346 + 0.00691806
[8000]  cv_agg's binary_logloss: 0.173766 + 0.00690977
[9000]  cv_agg's binary_logloss: 0.17714 + 0.00691034
[10000] cv_agg's binary_logloss: 0.179344 + 0.00697653
79284 MainThread __main__ INFO end: LightGBM
79284 MainThread __main__ INFO elapsed time: 221.44 sec
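
The `11` above is specific to this machine. A small, hypothetical helper that derives the same value portably, assuming the goal is simply "all logical cores minus one, leaving one free for the main thread":

```python
import os

def omp_threads_leaving_one_core() -> int:
    # Leave one logical core free for the main thread / OS housekeeping.
    total = os.cpu_count() or 1
    return max(1, total - 1)

# Must be set before the OpenMP runtime (i.e. lightgbm) is loaded.
os.environ["OMP_NUM_THREADS"] = str(omp_threads_leaving_one_core())
```

LightGBM also accepts a `num_threads` parameter in the params dict, which may be a cleaner way to control this per-run than an environment variable.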

Apple M1 (4 Efficient Cores + 4 Performance Cores) w/ macOS

Next, I compared with Apple M1 SoC. The environment is as follows.

$ sysctl -a | grep brand_string
machdep.cpu.brand_string: Apple M1
$ uname -srm
Darwin 22.3.0 arm64
$ sw_vers                      
ProductName:        macOS
ProductVersion:     13.2.1
BuildVersion:       22D68

Run without specifying anything:

$ python benchmark.py 
50564 MainThread __main__ INFO start
[1000]  cv_agg's binary_logloss: 0.123578 + 0.00186002
[2000]  cv_agg's binary_logloss: 0.107131 + 0.00233912
[3000]  cv_agg's binary_logloss: 0.110841 + 0.00293728
[4000]  cv_agg's binary_logloss: 0.122046 + 0.00382834
[5000]  cv_agg's binary_logloss: 0.13687 + 0.00474103
[6000]  cv_agg's binary_logloss: 0.152749 + 0.00592566
[7000]  cv_agg's binary_logloss: 0.167346 + 0.00691806
[8000]  cv_agg's binary_logloss: 0.173766 + 0.00690977
[9000]  cv_agg's binary_logloss: 0.17714 + 0.00691034
[10000] cv_agg's binary_logloss: 0.179344 + 0.00697653
50564 MainThread __main__ INFO end
50564 MainThread __main__ INFO elapsed time: 428.36 sec

The result is 428 sec. Even though the M1 has half as many performance cores as the M2 Pro, the training time is not significantly different (vs 387 sec).

The CPU load history at this time is shown below. It shows roughly the same trend as the Apple M2 Pro.

M1-Default

And OMP_WAIT_POLICY=active does not work well on Apple M1.

$ OMP_WAIT_POLICY=active python benchmark.py
87996 MainThread __main__ INFO start: LightGBM
[1000]  cv_agg's binary_logloss: 0.123578 + 0.00186002
[2000]  cv_agg's binary_logloss: 0.107131 + 0.00233912
[3000]  cv_agg's binary_logloss: 0.110841 + 0.00293728
[4000]  cv_agg's binary_logloss: 0.122046 + 0.00382834
[5000]  cv_agg's binary_logloss: 0.13687 + 0.00474103
[6000]  cv_agg's binary_logloss: 0.152749 + 0.00592566
[7000]  cv_agg's binary_logloss: 0.167346 + 0.00691806
[8000]  cv_agg's binary_logloss: 0.173766 + 0.00690977
[9000]  cv_agg's binary_logloss: 0.17714 + 0.00691034
[10000] cv_agg's binary_logloss: 0.179344 + 0.00697653
87996 MainThread __main__ INFO end: LightGBM
87996 MainThread __main__ INFO elapsed time: 468.49 sec

Training time is longer than without the setting, even though the CPU cores do enter a spin loop as expected.

M1-OMP_WAIT_POLICY=active

Intel Core i7-8700B (6 Cores with SMT) w/ macOS

Since I can also use an Intel Mac, I compared it for reference.

$ sysctl -a | grep brand_string
machdep.cpu.brand_string: Intel(R) Core(TM) i7-8700B CPU @ 3.20GHz
$ uname -srm
Darwin 22.3.0 x86_64
$ sw_vers                      
ProductName:        macOS
ProductVersion:     13.2.1
BuildVersion:       22D68

Training time is 420 sec, close to the Apple M1. Given typical CPU benchmark scores for these chips, it feels odd that the results are this close.

$ python benchmark.py 
19642 MainThread __main__ INFO start
[1000]  cv_agg's binary_logloss: 0.123578 + 0.00186002
[2000]  cv_agg's binary_logloss: 0.107131 + 0.00233912
[3000]  cv_agg's binary_logloss: 0.110841 + 0.00293728
[4000]  cv_agg's binary_logloss: 0.122046 + 0.00382834
[5000]  cv_agg's binary_logloss: 0.13687 + 0.00474103
[6000]  cv_agg's binary_logloss: 0.152749 + 0.00592566
[7000]  cv_agg's binary_logloss: 0.167346 + 0.00691806
[8000]  cv_agg's binary_logloss: 0.173766 + 0.00690977
[9000]  cv_agg's binary_logloss: 0.17714 + 0.00691034
[10000] cv_agg's binary_logloss: 0.179344 + 0.00697653
19642 MainThread __main__ INFO end
19642 MainThread __main__ INFO elapsed time: 420.24 sec

The CPU load history at this time is shown below. All cores seem to be fully utilized by default.

default

Environment info

LightGBM version or commit hash: 3.3.5

Command(s) you used to install LightGBM

$ pip install lightgbm
$ sysctl -a | grep brand_string
machdep.cpu.brand_string: Apple M2 Pro
$ uname -srm
Darwin 22.3.0 arm64
$ sw_vers                      
ProductName:        macOS
ProductVersion:     13.2.1
BuildVersion:       22D68
$ brew info libomp | head -n 1
==> libomp: stable 15.0.7 (bottled) [keg-only]

Additional Comments

I was curious whether this behavior was limited to LightGBM, so I ran the same type of benchmark with XGBoost. The environment was the Apple M2 Pro SoC with macOS.

import contextlib
import logging
import time

import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold

LOGGER = logging.getLogger(__name__)

@contextlib.contextmanager
def stopwatch():
    start = time.time()
    LOGGER.info("start: XGBoost")

    yield

    end = time.time()
    LOGGER.info("end: XGBoost")

    elapsed = end - start
    LOGGER.info("elapsed time: %.2f sec", elapsed)

def main():
    log_fmt = (
        "%(process)d %(threadName)s %(name)s %(levelname)s %(message)s"
    )
    logging.basicConfig(level=logging.INFO, format=log_fmt)

    args = {
        "n_samples": 100_000,
        "n_features": 100,
        "n_informative": 20,
        "n_redundant": 0,
        "n_repeated": 0,
        "class_sep": 0.65,
        "n_classes": 2,
        "random_state": 42,
        "shuffle": False,
    }
    x, y = make_classification(**args)

    folds = StratifiedKFold(
        n_splits=5,
        shuffle=True,
        random_state=42,
    )

    xgb_train = xgb.DMatrix(x, label=y)
    xgb_callbacks = [
        xgb.callback.EvaluationMonitor(
            period=1_000,
        ),
    ]
    xgb_params = {
        "objective": "binary:logistic",
        "eval_metric": "logloss",
        "tree_method": "hist",
        "seed": 42,
    }

    with stopwatch():
        xgb.cv(
            params=xgb_params,
            dtrain=xgb_train,
            num_boost_round=10_000,
            folds=folds,
            callbacks=xgb_callbacks,
        )

if __name__ == "__main__":
    main()

I did not think it would be meaningful to compare training times directly, so I only observed the CPU load history.

$ python benchmark.py
...

The CPU load history at this time is the following.

XGBoost

Even XGBoost does not seem to fully utilize all CPU cores. The performance cores show high-frequency waves somewhat similar to LightGBM's. However, the efficiency cores are properly utilized.

Please let me know if there is anything I should investigate further. I am sorry if I have fundamentally misunderstood something.

shiyu1994 commented 1 year ago

@momijiame Thanks for the very in-depth benchmarking. In LightGBM, we don't differentiate between different types of CPU cores, like the performance and efficiency cores in Apple Silicon; every thread is treated equally. Could that be a reason why not all cores on Apple Silicon are fully utilized?

momijiame commented 1 year ago

@shiyu1994 I do not know what is causing this issue. Although it is not an exact comparison, the Intel Core i7-12700 CPU (8 performance cores + 4 efficiency cores), which uses a similar heterogeneous architecture, can have its CPU cores fully utilized under Linux. Sorry if I am missing the point, but I think comparing the source code with XGBoost might give some clues, since both are GBDT frameworks built on OpenMP.
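
For reference, the per-cluster core counts can be inspected on recent macOS versions via the `hw.perflevel*` sysctl keys (on Apple Silicon, perflevel0 is the performance cluster and perflevel1 the efficiency cluster). A minimal, illustrative parser for that output; the key names are assumed from recent macOS releases, so check `sysctl hw` on your own machine:

```python
import re

def parse_perflevels(sysctl_output: str) -> dict[int, int]:
    """Map perflevel index -> logical CPU count from `sysctl` output.

    Expects lines like "hw.perflevel0.logicalcpu: 8" as printed by
    `sysctl hw.perflevel0.logicalcpu hw.perflevel1.logicalcpu`.
    """
    pattern = r"hw\.perflevel(\d+)\.logicalcpu:\s*(\d+)"
    return {int(i): int(n) for i, n in re.findall(pattern, sysctl_output)}

# Example with output captured on an Apple M2 Pro (8P + 4E):
sample = "hw.perflevel0.logicalcpu: 8\nhw.perflevel1.logicalcpu: 4\n"
print(parse_perflevels(sample))  # {0: 8, 1: 4}
```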