Benchmarks of PyTorch on Apple Silicon.
This is a work in progress; if there is a dataset or model you would like to see added, just open an issue or a PR.
Create a conda env with Python compiled for osx-arm64 and activate it with:

```bash
CONDA_SUBDIR=osx-arm64 conda create -n native python -c conda-forge
conda activate native
```
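Before installing anything into the env, it can be worth verifying that it really runs native arm64 Python rather than an x86_64 build under Rosetta. A minimal check (the `is_native_arm64` helper is just illustrative):

```python
import platform

def is_native_arm64() -> bool:
    """True when the interpreter itself runs natively on Apple Silicon."""
    return platform.machine() == "arm64"

# On a correctly created osx-arm64 env this prints "arm64"; under Rosetta
# or on an Intel machine it prints "x86_64".
print(platform.machine())
```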
and install the PyTorch nightly build with:

```bash
pip install --pre torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/nightly/cpu
```
and finally install `datasets` and `transformers` with:

```bash
pip install transformers datasets
```
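After installation, you can check that the MPS backend is actually usable before launching any benchmarks. A small sketch; the `pick_device` helper is illustrative (not part of this repo) and falls back gracefully when torch or MPS is absent:

```python
def pick_device() -> str:
    """Pick the best available PyTorch device, preferring MPS on Apple Silicon."""
    try:
        import torch
    except ImportError:
        return "cpu"  # torch not installed yet
    mps = getattr(torch.backends, "mps", None)  # attribute absent on torch < 1.12
    if mps is not None and mps.is_available():
        return "mps"
    if torch.cuda.is_available():
        return "cuda"
    return "cpu"

print(pick_device())
```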
Run the experiments by yourself with:

```bash
python tests/transformers_sequence_classification.py \
    --device <cpu|cuda|mps> \
    --pre_trained_name <bert-base-cased|bert-large-cased> \
    --batch_size <32|64|128> \
    --mode <training|inference> \
    --steps 100 \
    --sequence_length <128|512>
```
The following tables show the time needed to complete 100 steps without gradient accumulation. `-` means that the script went out of memory. All experiments have been run with `float32`.
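The measurement itself amounts to timing a fixed number of steps. The script's internals are not reproduced here; the sketch below is a generic stand-in (the `benchmark` helper and its arguments are illustrative, not the script's actual API):

```python
import time
from typing import Callable

def benchmark(step_fn: Callable[[], None], steps: int = 100, warmup: int = 5) -> float:
    """Time `steps` calls of `step_fn`, excluding a few warm-up iterations."""
    for _ in range(warmup):  # warm-up: compile kernels, fill caches
        step_fn()
    start = time.perf_counter()
    for _ in range(steps):
        step_fn()
    return time.perf_counter() - start

# Example with a trivial step; a real run would do a forward (and backward) pass.
elapsed = benchmark(lambda: sum(range(1000)), steps=100)
print(f"{elapsed:.4f}s for 100 steps")
```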
### bert-base-cased

Results were obtained with PyTorch nightly builds `1.13.0.dev20220601`, `1.13.0.dev20220604` and `2.2.0.dev20231113`, depending on the machine.
Training:
Batch size | Sequence length | M1 Max CPU (32GB) | M1 Max GPU 32-core (32GB) | M1 Ultra 48-core (64GB) | M2 Ultra GPU 60-core (64GB) | M3 Pro GPU 14-core (18GB) | M3 Max GPU 40-core (64GB) | V100 (16GB) | T4 (16GB) |
---|---|---|---|---|---|---|---|---|---|
16 | 128 | 2m 29s | 1m 3s | TBD | TBD | TBD | TBD | 12s | 31s |
64 | 128 | 8m 32s | 2m 57s | TBD | 49s | 2m 36s | 1m 13s | 41s | 2m |
256 | 128 | 50m 10s | 1h 49m 9s | TBD | TBD | TBD | TBD | - | - |
16 | 512 | 11m 22s | 9m 28s | TBD | TBD | TBD | 1m 24s | 47s | 2m 25s |
64 | 512 | 1h 21m 2s | 3h 26m 4s | TBD | TBD | TBD | TBD | - | - |
256 | 512 | 6h 33m 7s | - | TBD | TBD | TBD | TBD | - | - |
Inference:
Batch size | Sequence length | M1 Max CPU (32GB) | M1 Max GPU 32-core (32GB) | M1 Ultra 48-core (64GB) | M2 Ultra GPU 60-core (64GB) | M3 Pro GPU 14-core (18GB) | M3 Max GPU 40-core (64GB) | V100 (16GB) | T4 (16GB) |
---|---|---|---|---|---|---|---|---|---|
16 | 128 | 52s | 16s | 9s | TBD | TBD | TBD | 4s | 10s |
64 | 128 | 3m 2s | 50s | 20s | 47s | 51s | 21s | 13s | 44s |
256 | 128 | 11m 25s | 3m 22s | 76s | TBD | TBD | TBD | 54s | 2m 52s |
16 | 512 | 4m 22s | 1m 1s | 24s | TBD | TBD | 1m 2s | 16s | 54s |
64 | 512 | 17m 51s | 3m 59s | 1m 27s | TBD | TBD | TBD | 1m 4s | 3m 24s |
256 | 512 | 1h 10m 41s | 15m 47s | 5m 42s | TBD | TBD | TBD | 4m 10s | 14m 18s |
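To compare configurations with different batch sizes, the times above can be converted to throughput in samples per second. For example, the M1 Max GPU inference entry for batch 64 / sequence 128 (50s for 100 steps) works out to 64 × 100 / 50 = 128 samples/s:

```python
def throughput(batch_size: int, steps: int, seconds: float) -> float:
    """Samples processed per second for a timed benchmark run."""
    return batch_size * steps / seconds

print(throughput(64, 100, 50.0))  # → 128.0
```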
Note that the Apple GPU has no `float64` support and no fp16 tensor cores.

If the installation of `tokenizers` fails because Rust is missing, do the following:
```bash
brew install rustup
rustup-init
source ~/.cargo/env
```