rasbt / machine-learning-notes

Collection of useful machine learning codes and snippets (originally intended for my personal use)
BSD 3-Clause "New" or "Revised" License

Adding M1-Max result #3

Closed thipokKub closed 2 years ago

thipokKub commented 2 years ago

I ran your script and got the following results.

My machine is an M1 Max (32-core, 64 GB); approximate RAM usage was 25-30 GB.
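
For reference, the script's device selection presumably looks something like this minimal sketch (not the exact benchmark code):

```python
import torch

# Use the Apple-silicon GPU backend when available (requires a PyTorch
# nightly from around May 2022 or newer); otherwise fall back to the CPU.
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
print("torch", torch.__version__)
print("device", device)
```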

torch 1.12.0.dev20220518
device mps
Epoch: 001/001 | Batch 0000/1406 | Loss: 2.3857
Epoch: 001/001 | Batch 0100/1406 | Loss: 2.4062
Epoch: 001/001 | Batch 0200/1406 | Loss: 2.1027
Epoch: 001/001 | Batch 0300/1406 | Loss: 2.0253
Epoch: 001/001 | Batch 0400/1406 | Loss: 2.1160
Epoch: 001/001 | Batch 0500/1406 | Loss: 1.9523
Epoch: 001/001 | Batch 0600/1406 | Loss: 1.9365
Epoch: 001/001 | Batch 0700/1406 | Loss: 2.3179
Epoch: 001/001 | Batch 0800/1406 | Loss: 1.9971
Epoch: 001/001 | Batch 0900/1406 | Loss: 1.7516
Epoch: 001/001 | Batch 1000/1406 | Loss: 1.8922
Epoch: 001/001 | Batch 1100/1406 | Loss: 1.8546
Epoch: 001/001 | Batch 1200/1406 | Loss: 1.7630
Epoch: 001/001 | Batch 1300/1406 | Loss: 1.8767
Epoch: 001/001 | Batch 1400/1406 | Loss: 1.5391
Time / epoch without evaluation: 42.28 min
Epoch: 001/001 | Train: 0.00% | Validation: 0.00% | Best Validation (Ep. 001): 0.00%
Time elapsed: 48.54 min
Total Training Time: 48.54 min
Test accuracy 0.00%
Total Time: 49.99 min
rasbt commented 2 years ago

Nice, thanks! I am currently updating the benchmarks to add inference speeds, and I will add yours too (with acknowledgement).

rasbt commented 2 years ago

[Screenshots: updated benchmark results, 2022-05-20]

dangrie158 commented 2 years ago

This might be interesting for you: I just maxed out my M1 Max with a batch size of 128 and got the following:

[Screenshot: M1 Max results with batch size 128, 2022-05-21]

I guess a 128 GB MacBook could fit a batch size of 256.
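
Changing the batch size should just be the loader argument. A minimal sketch, assuming the benchmark's CIFAR-10 + VGG-16 setup (the 1406 batches at size 32 in the logs above are consistent with a ~45k-image training split):

```python
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Assumed setup: CIFAR-10 training split, as in the benchmark script.
train_dataset = datasets.CIFAR10(root="data", train=True, download=True,
                                 transform=transforms.ToTensor())

# Raising batch_size grows activation memory roughly linearly for VGG-16,
# which is why 64 GB tops out around 128 and 128 GB might fit 256.
train_loader = DataLoader(train_dataset, batch_size=128, shuffle=True,
                          drop_last=True, num_workers=2)
```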

rasbt commented 2 years ago

Nice, thanks for sharing! I also got slightly faster results with batch size 64 (instead of the original 32) on the M1 Pro. But yeah, I would then have to increase the batch size for the Nvidia cards and the regular M1 as well to make it a fair comparison.

lutchu commented 2 years ago

Test accuracy 0.00%

Is this expected?

rasbt commented 2 years ago

Yeah, it's weird, but it happened to me too. E.g., on the CPU it's fine, but on the GPU, on the exact same machine with the exact same code, I also got this issue.

thipokKub commented 2 years ago

Could you try the new nightly build? According to https://github.com/pytorch/pytorch/issues/77753, there is a new nightly version that seems to fix the problem (the RAM usage did not grow as much as before).

P.S. These results are now outdated.

Batch size 32, Optimizer Adam

torch 1.13.0.dev20220521
device mps
Epoch: 001/001 | Batch 0000/1406 | Loss: 2.5649
Epoch: 001/001 | Batch 0100/1406 | Loss: 2.2909
Epoch: 001/001 | Batch 0200/1406 | Loss: 1.9338
Epoch: 001/001 | Batch 0300/1406 | Loss: 2.1974
Epoch: 001/001 | Batch 0400/1406 | Loss: 1.9835
Epoch: 001/001 | Batch 0500/1406 | Loss: 2.3454
Epoch: 001/001 | Batch 0600/1406 | Loss: 1.9466
Epoch: 001/001 | Batch 0700/1406 | Loss: 2.0661
Epoch: 001/001 | Batch 0800/1406 | Loss: 1.9958
Epoch: 001/001 | Batch 0900/1406 | Loss: 2.0933
Epoch: 001/001 | Batch 1000/1406 | Loss: 1.7824
Epoch: 001/001 | Batch 1100/1406 | Loss: 1.7589
Epoch: 001/001 | Batch 1200/1406 | Loss: 1.8833
Epoch: 001/001 | Batch 1300/1406 | Loss: 2.1066
Epoch: 001/001 | Batch 1400/1406 | Loss: 1.6518
Time / epoch without evaluation: 39.58 min
Epoch: 001/001 | Train: 0.00% | Validation: 0.00% | Best Validation (Ep. 001): 0.00%
Time elapsed: 45.78 min
Total Training Time: 45.78 min
Test accuracy 0.00%
Total Time: 47.23 min

Batch size 128, Optimizer SGD(lr=0.1, momentum=0.9)

torch 1.13.0.dev20220521
device mps
Files already downloaded and verified
Using cache found in /Users/thipok.tham/.cache/torch/hub/pytorch_vision_v0.11.0
Epoch: 001/001 | Batch 0000/0351 | Loss: 2.5026
Epoch: 001/001 | Batch 0100/0351 | Loss: 2.3055
Epoch: 001/001 | Batch 0200/0351 | Loss: 2.3061
Epoch: 001/001 | Batch 0300/0351 | Loss: 2.2998
Time / epoch without evaluation: 29.76 min
Epoch: 001/001 | Train: 0.00% | Validation: 0.00% | Best Validation (Ep. 001): 0.00%
Time elapsed: 34.45 min
Total Training Time: 34.45 min
Test accuracy 0.00%
Total Time: 35.55 min
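
For clarity, the two configurations above correspond to something like the following sketch; loading VGG-16 via torch.hub is inferred from the cache message in the log:

```python
import torch

# Assumed model source, based on the "pytorch_vision_v0.11.0" cache line above.
model = torch.hub.load("pytorch/vision:v0.11.0", "vgg16")

# Run 1: batch size 32 with Adam (learning rate not stated above; default used here).
optimizer = torch.optim.Adam(model.parameters())

# Run 2: batch size 128 with SGD plus momentum, matching the header above.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
```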
thipokKub commented 2 years ago

Test accuracy 0.00%

Is this expected?

In compute_accuracy, I think the result should be converted to the same device first, like the following:

correct_pred += (predicted_labels.cpu() == targets.cpu()).sum()

This results in non-zero accuracy (at least for me).
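
In context, the fixed helper might look like this minimal sketch (the surrounding function body is an assumption; only the .cpu() comparison is the actual change):

```python
import torch

def compute_accuracy(model, data_loader, device):
    # Hypothetical reconstruction of the benchmark's accuracy helper.
    model.eval()
    correct_pred, num_examples = 0, 0
    with torch.no_grad():
        for features, targets in data_loader:
            features = features.to(device)
            targets = targets.to(device)
            logits = model(features)
            predicted_labels = torch.argmax(logits, dim=1)
            num_examples += targets.size(0)
            # Compare on the CPU: comparing the tensors on the "mps" device
            # yielded 0 matches on these nightlies, hence the 0.00% above.
            correct_pred += (predicted_labels.cpu() == targets.cpu()).sum()
    return correct_pred.float() / num_examples * 100
```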

Testing on the pytorch-nightly build 1.13.0.dev20220522, batch size 32, optimizer Adam:

torch 1.13.0.dev20220522
device mps
Epoch: 001/001 | Batch 0000/1406 | Loss: 2.6720
Epoch: 001/001 | Batch 0100/1406 | Loss: 2.3715
Epoch: 001/001 | Batch 0200/1406 | Loss: 2.3356
Epoch: 001/001 | Batch 0300/1406 | Loss: 2.0791
Epoch: 001/001 | Batch 0400/1406 | Loss: 1.9815
Epoch: 001/001 | Batch 0500/1406 | Loss: 2.0724
Epoch: 001/001 | Batch 0600/1406 | Loss: 1.9088
Epoch: 001/001 | Batch 0700/1406 | Loss: 2.1451
Epoch: 001/001 | Batch 0800/1406 | Loss: 2.2497
Epoch: 001/001 | Batch 0900/1406 | Loss: 2.1637
Epoch: 001/001 | Batch 1000/1406 | Loss: 2.2672
Epoch: 001/001 | Batch 1100/1406 | Loss: 1.8210
Epoch: 001/001 | Batch 1200/1406 | Loss: 1.7867
Epoch: 001/001 | Batch 1300/1406 | Loss: 1.8080
Epoch: 001/001 | Batch 1400/1406 | Loss: 1.6069
Time / epoch without evaluation: 31.54 min
Epoch: 001/001 | Train: 32.69% | Validation: 32.92% | Best Validation (Ep. 001): 32.92%
Time elapsed: 38.46 min
Total Training Time: 38.46 min
Test accuracy 32.59%
Total Time: 40.00 min

Update

To mimic what is shown in the official blog, I re-ran the VGG-16 test with batch size 32 and the Adam optimizer:

torch 1.13.0.dev20220522
device mps
Epoch: 001/001 | Batch 0000/0703 | Loss: 2.4892
Epoch: 001/001 | Batch 0100/0703 | Loss: 2.6017
Epoch: 001/001 | Batch 0200/0703 | Loss: 2.1060
Epoch: 001/001 | Batch 0300/0703 | Loss: 2.0144
Epoch: 001/001 | Batch 0400/0703 | Loss: 2.0236
Epoch: 001/001 | Batch 0500/0703 | Loss: 2.0377
Epoch: 001/001 | Batch 0600/0703 | Loss: 1.8039
Epoch: 001/001 | Batch 0700/0703 | Loss: 1.9811
Time / epoch without evaluation: 22.25 min
Epoch: 001/001 | Train: 35.16% | Validation: 36.16% | Best Validation (Ep. 001): 36.16%
Time elapsed: 26.50 min
Total Training Time: 26.50 min
Test accuracy 35.95%
Total Time: 27.50 min

The memory usage of just the Python script is around 28 GB overall. I tried changing the batch size to 128, but there is a bottleneck in data processing, and the memory usage still did not exceed 32 GB of RAM.

So the speed-up from CPU to MPS on the M1 Max is about 110.48/22.25 ≈ 4.97x for training, and 8.51/1.00 = 8.51x for inference.
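
Spelled out as a quick sanity check (the CPU baseline minutes are the blog's numbers; see the P.S. below):

```python
# CPU baselines from the blog post; MPS numbers from the run above.
cpu_train_min, mps_train_min = 110.48, 22.25  # minutes per training epoch
cpu_infer_min, mps_infer_min = 8.51, 1.00     # minutes for inference

print(f"training speed-up:  {cpu_train_min / mps_train_min:.2f}x")  # ~4.97x
print(f"inference speed-up: {cpu_infer_min / mps_infer_min:.2f}x")  # 8.51x
```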

Considering that the official blog uses an M1 Ultra with a training speed-up of around 8x (I'm just eyeballing), this is not that bad. I guess the pattern is roughly M1 Pro ~3x, M1 Max ~5x, and M1 Ultra ~8x training speed-up compared to the CPU. But that is nothing compared to the GTX 1080 Ti at ~14x, the RTX 2080 Ti at ~20x, or the RTX 3080 at ~16.5x.

P.S. I am referencing the blog's CPU number (which uses a batch size of 32, so this might not be totally correct, but it should be close anyway).

rasbt commented 2 years ago

The original code works fine on the NVIDIA GPU, but good catch, I will give it a try later!

rasbt commented 2 years ago

@thipokKub There was a new nightly release that fixed the RAM leak issue. I was rerunning the experiments, and it was actually quite a bit faster (I'm currently putting together the new results). No pressure, but in case you have time and want to run the M1 Max results again, that'd be cool.

thipokKub commented 2 years ago

Already did @rasbt šŸ™‚

torch 1.13.0.dev20220522
device mps
Epoch: 001/001 | Batch 0000/0421 | Loss: 2.3098
Epoch: 001/001 | Batch 0100/0421 | Loss: 0.2646
Epoch: 001/001 | Batch 0200/0421 | Loss: 0.1437
Epoch: 001/001 | Batch 0300/0421 | Loss: 0.1010
Epoch: 001/001 | Batch 0400/0421 | Loss: 0.0732
Time / epoch without evaluation: 0.16 min
Epoch: 001/001 | Train: 97.33% | Validation: 97.77% | Best Validation (Ep. 001): 97.77%
Time elapsed: 0.22 min
Total Training Time: 0.22 min
Test accuracy 97.39%
Total Time: 0.24 min
torch 1.13.0.dev20220522
device mps
Epoch: 001/001 | Batch 0000/0421 | Loss: 2.3063
Epoch: 001/001 | Batch 0100/0421 | Loss: 0.3431
Epoch: 001/001 | Batch 0200/0421 | Loss: 0.3089
Epoch: 001/001 | Batch 0300/0421 | Loss: 0.3688
Epoch: 001/001 | Batch 0400/0421 | Loss: 0.3544
Time / epoch without evaluation: 0.09 min
Epoch: 001/001 | Train: 91.74% | Validation: 93.48% | Best Validation (Ep. 001): 93.48%
Time elapsed: 0.13 min
Total Training Time: 0.13 min
Test accuracy 92.19%
Total Time: 0.15 min
rasbt commented 2 years ago

Oh, how did I not see that šŸ˜…! Thanks!

rasbt commented 2 years ago

Alright, just updated the results at https://sebastianraschka.com/blog/2022/pytorch-m1-gpu.html

thipokKub commented 2 years ago

pytorch 1.13.0.dev20220610 seems to be significantly faster (about 10 minutes less). Some observations:

torch 1.13.0.dev20220610
device mps
Epoch: 001/001 | Batch 0000/1406 | Loss: 2.5373
Epoch: 001/001 | Batch 0100/1406 | Loss: 2.0927
Epoch: 001/001 | Batch 0200/1406 | Loss: 2.1096
Epoch: 001/001 | Batch 0300/1406 | Loss: 2.0650
Epoch: 001/001 | Batch 0400/1406 | Loss: 1.8195
Epoch: 001/001 | Batch 0500/1406 | Loss: 1.9852
Epoch: 001/001 | Batch 0600/1406 | Loss: 2.0264
Epoch: 001/001 | Batch 0700/1406 | Loss: 1.9916
Epoch: 001/001 | Batch 0800/1406 | Loss: 1.9276
Epoch: 001/001 | Batch 0900/1406 | Loss: 1.8869
Epoch: 001/001 | Batch 1000/1406 | Loss: 2.0278
Epoch: 001/001 | Batch 1100/1406 | Loss: 1.9551
Epoch: 001/001 | Batch 1200/1406 | Loss: 1.7823
Epoch: 001/001 | Batch 1300/1406 | Loss: 1.7606
Epoch: 001/001 | Batch 1400/1406 | Loss: 1.9401
Time / epoch without evaluation: 20.43 min
Epoch: 001/001 | Train: 33.36% | Validation: 33.16% | Best Validation (Ep. 001): 33.16%
Time elapsed: 24.89 min
Total Training Time: 24.89 min
Test accuracy 33.11%
Total Time: 25.96 min

I am not sure, but maybe with the macOS Ventura release (with Metal 3), the performance can be improved further?

rasbt commented 2 years ago

Wow, thanks! I should do an update of this sometime! Haha, I may wait until the macOS Ventura release in the fall, though -- too many open projects atm šŸ˜…

jonasmerlin commented 1 year ago

Are there any updates on the performance of the M1 Max with Ventura and newer versions of PyTorch?

Edit: Forgot to say thanks for your original post to begin with! Already very helpful.

rasbt commented 1 year ago

I haven't had a chance to rerun the experiments on newer PyTorch versions.

callowaysutton commented 4 months ago

Update:


torch 2.4.0.dev20240501
device mps
Epoch: 001/001 | Batch 0000/1406 | Loss: 2.4635
Epoch: 001/001 | Batch 0100/1406 | Loss: 2.1030
Epoch: 001/001 | Batch 0200/1406 | Loss: 1.9349
Epoch: 001/001 | Batch 0300/1406 | Loss: 1.9248
Epoch: 001/001 | Batch 0400/1406 | Loss: 2.4663
Epoch: 001/001 | Batch 0500/1406 | Loss: 1.6643
Epoch: 001/001 | Batch 0600/1406 | Loss: 1.8012
Epoch: 001/001 | Batch 0700/1406 | Loss: 1.9272
Epoch: 001/001 | Batch 0800/1406 | Loss: 1.7155
Epoch: 001/001 | Batch 0900/1406 | Loss: 1.9375
Epoch: 001/001 | Batch 1000/1406 | Loss: 1.9579
Epoch: 001/001 | Batch 1100/1406 | Loss: 1.7168
Epoch: 001/001 | Batch 1200/1406 | Loss: 1.6184
Epoch: 001/001 | Batch 1300/1406 | Loss: 1.9197
Epoch: 001/001 | Batch 1400/1406 | Loss: 1.6845
Time / epoch without evaluation: 18.25 min
Epoch: 001/001 | Train: 34.86% | Validation: 36.02% | Best Validation (Ep. 001): 36.02%
Time elapsed: 21.98 min
Total Training Time: 21.98 min
Test accuracy 35.97%
Total Time: 22.90 min
rasbt commented 4 months ago

Thanks for sharing! I am surprised to see a decline in prediction accuracy here, similar to what the previous code showed above. Maybe a PyTorch version change triggered it; I may need to investigate and try different hparam settings.