smpanaro / coreml-llm-cli

CLI to demonstrate running a large language model (LLM) on Apple Neural Engine.

M3 Max Performance #3

Open Proryanator opened 1 month ago

Proryanator commented 1 month ago

I have an M3 Max with a 14-core CPU and 30-core GPU, 36GB of RAM, and 300GB/s memory bandwidth. I may know someone with the 16-core variant that I can reach out to; they will probably see better performance, since the higher-end chip has 400GB/s memory bandwidth.

swift run -c release LLMCLI --repo-id smpanaro/Llama-2-7b-coreml --max-new-tokens 80
Building for production...
[1/1] Write swift-version--58304C5D6DBC2206.txt
Build complete! (0.07s)
ModelPipeline Llama-2-7b-hf (13 chunks)
Compiling models: *************
Loading models  : *************
1 21882 6606 310 14653 431 310 278 16106 315 2696 11248 7738 278 7655 2722 515 607 26935 338 23892 29889 450 1023 1667 6606 7825 5584 18834 630 526 315 2696 11248 508 29872 561 2207 313 13716 29664 29897 322 315 2696 11248 25352 983 313 25822 983 467 13 1576 26935 8024 338 263 2319 3926 12692 14653 431 470 5447 393 338 7531 304 278 21881 12786 310 10557 29892 14325 29892 322 278 2163 5070 29889 450 8024 13880 2319 4796 18281 393 6668 290 297 24554 322 526 1248 1915 630 491 367 267 29889 450 18281 526 5643 491 2654 470 13328 7655 2722 393 1712 278 26935 367 550 

<s> Several species of shrub of the genus Coffea produce the berries from which coffee is extracted. The two main species commercially cultivated are Coffea canephora (robusta) and Coffea arabica (arabica).
The coffee plant is a small evergreen shrub or tree that is native to the tropical regions of Africa, Asia, and the Americas. The plant produces small white flowers that bloom in clusters and are pollinated by bees. The flowers are followed by red or yellow berries that contain the coffee beans

Compile + Load: 4.11 sec
Generate      : 128.44 +/- 2.23 ms / token
                7.79 +/- 0.12 token / sec

swift run -c release LLMCLI --repo-id smpanaro/Llama-2-7b-coreml --max-new-tokens 80
Building for production...
[1/1] Write swift-version--58304C5D6DBC2206.txt
Build complete! (0.07s)
ModelPipeline Llama-2-7b-hf (13 chunks)
Compiling models: *************
Loading models  : *************
1 21882 6606 310 14653 431 310 278 16106 315 2696 11248 7738 278 7655 2722 515 607 26935 338 23892 29889 450 1023 1667 6606 7825 5584 18834 630 526 315 2696 11248 508 29872 561 2207 313 13716 29664 29897 322 315 2696 11248 25352 983 313 25822 983 467 13 1576 26935 8024 338 263 2319 3926 12692 14653 431 470 5447 393 338 7531 304 278 21881 12786 310 10557 29892 14325 29892 322 278 2163 5070 29889 450 8024 13880 2319 4796 18281 393 6668 290 297 24554 322 526 1248 1915 630 491 367 267 29889 450 18281 526 5643 491 2654 470 13328 7655 2722 393 1712 278 26935 367 550 

<s> Several species of shrub of the genus Coffea produce the berries from which coffee is extracted. The two main species commercially cultivated are Coffea canephora (robusta) and Coffea arabica (arabica).
The coffee plant is a small evergreen shrub or tree that is native to the tropical regions of Africa, Asia, and the Americas. The plant produces small white flowers that bloom in clusters and are pollinated by bees. The flowers are followed by red or yellow berries that contain the coffee beans

Compile + Load: 3.29 sec
Generate      : 127.58 +/- 1.56 ms / token
                7.84 +/- 0.09 token / sec

smpanaro commented 1 month ago

Nice! This seems roughly on par with the other M3 Max results in the README, but I'll switch to yours since it has the load time too :)

On Twitter, we were hypothesizing that memory bandwidth doesn't help the ANE the way it helps the GPU (based on the small difference between M3 and M3 Max), but it would definitely be interesting to see M3 Max 300GB/s vs. M3 Max 400GB/s!
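For context on that hypothesis, here is a back-of-envelope sketch of what a purely bandwidth-bound decoder could reach. It assumes roughly 6.74e9 parameters for Llama-2-7B, 2 bytes per weight (fp16), and that every weight is read once per generated token; none of these figures come from this thread, and the function names are made up for illustration.

```python
# Back-of-envelope bandwidth ceiling for single-stream token generation.
# ASSUMPTIONS (not from the thread): parameter count, fp16 weights,
# and one full pass over the weights per generated token.
PARAMS = 6.74e9          # approximate Llama-2-7B parameter count
BYTES_PER_WEIGHT = 2     # assumes fp16 storage
MODEL_GB = PARAMS * BYTES_PER_WEIGHT / 1e9


def tokens_per_sec(ms_per_token: float) -> float:
    """Convert the CLI's 'ms / token' figure into 'token / sec'."""
    return 1000.0 / ms_per_token


def bandwidth_bound_tokens_per_sec(bandwidth_gb_s: float) -> float:
    """Upper bound on token / sec if weight reads were the only cost."""
    return bandwidth_gb_s / MODEL_GB


def implied_bandwidth_gb_s(observed_tokens_per_sec: float) -> float:
    """Weight traffic (GB/s) implied by an observed generation speed."""
    return observed_tokens_per_sec * MODEL_GB


if __name__ == "__main__":
    for bw in (300, 400):
        bound = bandwidth_bound_tokens_per_sec(bw)
        print(f"{bw} GB/s ceiling: {bound:.1f} token / sec")
    observed = tokens_per_sec(128.44)  # first run reported above
    print(f"observed: {observed:.2f} token / sec, "
          f"implied traffic: {implied_bandwidth_gb_s(observed):.0f} GB/s")
```

Under these assumptions the ceilings come out around 22 and 30 token / sec, well above the measured ~7.8, which would be consistent with the ANE path not being limited by the SoC's headline memory bandwidth. If the converted model uses a different weight precision, the numbers shift accordingly.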

smpanaro commented 1 month ago

Actually, did you happen to see your ANE power usage? (Like in asitop or mactop.) Guessing it will be similar but would rather not mix samples from different computers in the README.

Proryanator commented 1 month ago

> Actually, did you happen to see your ANE power usage? (Like in asitop or mactop.) Guessing it will be similar but would rather not mix samples from different computers in the README.

I can re-run it and check asitop just in case 👍 Looks like my MacBook was using 5.7W (I made sure it was in high power mode):

Screenshot 2024-07-25 at 8 50 03 AM

hotellonely commented 1 week ago

@Proryanator hey asitop has been abandoned and can report things wrong :( (i would miss it) probably try pumas! But I would test your code on my 16core/64GB variant and report back

smpanaro commented 1 week ago

asitop, mactop, and pumas all use powermetrics under the covers for ANE, so I believe for the raw Watts we should be safe to compare across tools. (Percentage usage might be a different story.)
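To make the "raw Watts" comparison concrete, here is a minimal sketch that pulls ANE power out of powermetrics-style text and turns it into energy per token. The `ANE Power: N mW` line format is an assumption about the text output, the sample block is hypothetical apart from the 5.7W figure reported above, and the helper names are illustrative only.

```python
import re

# Hypothetical sample in a powermetrics-like text layout. Only the 5700 mW
# ANE value mirrors the 5.7W reported in this thread; the CPU and GPU
# lines are made up for illustration.
SAMPLE = """\
CPU Power: 812 mW
GPU Power: 25 mW
ANE Power: 5700 mW
"""

# ASSUMPTION: ANE power appears on its own line as "ANE Power: <number> mW".
ANE_LINE = re.compile(r"^ANE Power:\s*(\d+(?:\.\d+)?)\s*mW", re.MULTILINE)


def ane_watts(text: str) -> float:
    """Average every ANE power reading found in the text, in Watts."""
    readings = [float(mw) / 1000.0 for mw in ANE_LINE.findall(text)]
    if not readings:
        raise ValueError("no 'ANE Power' lines found")
    return sum(readings) / len(readings)


def joules_per_token(watts: float, ms_per_token: float) -> float:
    """Energy per generated token: power (W) times time per token (s)."""
    return watts * ms_per_token / 1000.0


if __name__ == "__main__":
    watts = ane_watts(SAMPLE)
    print(f"ANE power: {watts:.2f} W")
    print(f"energy: {joules_per_token(watts, 128.44):.2f} J / token")
```

Note that the 5.7W reading came from a re-run, so pairing it with the 128.44 ms / token figure from the first run is only illustrative.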

smpanaro commented 1 week ago

Also, worth noting that I have updated the model since this post so it should be faster!

Proryanator commented 1 week ago

> @Proryanator hey asitop has been abandoned and can report things wrong :( (i would miss it) probably try pumas!
>
> But I would test your code on my 16core/64GB variant and report back

Just curious, how do you know that it has been abandoned? From the list of issues going back a few years?

hotellonely commented 1 week ago

> > @Proryanator hey asitop has been abandoned and can report things wrong :( (i would miss it) probably try pumas! But I would test your code on my 16core/64GB variant and report back
>
> Just curious how do you know that it has been abandoned? The list of issues going back a few years?

Because it hasn't been updated for the M3 series yet, and even the pull requests with proven fixes haven't been merged. It has been sitting like that for a very long time now (the last commit was from Jan 2023). ASITOP is purely hard-coded per chip model, so when it's wrong, it's very, very wrong...