Apple M1 chip is much better at everything but very slow in chess

lex312 commented 3 years ago

What are the reasons? Which of these cause the slowdown, and in what detail?
- M1 chip
- macOS Big Sur
- Stockfish engine (source code)

Any ideas how to fix one problem or another? Would it be possible to use other parts of the chip to speed up Stockfish?

In this thread they have found some problems and things that could be improved: https://forums.macrumors.com/threads/apple-m1-cpu-gpu-speed-is-very-disappointing.2293062/

ASM-Master commented 3 years ago

All 3 reasons. Stockfish engine (source code) doesn't support macOS or Apple Silicon. Stockfish developers don't care about macOS or Apple Silicon, even if it is faster. They would rather support lower market share Linux, and slower Intel 🤷

It is possible to use other parts of the chip, such as Apple AMX2, but you can forget that ever happening, since they won't even support macOS in the first place. Instead they rely on third parties such as Daylen Yang, a Google engineer, to update the app every now and then 🤦

ribbit-prog commented 3 years ago

Let's get real there...

Special builds of Stockfish, cFish, etc. are already optimized to use the M1's NEON, etc. But they are still less than half the speed of similarly priced, similarly sized computers. The fact of the matter is that the M1 isn't as fast as the usual influencer types (fanbois) make it out to be. A similarly priced modern CPU from AMD runs circles around it. For certain use cases it may be "ok" for its "watt", but let's keep it real: the CPU is faster in Apple's PowerPoint presentations than it is in real-life performance. It's more or less a glorified standard ARM big.LITTLE phone CPU with a focus mainly on the low-power, slow "little" cores, and it relies heavily on optimized code to even be comparable to Intel/AMD these days. The CPU is overrated and underperforming (not only for chess). Just compare it with amazing new stuff like the AMD 5700G and realize that anyone looking for "real" CPU+GPU performance for the dollar should look elsewhere than the fruity un-open company these days.

noobpwnftw commented 3 years ago

Let's get more real...

There is even support for https://github.com/official-stockfish/Stockfish/commit/b62af7ac1e78c1b35103dfe6110201d0b810aee0 which nobody I know has ever heard of.

Or if Apple wishes to donate 2000 cores of whatever CPU architecture they want us to optimize for on fishtest, showcasing their real performance capabilities, people will try to make the most of them without a doubt and it will be mutually beneficial.

ASM-Master commented 3 years ago

Let's get even more real...

Yes Stockfish does support NEON but that's about it; all the optimizations go towards x86 SIMD such as AVX and BMI. Show me a fanless laptop that has 18 hours of battery, with great speakers, microphones, keyboard, P3 wide-color gamut display, that tops most benchmarks compared to other laptops in its class, all in a slim portable design for $999. You can go to any YouTube channel or website and see that the M1 smokes all other laptops in its class when it comes to photo editing, video editing, code compilation, etc., so please tell me, if these aren't "real-life" performance, exactly what is?? I can tell you're a delusional fanboy from the fact that you said it's "ok" for performance / watt. LMAO, how brainwashed do you need to be? Name 1 laptop that has better performance / watt, I'll wait.... I've had Android phones all my life, and still use a Windows desktop, but can still give credit where credit is due, and admit Apple did a great job with its M1 processor. I don't like their business practices, but can still admit that they have one of the best semiconductor design teams in the world.

vondele commented 3 years ago

very simple reason... nobody made a PR for M1. Simply do it..

noobpwnftw commented 3 years ago

Any smartphone capable of running SF will probably have better performance per watt, and that's just a simple fact. As for raw NPS benchmark numbers, the laptop simply isn't that fast compared to many alternatives, especially if you do away with fanless design, battery life, front and rear cameras, touch pad, great speakers, microphones, keyboard, P3 wide-color gamut display, and that's another simple fact.

So as far as I understand, SF:
~~1) compiles natively on the chip architecture~~
~~2) makes use of its SIMD capabilities~~
~~3) runs with fairly reasonable performance~~
4) works

So what kind of optimizations are you looking for?

gekkehenker commented 3 years ago

> Let's get even more real...
>
> Yes Stockfish does support NEON but that's about it

Neon is also just not that good. SVE2 will change that, but your M1 doesn't support that.

If you want better support for its instruction set you'll have to code it yourself.

And like with other ARM chips, it does simple instructions incredibly well but starts losing ground with more complex ones.

Sopel97 commented 3 years ago

I will optimize for M1 if you buy me an M1 computer

lex312 commented 3 years ago

I hope that you have seen WWDC21: https://www.apple.com/apple-events/june-2021/ Apple has a lot of improved and new coding stuff. It should make Stockfish on M1 maybe 3 to 10 times faster ;-)

MichaelB7 commented 3 years ago

The Stockfish community would welcome any developer to come in to help with coding for the M1. For whatever reason, most of them do not seem to be interested, which is a little odd since over 10% of the world plays chess. Just like our friends in Japan who handed us NNUE, the opportunity is there for anyone to come join us. We are all here pro bono and we are doing what we like to do to help out. That's how it works.

Sopel97 commented 3 years ago

> The Stockfish community would welcome any developer to come in to help with coding for the M1. For whatever reason, most of them do not seem to be interested, which is a little odd since over 10% of the world plays chess. Just like our friends in Japan who handed us NNUE, the opportunity is there for anyone to come join us. We are all here pro bono and we are doing what we like to do to help out. That's how it works.

Interest is only one thing. Apple actively blocks anyone not owning Apple products from doing any form of development for Apple. Which means that any interested party must own Apple hardware to develop for it.

vondele commented 3 years ago

so, looking for solutions, @domschl added the initial apple silicon support, maybe he has a chance to look for speedups on M1?

noobpwnftw commented 3 years ago

Does it have any form of documentation on the available instructions and how fast they are? Not gonna work with a black box.

vondele commented 3 years ago

probably here https://documentation-service.arm.com/static/60119835773bb020e3de6fee?token=. (8538 pages)

IIUC M1 is Armv8.5 and the SIMD docs should be on page 1503

Sopel97 commented 3 years ago

I looked at it some time ago, it's not processor specific and doesn't contain any information relevant for development.

domschl commented 3 years ago

Currently the apple-silicon flavor of Stockfish just uses NEON. There's no optimization yet that uses the additional hardware available on M1 (e.g. the rumored matrix multiplier or the Neural Engine). One way to use that could be Apple's Accelerate framework (will be shown in today's session at WWDC). So there definitely is potential, lots of unknowns, and for sure no easy task.

That being said, why do you think M1 is that slow? Some quick benches with current master build:

Mac mini M1
===========================
Total time (ms) : 2230
Nodes searched  : 5530620
Nodes/second    : 2480098

iMac (Retina 5K, 27-inch, 2017, 3.8 GHz Quad-Core Intel Core i5)
===========================
Total time (ms) : 3219
Nodes searched  : 5530620
Nodes/second    : 1718117

MacBook Pro (13-inch, 2020, Four Thunderbolt 3 ports, 2 GHz Quad-Core Intel Core i5)
===========================
Total time (ms) : 4196
Nodes searched  : 5530620
Nodes/second    : 1318069

So even without any 'secret sauce'-knowledge, the M1 is considerably faster than comparable Apple Intel systems.
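To put "considerably faster" into numbers, here is a minimal sketch that computes the relative speed of the three runs. The nodes/second values are copied from the bench outputs above; the `speedup` helper and the dictionary labels are only illustrative names.

```python
# Nodes/second figures copied from the three single-threaded bench runs above.
bench_nps = {
    "Mac mini M1": 2480098,
    "iMac 2017 (Core i5, 3.8 GHz)": 1718117,
    "MacBook Pro 2020 (Core i5, 2 GHz)": 1318069,
}


def speedup(nps_a: int, nps_b: int) -> float:
    """Return how many times faster machine A is than machine B."""
    if nps_b <= 0:
        raise ValueError("nps_b must be positive")
    return nps_a / nps_b


m1 = bench_nps["Mac mini M1"]
for name, nps in bench_nps.items():
    if name != "Mac mini M1":
        print(f"M1 vs {name}: {speedup(m1, nps):.2f}x")
```

With these figures the M1 comes out at roughly 1.44x the iMac and 1.88x the MacBook Pro on this particular bench.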

Sopel97 commented 3 years ago

@domschl what command did you use

domschl commented 3 years ago

I built Stockfish from the master branch with ARCH=apple-silicon for the M1 and ARCH=x86-64-modern for the Intel Macs. Then from the command line:

$ stockfish bench

That's not a sophisticated benchmark, but it gives a quick first impression of the relative performance between systems.

NightlyKing commented 3 years ago

Taking an average speed of multiple benchmark runs is a very reliable predictor of elo gains for non-functional speedup patches so I'd challenge your claim that it isn't a good benchmark.
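A minimal sketch of that averaging, assuming each run's output has already been captured as text in the format shown earlier in this thread (a `Nodes/second    : N` line). The helper names `parse_nps` and `average_nps` are made up for illustration; how the engine is launched and its output captured is left out on purpose.

```python
import re
from statistics import mean

# Matches the summary line printed at the end of a bench run,
# e.g. "Nodes/second    : 2480098".
NPS_PATTERN = re.compile(r"Nodes/second\s*:\s*(\d+)")


def parse_nps(bench_output: str) -> int:
    """Extract the nodes/second figure from one captured bench output."""
    match = NPS_PATTERN.search(bench_output)
    if match is None:
        raise ValueError("no 'Nodes/second' line found")
    return int(match.group(1))


def average_nps(outputs: list[str]) -> float:
    """Average the nodes/second figure over several captured bench runs."""
    if not outputs:
        raise ValueError("need at least one bench output")
    return mean(parse_nps(text) for text in outputs)
```

Averaging over several runs smooths out noise from thermal throttling or background load, which matters most when comparing two builds on the same machine.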

@ASM-Master Stockfish may not be fully optimized for M1 processors - but even if it was it would still be rather underwhelming - as in you'll have to realize it's not "much better at everything". That's to be expected from RISC chips trying to run programs heavily using complex instructions. The title "Apple M1 chip is much better at everything" is, frankly speaking, delusional. We can agree that SF isn't optimized for M1 chips - but can we please leave the "my preferred brand is better than your preferred brand" mentality behind? Either submit a patch that improves the situation or don't. This project relies on pro bono work and people like to work on what they are good at.

Sopel97 commented 3 years ago

@domschl what version of stockfish? That is not possible with the current master on one thread. Just hard for me to believe these are correct.

domschl commented 3 years ago

I've checked out stockfish master today for all three tests. What's unusual about it?

$ ./stockfish 
Stockfish 110621 by the Stockfish developers (see AUTHORS file)

For the M1 test I have used the latest betas of macOS 12 and Clang 13:

clang --version
Apple clang version 13.0.0 (clang-1300.0.18.6)
Target: arm64-apple-darwin21.0.0
Thread model: posix

The Intel versions were built with release Clang 12 on macOS 11.

Sopel97 commented 3 years ago

@domschl there's no cpu currently that would give 2480098 on a single core without overclocking. Just hard for me to believe these are correct.

NightlyKing commented 3 years ago

@Sopel97 My 5800X without any OC done by me (default mainboard settings) gets 2.6+ Mnps, but that's a much beefier CPU than the M1 chips.

domschl commented 3 years ago

noobpwnftw commented 3 years ago

AMD Ryzen Threadripper 3990X 64-Core Processor. So, about as fast?

noobpwnftw commented 3 years ago

One core, that is. Still good to know that it is just as fast. Then why all the whining about it being slow?

domschl commented 3 years ago

Being slow? I don't know. Misunderstanding?

Multithreading performance (4 performance cores + 4 low-power cores):

uci
...
uciok
setoption name Hash value 10000
setoption name Threads value 8
go

Not too bad for a mobile CPU that is also in the current iPad.

vondele commented 3 years ago

so, let me close the issue. This looks good. If somebody comes up with better code, specific for M1, we'll happily take a look.

domschl commented 3 years ago

@vondele: In my last remark I made a mistake (edited now): I wrote that multithreading performance was 9-10Mnps. That was a misreading; the shot shows 0.9-1Mnps.

I did some more testing: the performance on M1 drops drastically if too much memory is requested for Hash. macOS seems to virtualize memory if more than 5-6GB of hash is requested on a 16GB machine. If hash is about 2-4GB, performance is around nps 5735512 hashfull 987, so 5-6Mnps. That's not revolutionary, but still better than my Intel machines.

So one reason for perceived bad performance with M1 can be requesting too much hash memory.
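For anyone who wants to watch these two values while experimenting with hash sizes, here is a small sketch that pulls `nps` and `hashfull` out of a UCI `info` line. The function names are hypothetical and the sample line in the usage note is made up around the two values quoted above; per the UCI protocol, `hashfull` is reported in permille, so 987 means 98.7% full.

```python
def parse_info_fields(line: str, keys=("nps", "hashfull")) -> dict:
    """Return the integer values that follow the given keys in a UCI info line."""
    tokens = line.split()
    fields = {}
    # Each key is followed directly by its value, e.g. "... nps 5735512 ...".
    for i, token in enumerate(tokens[:-1]):
        if token in keys and tokens[i + 1].isdigit():
            fields[token] = int(tokens[i + 1])
    return fields


def hashfull_percent(hashfull_permille: int) -> float:
    """Convert the UCI hashfull value (permille) to a percentage."""
    return hashfull_permille / 10


# Illustrative line; only the nps and hashfull values come from the thread.
sample = "info depth 30 nps 5735512 hashfull 987"
fields = parse_info_fields(sample)
print(fields["nps"], hashfull_percent(fields["hashfull"]))
```

A sharp drop in `nps` after raising Hash, with everything else unchanged, would be the symptom described above.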

osteslag commented 3 years ago

Curiously, @domschl, this is the bench for sf_14, nn-3475407dc199 on my iPhone 12 Pro:

Total time (ms) : 8085
Nodes searched  : 4770936
Nodes/second    : 590097

Note that nodes searched is only 4770936, compared to 5530620 in your case. But still fairly good numbers on a handheld device, IMO.

vondele commented 3 years ago

@domschl that's interesting, so maybe macos doesn't have linux' transparent huge pages or similar? For windows we have specific code in misc.cpp (aligned_large_pages_alloc_windows).

Edit: how much is the nps drop if the hash is too large?

domschl commented 3 years ago

I've encountered other cases of memory allocation / performance degradation issues on Apple M1 and filed a bug report with Apple. Investigation (and possible mitigations) is still ongoing with them. I'll update here as soon as I've learned something new.

On Sun, 15 Aug 2021 at 17:55, Joost VandeVondele wrote:

> @domschl that's interesting, so maybe macos doesn't have linux' transparent huge pages or similar? For windows we have specific code in misc.cpp (aligned_large_pages_alloc_windows).


maximmasiutin commented 1 year ago

Intel is faster for Stockfish even on old-generation, ultra-low-power thin-notebook processors such as the Intel i7-1065G7 released in 2019. The advantage holds even when Stockfish is compiled for plain "x86-64", i.e. without intrinsics, vector operations, or other CPU-specific instructions. The Apple M1 may be better at very low power consumption levels, but for a Mac mini that is connected to mains power and has a 39 W maximum power consumption, power saving is not an issue. At power levels as high as 39 W, the Intel i7-1065G7 clearly outperforms the Apple M1 in combined performance, because on that Intel all 8 threads are fast, while on the Apple M1 only 4 threads are fast and the other 4 are slow.

Command: bench 12288 8 2000 default movetime

Nodes/second:

Mac mini M1 16G RAM, "apple-silicon"          :  3667533
Intel i5-12500, "x86-64"                      : 16941748
Intel i7-1065G7 thin laptop, "x86-64-vnni256" :  7168818
Intel i7-1065G7 thin laptop, "x86-64"         :  6059999

Update: @domschl correctly pointed out that I allocated too much memory for Stockfish, which caused slowdowns. When I allocated less memory, the speed improved about 4.5-fold (from 3667533 to 16565531 nodes/second), see https://github.com/official-stockfish/Stockfish/issues/3529#issuecomment-1467027614

gsobala commented 1 year ago

Well, I just did that on an M1 MacBook Pro and got 19409471, pulling 40W.

maximmasiutin commented 1 year ago

My Mac mini M1 has 4 fast cores and 4 slow cores. When I ran Stockfish on the Mac mini with 8 threads, the overall NPS (nodes per second) was lower than on Intel processors that had just 4 fast cores with hyperthreading (8 threads). But if I run Stockfish on that Mac mini with just 4 threads, it uses only the fast cores, and per-core performance is higher than that of Intel; see the attached screenshot, third line (OS: Darwin). It is in the top 3 (out of 306 nodes) at FishTest https://tests.stockfishchess.org/tests

[Screenshot: darwin]

It is more than 2 million nodes per second per core, which outperforms most Intel processors on that metric.

maximmasiutin commented 1 year ago

@gsobala - this number (19409471) at 40W is quite impressive!

domschl commented 1 year ago

As already mentioned: the bad benchmark results for M1 mini can be caused by trying to allocate more hash than macOS is willing to give as physical memory, resulting in virtual memory being used for hash, which causes very bad performance. Simply decrease hash size for M1, and check 'activity monitor' for VM usage. Decreasing hash until only physical memory is used improves performance drastically.
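As a rough starting point for "decrease hash size", here is a sketch of a sizing helper. The 25% fraction is purely an assumption, chosen because the earlier report in this thread saw good results with 2-4GB of hash on a 16GB machine and trouble above 5-6GB; it is not a documented macOS limit, and both function names are hypothetical. The emitted command uses the same `setoption name Hash value` syntax shown earlier in the thread.

```python
def suggested_hash_mb(physical_ram_mb: int, fraction: float = 0.25,
                      floor_mb: int = 16) -> int:
    """Suggest a Hash size in MB as a fraction of physical RAM.

    The default fraction is a guess based on this thread's reports,
    not a measured or documented threshold.
    """
    if physical_ram_mb <= 0:
        raise ValueError("physical_ram_mb must be positive")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    return max(floor_mb, int(physical_ram_mb * fraction))


def set_hash_command(hash_mb: int) -> str:
    """Build the UCI command that sets the Hash option."""
    return f"setoption name Hash value {hash_mb}"


# 16 GB machine, as in the reports above.
print(set_hash_command(suggested_hash_mb(16 * 1024)))
```

Whatever value the helper suggests still needs to be checked against actual memory pressure on the machine, as described above.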

Sopel97 commented 1 year ago

The benchmarks at https://ipmanchess.yolasite.com/amd--intel-chess-bench-stockfish.php were done with 1GB of hash, though before some NEON optimizations were made. The default bench uses much less. I find it unlikely that someone would use more for a benchmark. The hash size in maximmasiutin's bench is high but within limits; if the OS is the issue then it is the issue, the RAM is there.

gsobala commented 1 year ago

Using the latest Stockfish build I get 14511422 on a 10-thread apple-silicon test as per ipmanchess's settings, a marked improvement from the 12500000 in the table.

The key difference between these benches and those quoted in the thread above is that ipmanchess tests NNUE performance, whilst the default Stockfish bench is of course mixed.

Sopel97 commented 1 year ago

Mixed bench is generally useless because it uses setoption name use nnue value false for half of the positions, and that's an ancient use case. NNUE bench is more in line with practical performance. Ipman's bench command is a good, practical benchmark, though obviously on a somewhat dated version; many changes have been made since.

maximmasiutin commented 1 year ago

> As already mentioned: the bad benchmark results for M1 mini can be caused by trying to allocate more hash than macOS is willing to give as physical memory, resulting in virtual memory being used for hash, which causes very bad performance. Simply decrease hash size for M1, and check 'activity monitor' for VM usage. Decreasing hash until only physical memory is used improves performance drastically.

@domschl - I had 16G of physical memory, so I allocated 12G for the hash. As you suggested, I re-ran the tests with the same command except with the hash size set to 0.5G, even though the "hashfull" never reached 1000 (100%). To be specific, the command was "bench 512 8 2000 default movetime". The results were the following on that Mac mini:

Total time (ms) : 90738
Nodes searched  : 1503123231
Nodes/second    : 16565531

This time, performance rose significantly, outperforming the Intel i7-1065G7 laptop by more than a factor of two, and the speed was almost the same as that of the Intel i5-12500 desktop CPU, which has a higher TDP (base: 65 W, max turbo: 117 W, and that is for the CPU only, not counting the chipset, etc.).

Thank you for your tip!

frefrik commented 1 year ago

M2 Pro, 12‑core CPU, 32GB memory. Using command bench 512 8 2000 default movetime:

Total time (ms) : 90548
Nodes searched  : 2218332923
Nodes/second    : 24498972
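When copying bench summaries around, a quick consistency check helps catch transcription slips. The sketch below assumes nodes/second is nodes times 1000 divided by elapsed milliseconds, rounded down; that rounding is an assumption, although the summaries posted in this thread all agree with it. The function name and labels are illustrative.

```python
def nodes_per_second(nodes: int, elapsed_ms: int) -> int:
    """Recompute nodes/second from a bench summary, rounding down."""
    if elapsed_ms <= 0:
        raise ValueError("elapsed_ms must be positive")
    return nodes * 1000 // elapsed_ms


# (label, total time in ms, nodes searched, reported nodes/second),
# all copied from summaries posted in this thread.
reported = [
    ("Mac mini M1, 512 MB hash", 90738, 1503123231, 16565531),
    ("M2 Pro, 512 MB hash", 90548, 2218332923, 24498972),
]

for label, elapsed_ms, nodes, nps in reported:
    status = "ok" if nodes_per_second(nodes, elapsed_ms) == nps else "MISMATCH"
    print(f"{label}: {status}")
```

Note that the two runs used the same movetime limit, so the higher node count on the M2 Pro directly reflects the higher speed.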

H1a8 commented 8 months ago

> > As already mentioned: the bad benchmark results for M1 mini can be caused by trying to allocate more hash than macOS is willing to give as physical memory, resulting in virtual memory being used for hash, which causes very bad performance. Simply decrease hash size for M1, and check 'activity monitor' for VM usage. Decreasing hash until only physical memory is used improves performance drastically.
>
> @domschl - I had 16G of physical memory, so I allocated 12G for the hash. As you suggested, I re-ran the tests with the same command except with the hash size set to 0.5G, even though the "hashfull" never reached 1000 (100%). To be specific, the command was "bench 512 8 2000 default movetime". The results were the following on that Mac mini:
>
> Total time (ms) : 90738
> Nodes searched  : 1503123231
> Nodes/second    : 16565531
>
> This time, performance rose significantly, outperforming the Intel i7-1065G7 laptop by more than a factor of two, and the speed was almost the same as that of the Intel i5-12500 desktop CPU, which has a higher TDP (base: 65 W, max turbo: 117 W, and that is for the CPU only, not counting the chipset, etc.).
>
> Thank you for your tip!

I believe you used the wrong command. You left out "nnue". Also, you should use 12 cores, not 8. The command should be bench 512 12 26 default depth nnue

Can you please try that and report back? Make sure you use the same binary, to be consistent with the ipman chess bench ranking list.

Stockfish 14.1 M1 pop-neon can be found on this site under Apple/Mac: https://ipmanchess.yolasite.com/amd--intel-chess-bench-stockfish.php
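Since several different bench invocations appear in this thread, here is a small sketch that builds the command from named fields, which makes it harder to mix up hash size, thread count, and limit. The positional order (hash in MB, threads, limit, position set, limit type, optional eval type) is inferred from the commands quoted in this thread; the class name, field names, and default values are assumptions for illustration, and the trailing eval type applies only to versions that accept it, such as the 14.1 build mentioned above.

```python
from dataclasses import dataclass


@dataclass
class BenchArgs:
    """Named fields for a Stockfish bench command (names are illustrative)."""
    hash_mb: int = 16
    threads: int = 1
    limit: int = 13
    fen_file: str = "default"
    limit_type: str = "depth"
    eval_type: str = ""  # e.g. "nnue"; left out of the command when empty

    def command(self) -> str:
        parts = ["bench", str(self.hash_mb), str(self.threads),
                 str(self.limit), self.fen_file, self.limit_type]
        if self.eval_type:
            parts.append(self.eval_type)
        return " ".join(parts)


# The depth-limited NNUE command suggested above.
print(BenchArgs(hash_mb=512, threads=12, limit=26, eval_type="nnue").command())
# The movetime-limited command used earlier in the thread.
print(BenchArgs(hash_mb=512, threads=8, limit=2000, limit_type="movetime").command())
```

Results from a depth-limited and a movetime-limited run are not directly comparable, so it is worth recording the full command next to every number posted.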
