official-stockfish / Stockfish

A free and strong UCI chess engine
https://stockfishchess.org/
GNU General Public License v3.0

Stockfish binary distributables question #3588

Closed ghost closed 3 years ago

ghost commented 3 years ago

According to the official Download Stockfish for Windows page (https://stockfishchess.org/download/windows/), the Windows binary "stockfish_13_win_x64_modern.zip" is compiled as "SSE4.1 + POPCNT: Intel processors after ~2008, AMD processors after ~2011". However, the binary itself reports that SSE41 support was not enabled in the Makefile during the build; only the 64bit SSSE3 SSE2 POPCNT flags were used:

c:\temp\>stockfish_13_win_x64_modern.exe
Stockfish 13 by the Stockfish developers (see AUTHORS file)
compiler

Compiled by g++ (GNUC) 7.3.0 on MinGW64
Compilation settings include:  64bit SSSE3 SSE2 POPCNT
__VERSION__ macro expands to: 7.3-posix 20180312

In my humble opinion, either the web page should read "SSSE3 + POPCNT: Intel processors after ~2008, AMD processors after ~2011", or the build process has to be corrected to use the SSE41 flag as well (currently this flag is only set in the BMI2 and AVX2 builds).
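The "Compilation settings include" line reflects preprocessor macros that the build passes to the compiler, which is why it can disagree with a label on a web page. A minimal sketch of that mechanism follows. Only USE_SSE41 and USE_AVX2 are named in this thread; the other macro names and the function name compilation_settings are assumptions for illustration, not a copy of Stockfish's code.

```cpp
#include <string>

// Hypothetical helper: builds the flag list from macros set by the build
// system (e.g. -DUSE_SSE41). Macro names other than USE_SSE41 and USE_AVX2
// are assumptions for illustration.
std::string compilation_settings() {
    std::string s;
#if defined(IS_64BIT)
    s += " 64bit";
#else
    s += " 32bit";
#endif
#if defined(USE_AVX2)
    s += " AVX2";
#endif
#if defined(USE_SSE41)
    s += " SSE41";
#endif
#if defined(USE_SSSE3)
    s += " SSSE3";
#endif
#if defined(USE_SSE2)
    s += " SSE2";
#endif
#if defined(USE_POPCNT)
    s += " POPCNT";
#endif
    return s;
}
```

With only IS_64BIT, USE_SSSE3, USE_SSE2 and USE_POPCNT defined, this sketch produces " 64bit SSSE3 SSE2 POPCNT", the same shape as the console output quoted above.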

NightlyKing commented 3 years ago

SSE4.1 isn't a speedup over SSE3 in our use case. It should be removed until someone actually finds a use for it. Website could be updated.

ghost commented 3 years ago

I just had a quick look at the current code and the only places where USE_SSE41 is used are in src/nnue/*. But only when USE_AVX2 is not set.

However, in the only two 64-bit build configurations where USE_SSE41 is currently set (Windows x64 for Haswell CPUs and Windows x64 for modern computers + AVX2), the flag USE_AVX2 is also set at the same time.

This means that the USE_SSE41 flag - for 64-bit at least - does not produce any code, as it is virtually overwritten by USE_AVX2.

Note: As far as I can see, there is just one 'x86-32-sse41-popcnt' build target defined in the Makefile, where USE_SSE41 is then actually used. By the way, this is not the generic '32-bit: Maximally compatible but slow' x86-32 build offered on the download page.
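The precedence argument in this comment can be illustrated with a small preprocessor chain. This is a sketch under the assumption that the SSE4.1 code sits in an #elif branch after an AVX2 branch; the function name selected_kernel is hypothetical, and whether every use of USE_SSE41 in src/nnue has this shape is the commenter's reading, not something this example proves.

```cpp
#include <string>

// Both macros defined, as the comment says is the case for the AVX2 builds.
#define USE_AVX2
#define USE_SSE41

// Hypothetical kernel selector: the first matching branch wins, so the
// SSE4.1 branch is never compiled while USE_AVX2 is also defined.
std::string selected_kernel() {
#if defined(USE_AVX2)
    return "avx2";
#elif defined(USE_SSE41)
    return "sse41";
#else
    return "generic";
#endif
}
```

Removing the USE_AVX2 definition makes the same function return "sse41", which corresponds to the 32-bit target mentioned in the note.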

joergoster commented 3 years ago

Not sure I understand, but my modern build states it uses SSE41.

Compiled by g++ (GNUC) 9.3.0 on Linux
Compilation settings include:  64bit SSE41 SSSE3 SSE2 POPCNT
__VERSION__ macro expands to: 9.3.0

ghost commented 3 years ago

The point I am making is that the build flag SSE41 is set in the Makefile, but it is completely useless in all 64-bit builds. It might therefore as well be removed from all 64-bit configurations in the Makefile, as I suggested in https://github.com/official-stockfish/Stockfish/pull/3589. In my opinion the resulting binaries will be identical (except where the compiler flag "SSE41" is printed on the console).

joergoster commented 3 years ago

So these specializations are no longer needed? https://github.com/official-stockfish/Stockfish/blob/master/src/nnue/nnue_feature_transformer.h#L281-L283

ghost commented 3 years ago

As I understand it, they will never be used in 64-bit targets, but there is currently one 32-bit target where they are used: 'x86-32-sse41-popcnt'. So I would not remove them from 'nnue_feature_transformer.h' just yet.

Sopel97 commented 3 years ago

Joerg is correct.

This means that the USE_SSE41 flag - for 64-bit at least - does not produce any code, as it is virtually overwritten by USE_AVX2.

That's not true. There are processors where AVX2 is not available but SSE4.1 is.

JavaMast commented 3 years ago

[Images: Screenshot_152, Screenshot_153]

Intel Core i5 760: [images: bench1, bench-2]

Intel Core i5-7600K: [images: bench-3, bench-4]

Intel E8500: [image: bench-5]

Intel Core i7-720QM: [image: bench-6]

ghost commented 3 years ago

Before we get into which combinations of flags anyone can compile on their own development machine, let's stick with the official Stockfish 13 binaries. The original question about the flags referred to the official Stockfish 13 binary distributables, which users worldwide get from the official download sites.

To make clear what I mean, I've written out the compiler flag combinations these binaries were actually built with (as reported by the binaries themselves):

Compiled by g++ (GNUC) 7.5.0 on Linux
__VERSION__ macro expands to: 7.5.0

stockfish_13_linux_x64_bmi2:   64bit BMI2 AVX2 SSE41 SSSE3 SSE2 POPCNT
stockfish_13_linux_x64_avx2:   64bit      AVX2 SSE41 SSSE3 SSE2 POPCNT
stockfish_13_linux_x64_modern: 64bit                 SSSE3 SSE2 POPCNT
stockfish_13_linux_x64_ssse:   64bit                 SSSE3 SSE2
stockfish_13_linux_x64:        64bit                       SSE2

Compiled by g++ (GNUC) 7.3.0 on MinGW64/MinGW32
__VERSION__ macro expands to: 7.3-posix 20180312

stockfish_13_win_x64_bmi2:     64bit BMI2 AVX2 SSE41 SSSE3 SSE2 POPCNT
stockfish_13_win_x64_avx2:     64bit      AVX2 SSE41 SSSE3 SSE2 POPCNT
stockfish_13_win_x64_modern:   64bit                 SSSE3 SSE2 POPCNT
stockfish_13_win_x64_ssse:     64bit                 SSSE3 SSE2
stockfish_13_win_x64:          64bit                       SSE2
stockfish_13_win_32bit:        32bit MMX

Now, do we agree on the fact that all four occurrences of SSE41 in the officially created binaries for Linux and Windows are nonsense, because those are the AVX2 versions in the first place and setting SSE41 is useless in that case (see actual Stockfish source code, as explained above)?

Sopel97 commented 3 years ago

Perhaps we should consider dropping even more archs? AVX2 is supported pretty much since 2011. There are only a few processors that support SSSE3 but don't support popcount. We could reduce the set to 32-bit, 64-bit, [SSE2, ]SSSE3+POPCNT, AVX2, BMI2, VNNI256, AVX512, VNNI512. All this with transitivity, so no builds with POPCNT but without SSSE3, or with 32-bit SSE support. Performance below AVX2 is not very good, so we shouldn't care much either way; these people don't care about performance.
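The "transitivity" idea, where each arch implies every lower one, can be sketched as an ordered list of tiers with cumulative requirements. The feature bits, tier names and the function select_tier below are hypothetical, and only the lower part of the proposed list is modelled.

```cpp
#include <cstdint>
#include <string>

// Hypothetical feature bits; names and values are for illustration only.
enum Feature : std::uint32_t {
    F_SSSE3  = 1u << 0,
    F_POPCNT = 1u << 1,
    F_AVX2   = 1u << 2,
    F_BMI2   = 1u << 3,
};

struct Tier {
    const char*   name;
    std::uint32_t required;  // cumulative: includes every lower tier's bits
};

// Ordered from most to least demanding.
const Tier kTiers[] = {
    {"bmi2",         F_SSSE3 | F_POPCNT | F_AVX2 | F_BMI2},
    {"avx2",         F_SSSE3 | F_POPCNT | F_AVX2},
    {"ssse3-popcnt", F_SSSE3 | F_POPCNT},
    {"x86-64",       0},
};

// Picks the most demanding tier whose full requirement set the CPU meets.
std::string select_tier(std::uint32_t cpu_features) {
    for (const Tier& t : kTiers)
        if ((cpu_features & t.required) == t.required)
            return t.name;
    return "x86-64";
}
```

Because the requirement masks are cumulative, a CPU that reports POPCNT but not SSSE3 falls through to the plain 64-bit tier, which is the "no builds with POPCNT and not SSSE3" rule.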

ghost commented 3 years ago

I find it particularly funny, by the way, that it's the versions that don't use AVX2 and don't use SSE41 that are called "modern". After all, these are the CPUs that are over 10 years old.

JavaMast commented 3 years ago

@Sopel97 BTW, VNNI builds is very slow now https://github.com/official-stockfish/Stockfish/issues/3457

Sopel97 commented 3 years ago

@Sopel97 BTW, VNNI builds is very slow now #3457

I'm aware of that, and I have some understanding why that is. Fixing it is an option but requires a partial revert that would blow the size of the code a bit. It's on my radar but I want to get the current drama, and the overall ARCH situation settled.

Sopel97 commented 3 years ago

AVX2 is supported pretty much since 2011.

2013, and even then, not by Intel. They're still actively making processors with no AVX support of any kind.

that falls into "we don't care whether it's fast for these people" bucket

Sopel97 commented 3 years ago

You're trying to push it to an extreme in two ways here to support your argument.

  1. "only works well". The differences between SSSE3 and SSE4 are small. SSSE3 would still be supported, so I see no issue. There is only like 1 obscure AMD CPU that required the stockfish_13_win_x64_ssse build. MMX is obsolete.
  2. You think people who can reach 2M nps at most care about whether it's 2.2M or 1.8M? Most of them are probably using lichess with 300k nps or whatever anyway.

btw. Steam hardware survey from May 2021: https://store.steampowered.com/hwsurvey/Steam-Hardware-Software-Survey-Welcome-to-Steam [two images of survey tables]

I'm being very generous with wanting to support SSE2 and 32bit. I'm the 2% minority in both of these tables.

ghost commented 3 years ago

Here is another "the 100 most common CPUs" statistic (as of 27-Jun-2021) from https://cpu.userbenchmark.com/

  # CPU                   Mkt.     Age
                          share %  months

  1 AMD Ryzen 5 3600      4.5      23
  2 AMD Ryzen 7 3700X     2.93     23
  3 AMD Ryzen 5 5600X     2.51      7
  4 AMD Ryzen 7 5800X     1.88      7
  5 AMD Ryzen 5 2600      1.78     38
  6 Intel Core i7-10750H  1.45     16+
  7 Intel Core i7-9700K   1.43     32
  8 AMD Ryzen 9 5900X     1.34      7
  9 Intel Core i9-9900K   1.3      32
 10 Intel Core i7-8700K   1.23     44
 11 Intel Core i5-9400F   1.19     28
 12 Intel Core i7-10700K  1.13     13
 13 AMD Ryzen 7 2700X     1.14     38
 14 AMD Ryzen 9 3900X     1.05     23
 15 Intel Core i7-9750H   1.01     26+
 16 Intel Core i7-7700HQ  1.0      54+
 17 AMD Ryzen 5 3600X     0.96     23
 18 Intel Core i5-10400F  0.96     13
 19 Intel Core i7-8750H   0.91     39+
 20 Intel Core i7-7700K   0.93     53
 21 Intel Core i7-6700K   0.92     70
 22 Intel Core i5-9600K   0.83     32
 23 Intel Core i7-8700    0.82     44
 24 AMD Ryzen 5 2600X     0.71     38
 25 Intel Core i5-8250U   0.65     45+
 26 Intel Core i5-10300H  0.66     17+
 27 Intel Core i7-3770    0.64    111
 28 Intel Core i7-4790K   0.67     84
 29 Intel Core i7-4790    0.64     85
 30 Intel Core i5-7200U   0.62     59+
 31 AMD Ryzen 7 3800X     0.62     23
 32 Intel Core i9-10850K  0.62     10+
 33 AMD Ryzen 9 5950X     0.62      7
 34 Intel Core i5-8400    0.66     44
 35 Intel Core i7-7700    0.6      53
 36 Intel Core i7-6700HQ  0.58     69+
 37 Intel Core i5-7400    0.55     53
 38 Intel Core i7-6700    0.58     69
 39 Intel Core i5-3470    0.54    110
 40 Intel Core i9-10900K  0.56     13
 41 AMD Ryzen 7 4800H     0.54     16+
 42 AMD Ryzen 5 3400G     0.54     24+
 43 AMD Ryzen 7 5800H     0.51      5+
 44 Intel Core i5-6500    0.48     69
 45 Intel Core i7-10700   0.49     13
 46 Intel Core i5-4460    0.5      85
 47 Intel Core i5-9300H   0.46     27+
 48 AMD Ryzen 3 3200G     0.48     24+
 49 Intel Core i5-1135G7  0.46     9+
 50 AMD FX-8350           0.45    104
 51 AMD Ryzen 5 1600      0.48     50
 52 AMD Ryzen 5 4600H     0.45     14+
 53 Intel Core i5-8300H   0.46     41+
 54 Intel Core i7-1165G7  0.45     12+
 55 Intel Core i7-11700K  0.45      3
 56 Intel Core i5-6600K   0.43     70
 57 Intel Core i5-10400   0.44     13
 58 AMD Ryzen 3 2200G     0.41     40+
 59 Intel Core i5-10600K  0.42     13
 60 Intel Core i7-8550U   0.41     46+
 61 Intel Core i7-2600    0.4     125
 62 Intel Core i5-4590    0.36     83
 63 Intel Core i3-9100F   0.41     26
 64 AMD FX-6300           0.4     104
 65 Intel Core i5-2400    0.4     125
 66 Intel Core i5-8600K   0.4      44
 67 Intel Core i7-4770    0.38     99
 68 AMD Ryzen 5 3500U     0.4      26+
 69 Intel Core i5-6200U   0.37     68+
 70 Intel Core i7-10700F  0.37     12+
 71 Intel Core i7-7500U   0.38     57+
 72 Intel Core i5-4570    0.35     98
 73 AMD Ryzen 7 2700      0.38     38
 74 AMD Ryzen 5 1600AF    0.37     34+
 75 Intel Core i5-6400    0.36     69
 76 Intel Core i5-1035G1  0.35     21+
 77 Intel Core i5-8265U   0.34     34+
 78 AMD Ryzen 5 3500X     0.34     20
 79 Intel Core i5-7300HQ  0.34     53+
 80 AMD Ryzen 5 3550H     0.32     28+
 81 Intel Core i5-6300U   0.3      73+
 82 Intel Core i5-4690K   0.31     85
 83 Intel Core i5-10210U  0.31     24+
 84 Intel Core i7-9700    0.32     24+
 85 Intel Core i5-7500    0.3      53
 86 Intel Core i7-10700KF 0.29     12+
 87 AMD Ryzen 9 5900HS    0.28      4+
 88 Intel Core i5-5200U   0.27     77+
 89 Intel Core i5-7600K   0.27     53
 90 Intel Core i7-10870H  0.27      8+
 91 Intel Core i7-10875H  0.28     14+
 92 Intel Core i7-10510U  0.26     21+
 93 AMD Ryzen 3 3100      0.27     13+
 94 AMD Ryzen 5 2400G     0.27     40+
 95 AMD Ryzen 7 1700      0.26     51
 96 Intel Core i5-3570    0.26     90+
 97 Intel Core i7-3770K   0.25    110
 98 Intel Core i9-11900K  0.26      3
 99 Intel Core i5-3570K   0.26    110
100 Intel Core i5-11600K  0.26      2

ghost commented 3 years ago

There is another problem with the current Stockfish binary in this context. It is entirely static and crashes on me without any error message when I start, e.g., a BMI2 version on an Intel Core i5-3450 @ 3.10GHz (Ivy Bridge). Only in the debugger can you see what's going on: "Unhandled exception at 0x0000013F1D32B5 in stockfish.exe: 0xC000001D: Illegal Instruction." A "normal user" would be completely overwhelmed by this: the program simply crashes without displaying any message as to why. No software should behave like this in 2021, in my humble opinion. It's really not that hard to check the processor capabilities at startup, i.e. at runtime (and not at compile time); this very simple GitHub project shows how to do it: https://github.com/Mysticial/FeatureDetector
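A startup check of the kind described here could look roughly like the following. This is a sketch that relies on the GCC/Clang builtins __builtin_cpu_init and __builtin_cpu_supports on x86; it is not the approach of the linked project, and the helper names cpu_has and missing_features are hypothetical.

```cpp
#include <string>
#include <vector>

// True if the running CPU reports the named feature. Unknown names and
// unsupported compilers/platforms conservatively report false.
bool cpu_has(const std::string& feature) {
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    // The builtin requires a string literal, hence the explicit chain.
    if (feature == "ssse3")  return __builtin_cpu_supports("ssse3") != 0;
    if (feature == "sse4.1") return __builtin_cpu_supports("sse4.1") != 0;
    if (feature == "popcnt") return __builtin_cpu_supports("popcnt") != 0;
    if (feature == "avx2")   return __builtin_cpu_supports("avx2") != 0;
    if (feature == "bmi2")   return __builtin_cpu_supports("bmi2") != 0;
#endif
    return false;
}

// Returns the subset of `required` that the running CPU does not report.
std::vector<std::string> missing_features(const std::vector<std::string>& required) {
    std::vector<std::string> missing;
    for (const std::string& f : required)
        if (!cpu_has(f))
            missing.push_back(f);
    return missing;
}
```

A BMI2 build could call missing_features with its own requirement list at the very start of main, print the missing names and exit instead of faulting. The check only helps if it runs before any optional instruction executes, so the startup code itself would have to be compiled for the baseline instruction set.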

ghost commented 3 years ago

So if I were to think this through and wish for something, this old-fashioned #ifdef #else NNUE C-programming style from the 80's should soon be replaced with real C++ concepts like Strategy Design Patterns as far as the different CPU configurations are concerned. It's already a mess, how else to maintain it in the future? I could make a PR proposal, but I'm not sure if it would be widely accepted and worth the effort. I'm new to this Stockfish community and can't really assess how it works right now...
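As an illustration of what a strategy-style design might look like, here is a minimal sketch with one interface and two interchangeable implementations chosen at runtime. All names (DotProduct, GenericDot, UnrolledDot, make_dot) are hypothetical, and the "unrolled" variant is only a stand-in for a real SIMD kernel.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <string>

// Hypothetical interface: one implementation per CPU configuration.
struct DotProduct {
    virtual ~DotProduct() = default;
    virtual std::string name() const = 0;
    virtual std::int32_t run(const std::int8_t* a, const std::int8_t* b,
                             std::size_t n) const = 0;
};

// Plain loop that works on every CPU.
struct GenericDot final : DotProduct {
    std::string name() const override { return "generic"; }
    std::int32_t run(const std::int8_t* a, const std::int8_t* b,
                     std::size_t n) const override {
        std::int32_t acc = 0;
        for (std::size_t i = 0; i < n; ++i) acc += a[i] * b[i];
        return acc;
    }
};

// Stand-in for a SIMD variant; a real one would use intrinsics internally.
struct UnrolledDot final : DotProduct {
    std::string name() const override { return "unrolled"; }
    std::int32_t run(const std::int8_t* a, const std::int8_t* b,
                     std::size_t n) const override {
        std::int32_t acc = 0;
        std::size_t i = 0;
        for (; i + 4 <= n; i += 4)
            acc += a[i] * b[i] + a[i + 1] * b[i + 1]
                 + a[i + 2] * b[i + 2] + a[i + 3] * b[i + 3];
        for (; i < n; ++i) acc += a[i] * b[i];  // remaining elements
        return acc;
    }
};

// Chooses the strategy once, e.g. at startup after a CPU feature check.
std::unique_ptr<DotProduct> make_dot(bool cpu_has_simd) {
    if (cpu_has_simd) return std::unique_ptr<DotProduct>(new UnrolledDot());
    return std::unique_ptr<DotProduct>(new GenericDot());
}
```

A virtual call per inner-loop operation would likely be too costly in an engine, so a design like this would probably select the strategy at a coarse level, for example once per network evaluation.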

NightlyKing commented 3 years ago

I think a single binary that is the fastest on all CPU configs would be very neat. It would certainly avoid confusion for people who just use it as a tool for human v human chess and probably don't know which CPU they have. If you make a PR, keep in mind that on Zen and Zen2 bmi2 is a slowdown, while on Zen3 it is natively implemented and is faster. Also, during compilation there should be an argument to restrict it to, say, bmi2 and 'lower', since most CPUs are incapable of running a profile-guided (PGO) build for AVX512.
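The Zen caveat means a runtime selector cannot simply pick the BMI2 path whenever the CPU reports BMI2. A small sketch of such a policy follows. The function name prefer_pext is hypothetical, and the threshold is an assumption based on Zen, Zen+ and Zen 2 reporting CPUID family 0x17 and Zen 3 reporting family 0x19; the inputs would come from a CPUID query that is not shown.

```cpp
// Hypothetical policy: use the BMI2 (PEXT) code path only where it is fast.
// The family threshold is an assumption: AMD families below 0x19 are
// treated as having slow PEXT/PDEP.
bool prefer_pext(bool has_bmi2, bool is_amd, unsigned family) {
    if (!has_bmi2) return false;                // instruction not available
    if (is_amd && family < 0x19) return false;  // available but slow
    return true;
}
```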

ghost commented 3 years ago

I think a single binary that doesn't crash on any CPU and unleashes the best capabilities of Stockfish without silly user queries would simply be in the spirit of the times and what to expect in 2021. I could suggest a PR for this binary, but also don't want to waste my time when someone who will end up deciding this is already signaling "nah, we don't really need that...".

Could someone who has to decide this please comment? My time is valuable to me... Thanks!

Sopel97 commented 3 years ago

This would require the compiler to support all archs being compiled for, separate cpp files for things that need to be compiled with -mavx2, -mavx512 and so on, issues with neon, and I have no clue what the linker and LTO would do
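For GCC and Clang on x86-64, one possible alternative to separate source files is the per-function target attribute combined with a runtime check. The sketch below sums an integer array that way; the function names are hypothetical, and it does not address the NEON, linker or LTO concerns raised in this comment.

```cpp
#include <cstddef>
#include <cstdint>

#if (defined(__GNUC__) || defined(__clang__)) && defined(__x86_64__)
#include <immintrin.h>
#define HAVE_X86_DISPATCH 1
#endif

// Portable fallback, valid on every CPU.
std::int32_t sum_scalar(const std::int32_t* v, std::size_t n) {
    std::int32_t s = 0;
    for (std::size_t i = 0; i < n; ++i) s += v[i];
    return s;
}

#ifdef HAVE_X86_DISPATCH
// AVX2 code generation is enabled for this one function only.
__attribute__((target("avx2")))
std::int32_t sum_avx2(const std::int32_t* v, std::size_t n) {
    __m256i acc = _mm256_setzero_si256();
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8)
        acc = _mm256_add_epi32(
            acc, _mm256_loadu_si256(reinterpret_cast<const __m256i*>(v + i)));
    std::int32_t lanes[8];
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(lanes), acc);
    std::int32_t s = 0;
    for (int k = 0; k < 8; ++k) s += lanes[k];  // horizontal sum of the lanes
    for (; i < n; ++i) s += v[i];               // remaining elements
    return s;
}
#endif

using SumFn = std::int32_t (*)(const std::int32_t*, std::size_t);

// Picks an implementation once, based on what the running CPU reports.
SumFn pick_sum() {
#ifdef HAVE_X86_DISPATCH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2")) return sum_avx2;
#endif
    return sum_scalar;
}
```

The caller stores the result of pick_sum() once and calls through the pointer afterwards, so the AVX2 function is never entered on a CPU that lacks the instructions.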

ppigazzini commented 3 years ago

I find it particularly funny, by the way, that it's the versions that don't use AVX2 and don't use SSE41 that are called "modern". After all, these are the CPUs that are over 10 years old.

AVX2 was introduced with Haswell in Q2 2013, so Sandy/Ivy Bridge (with only AVX) were the "modern" CPUs when fishtest started operation in Q1 2013.

https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#CPUs_with_AVX2

EDIT_000: fishtest broke the 100 workers milestone only in Q4 2015, so at the time it made sense to speed up every single ARCH, see the "New CPU record!" thread: https://groups.google.com/g/fishcooking/c/lebEmG5vgng/

ghost commented 3 years ago

Absolutely! It still makes sense to accelerate and get the most out of every single ARCH, in my opinion. I can totally relate to how it evolved with "modern", it's always such a thing with designations like "modern", "latest", "best", etc. they are only current for a while and then you have to keep shifting them. Maybe it's time to adjust that, at least for the user versions offered on the official download page.

Sopel97 commented 3 years ago

OK. I get the sentiment that most possible archs should be supported, even if just for a reference. I'll be working on a solution that will please everyone and maybe end this drama.

ghost commented 3 years ago

Does anyone know a reason why in the official Stockfish 14 release the version https://stockfishchess.org/files/stockfish_14_win_x64_modern.zip is again being called "SSE4.1 + POPCNT" although it was compiled without the SSE41 flag? IMHO, it should correctly be named "SSSE3 + POPCNT", as already discussed above, shouldn't it? A quick look at the UCI console is enough to see that this is the case:

C:\Temp\stockfish_14_win_x64_modern\stockfish_14_win_x64_modern>stockfish_14_x64_modern.exe
Stockfish 14 by the Stockfish developers (see AUTHORS file)
compiler

Compiled by g++ (GNUC) 7.3.0 on MinGW64
Compilation settings include:  64bit SSSE3 SSE2 POPCNT
__VERSION__ macro expands to: 7.3-posix 20180312

In my opinion, the Web Page still needs to be corrected and say "SSSE3 + POPCNT" instead of "SSE4.1 + POPCNT". Or am I misunderstanding something?
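The console check used here can be automated by scanning the "Compilation settings include:" line for a flag. A minimal sketch follows; the function name has_build_flag is hypothetical. It compares whole tokens, so asking for "SSE3" does not match "SSSE3".

```cpp
#include <sstream>
#include <string>

// Returns true if `flag` appears as a whole whitespace-separated token
// after the first colon of a "Compilation settings include:" line.
bool has_build_flag(const std::string& line, const std::string& flag) {
    const std::string::size_type colon = line.find(':');
    if (colon == std::string::npos) return false;
    std::istringstream tokens(line.substr(colon + 1));
    std::string tok;
    while (tokens >> tok)
        if (tok == flag) return true;
    return false;
}
```

Applied to the line printed by the binary above, the lookup for "SSE41" fails while "SSSE3" and "POPCNT" are found.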

ppigazzini commented 3 years ago

Does anyone know a reason why in the official Stockfish 14 release the version https://stockfishchess.org/files/stockfish_14_win_x64_modern.zip

There are MacOSX files/folders in that archive.

ppigazzini commented 3 years ago

Abrok file is different but compiled with wrong flags https://abrok.eu/stockfish/builds/773dff020968f7a6f590cfd53e8fd89f12e15e36/win64modern/stockfish_21070214_x64_modern.zip

ghost commented 3 years ago

Doesn't anyone notice that the garbage "__MACOSX" files are included (all of the same size, 231 bytes)? Looks like a broken Makefile or some build job step that archives the artifacts wrongly. I mean, it is "The Official Release"...

ppigazzini commented 3 years ago

The binaries have the same SHA256: CDE329F56CD5EC67FCFD06BCE2984BA54C2F472DB40694654DCE65E4BA394B1F, so it's the abrok build.

The official archive was built on a MacOSX machine adding the folder with the source code.

ghost commented 3 years ago

In any case, since they are the same binaries, they are both compiled with SSSE3 SSE2 POPCNT (see above). Why does the download webpage call this binary "SSE4.1+POPCNT"? I think there is a gap somewhere and the people who create the webpage don't check/know exactly what they write. And nobody notices in QA that some garbage directories from a MacOSX machine are included, too? Am I the first to notice this? Or does everyone else just not care, and people have gotten used to the "__MACOSX*" garbage additions: it's always been there, so it's practically become "normal" in the Official Stockfish Windows distribution. I find it quite confusing. But maybe it's just me...

ppigazzini commented 3 years ago

Mac users are unaware of the problem :) https://perishablepress.com/remove-macosx-ds-store-zip-files-mac/

ghost commented 3 years ago

Well, that's nice for them. So there is virtually no problem at all. And there is also no problem with the version names SSE4.1 etc., because the Web Site People are unaware of the problem, too. ;-)

ghost commented 3 years ago

To summarise: The current Stockfish build structure and naming (unfortunately also in Stockfish 14) are, in my humble opinion, no longer up to date. Furthermore, some of the names on the web page are simply not correct. I propose the following streamlining:

  1. Merging stockfish_win_x64_bmi2 and stockfish_win_x64_avx2 to one build configuration, because both targets are available since the Haswell CPU microarchitecture (06-2013): https://en.wikipedia.org/wiki/Bit_manipulation_instruction_set#Supporting_CPUs / https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#CPUs_with_AVX2. Why offer them separately in two different build configurations?

  2. POPCNT is not a SIMD instruction. It has nothing to do with SSE4.1 per se. The name SSE4.1 + POPCNT on the Web Page is wrong because SSE4.1 instructions are not used in any Stockfish x64 build, as explained in detail above.

  3. The designation "modern" should be dispensed with, as it becomes obsolete faster than it can be updated.

The following 5 instead of 6 Windows builds would be more than sufficient at the moment, in my opinion:

Sopel97 commented 3 years ago

  1. Not possible because BMI2 is microcoded on early ryzens
  2. donno
  3. I agree

daylen commented 3 years ago

Thanks for the good discussion. Regarding these 3 points:

  1. "Merging stockfish_win_x64_bmi2 and stockfish_win_x64_avx2 to one build configuration" As @Sopel97 mentions above, BMI2 is actually pretty slow on Ryzen Zen and Zen 2 architectures, and those CPUs are fairly popular. Wikipedia says: "Excavator through Zen 2 processors implement PEXT and PDEP instructions using microcode resulting in the instructions executing significantly slower than the same behaviour recreated using other instructions" source
  2. "The name SSE4.1 + POPCNT on the Web Page is wrong" I was simply echoing what is in the Makefile: https://github.com/official-stockfish/Stockfish/blob/master/src/Makefile#L690 So we should start by changing that. That being said I agree with your compiler argument so I've updated the website: SSE4.1 + POPCNT is now just POPCNT.
  3. "The designation 'modern' should be dispensed with" I agree--I've renamed the zip and exe.

Regarding the __MACOSX files, that's because I zipped the binaries on a Mac, but at some point I re-zipped them on a Windows virtual machine so they should no longer be present if you download them now.

Stepping back, it sure would be elegant if we could distribute a single binary that automatically executed the fastest codepath for the machine it's running on. Right now, the main download page (https://stockfishchess.org/download/) tries to guess with a "Faster" and a "More compatible" option that target ~86% and ~98.3% of users respectively (by my math), depending on whether the user is running Windows 10 or not (source).

ghost commented 3 years ago

Thanks @daylen, @Sopel97 and all for the improvement and comments. I appreciate that the AMD implementation of BMI2 before Zen 3 prevents the build configurations from being merged. I totally agree, a single binary would simplify a lot for developers and especially for Stockfish users. I have started working on this (see draft pull request https://github.com/official-stockfish/Stockfish/pull/3602), but as I said, I am currently on summer holiday. When I get back to it, I will push the next refinement. I'm also planning to batch build the current Stockfish x86 64-bit binaries (with embedded nnue) with the latest Intel C++ optimizing compiler and Microsoft C++ compiler to compare them with GCC binaries performance-wise. I think Windows users don't care what compiler the Stockfish .exe was built with, and we should offer the fastest possible binaries on the official download page. It also has advantages for TCEC and CCRL, where a few extra percent can make a big difference.