SSE4.1 isn't a speedup over SSE3 in our use case. It should be removed until someone actually finds a use for it. Website could be updated.
I just had a quick look at the current code and the only places where USE_SSE41 is used are in src/nnue/*. But only when USE_AVX2 is not set.
However, in the only two 64-bit build configurations where USE_SSE41 is currently set (Windows x64 for Haswell CPUs and Windows x64 for modern computers + AVX2), the flag USE_AVX2 is also set at the same time.
This means that the USE_SSE41 flag - for 64-bit at least - does not produce any code, as it is virtually overwritten by USE_AVX2.
Note: As far as I can see, there is just one 'x86-32-sse41-popcnt' build target defined in the Makefile, where USE_SSE41 is then actually used. By the way, this is not the generic '32-bit: Maximally compatible but slow' x86-32 build offered on the download page.
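To make the shadowing argument concrete, here is a minimal sketch of how a first-match #if/#elif chain behaves when both flags are defined. It is a simplified stand-in, not the actual code in src/nnue/*; the names kSimdPath and simd_path are made up for illustration.

```cpp
#include <string>

// Both flags defined together, as in the official avx2 and bmi2 builds.
#define USE_AVX2
#define USE_SSE41

// Simplified stand-in for a selection chain: the first matching branch wins.
#if defined(USE_AVX2)
constexpr const char* kSimdPath = "avx2";
#elif defined(USE_SSE41)
constexpr const char* kSimdPath = "sse41";  // never reached while USE_AVX2 is defined
#else
constexpr const char* kSimdPath = "generic";
#endif

// The same priority order as a plain function, so other flag combinations
// can be checked without recompiling. Hypothetical helper for illustration.
std::string simd_path(bool use_avx2, bool use_sse41) {
    if (use_avx2)  return "avx2";
    if (use_sse41) return "sse41";
    return "generic";
}
```

With both macros defined, kSimdPath is "avx2". The "sse41" branch only becomes reachable for a build that sets USE_SSE41 without USE_AVX2, which matches the one 32-bit target mentioned above.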
Not sure I understand, but my modern build states it uses SSE41.
Compiled by g++ (GNUC) 9.3.0 on Linux
Compilation settings include: 64bit SSE41 SSSE3 SSE2 POPCNT
__VERSION__ macro expands to: 9.3.0
The point I am making is that the build flag SSE41 is set in the Makefile, but it is completely useless in all 64-bit builds. It might therefore just as well be removed from all 64-bit configurations in the Makefile, as I suggested here https://github.com/official-stockfish/Stockfish/pull/3589. In my opinion the resulting binaries will be identical (except where the compiler flag "SSE41" is printed on the console).
As I understand it, they will never be used in 64-bit targets, but there is currently one 32-bit target where they are used: 'x86-32-sse41-popcnt'. So I would not remove them from 'nnue_feature_transformer.h' just yet.
Joerg is correct.
This means that the USE_SSE41 flag - for 64-bit at least - does not produce any code, as it is virtually overwritten by USE_AVX2.
That's not true. There are processors where AVX2 is not available but SSE4.1 is.
Intel Core i5 760
Intel Core i5-7600K
Intel E8500
Intel Core i7-720QM
Before we go into the question of what combination of flags anyone can compile on their own development machine, let's stick with the official Stockfish 13 binaries. Originally the question about the flags referred to the official Stockfish 13 binary distributables, which users worldwide will get from the official download sites:
To make clear what I mean, I've written out the compiler flag combinations that these binaries were built with (as reported by the binaries themselves):
Compiled by g++ (GNUC) 7.5.0 on Linux
__VERSION__ macro expands to: 7.5.0
stockfish_13_linux_x64_bmi2: 64bit BMI2 AVX2 SSE41 SSSE3 SSE2 POPCNT
stockfish_13_linux_x64_avx2: 64bit AVX2 SSE41 SSSE3 SSE2 POPCNT
stockfish_13_linux_x64_modern: 64bit SSSE3 SSE2 POPCNT
stockfish_13_linux_x64_ssse: 64bit SSSE3 SSE2
stockfish_13_linux_x64: 64bit SSE2
Compiled by g++ (GNUC) 7.3.0 on MinGW64/MinGW32
__VERSION__ macro expands to: 7.3-posix 20180312
stockfish_13_win_x64_bmi2: 64bit BMI2 AVX2 SSE41 SSSE3 SSE2 POPCNT
stockfish_13_win_x64_avx2: 64bit AVX2 SSE41 SSSE3 SSE2 POPCNT
stockfish_13_win_x64_modern: 64bit SSSE3 SSE2 POPCNT
stockfish_13_win_x64_ssse: 64bit SSSE3 SSE2
stockfish_13_win_x64: 64bit SSE2
stockfish_13_win_32bit: 32bit MMX
Now, do we agree on the fact that all four occurrences of SSE41 in the officially created binaries for Linux and Windows are nonsense, because those are the AVX2 versions in the first place and setting SSE41 is useless in that case (see actual Stockfish source code, as explained above)?
Perhaps we should consider dropping even more archs? AVX2 has been supported pretty much since 2011. There are only a few processors that support SSSE3 but don't support popcount. We could reduce the set to 32-bit, 64-bit, [SSE2, ]SSSE3+POPCNT, AVX2, BMI2, VNNI256, AVX512, VNNI512. All this with transitivity, so no builds with POPCNT but without SSSE3, or with 32-bit SSE support. Performance below AVX2 is not very good, so we shouldn't care much either way; these people don't care about performance.
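The "transitivity" idea can be sketched as an ordered list of tiers where each tier implies everything below it. This is only an illustration: treating the tiers as one linear chain in the order listed above is an assumption, the 32-bit tier is left out, and the names Tier, implied_tiers and can_run are hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// 64-bit tiers in the order proposed above. Assumption for this sketch:
// they form a single chain, so a higher tier includes every lower one.
enum class Tier : std::size_t { X86_64, SSSE3_POPCNT, AVX2, BMI2, VNNI256, AVX512, VNNI512 };

const char* const kTierNames[] = {
    "x86-64 (SSE2)", "SSSE3+POPCNT", "AVX2", "BMI2", "VNNI256", "AVX512", "VNNI512"
};

// Everything a build of tier t may rely on: t itself plus all lower tiers.
std::vector<std::string> implied_tiers(Tier t) {
    std::vector<std::string> out;
    for (std::size_t i = 0; i <= static_cast<std::size_t>(t); ++i)
        out.emplace_back(kTierNames[i]);
    return out;
}

// A build can run on a CPU if the CPU's tier is at least the build's tier.
bool can_run(Tier build, Tier cpu) {
    return static_cast<std::size_t>(build) <= static_cast<std::size_t>(cpu);
}
```

Under this model a combination such as "POPCNT without SSSE3" simply cannot be expressed, which is the point of the proposal.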
I find it particularly funny, by the way, that it's the versions that don't use AVX2 and don't use SSE41 that are called "modern". After all, these are the CPUs that are over 10 years old.
@Sopel97 BTW, VNNI builds are very slow now https://github.com/official-stockfish/Stockfish/issues/3457
I'm aware of that, and I have some understanding why that is. Fixing it is an option but requires a partial revert that would blow the size of the code a bit. It's on my radar but I want to get the current drama, and the overall ARCH situation settled.
You're trying to push it to an extreme in two ways here to support your argument.
BTW, here is the Steam hardware survey from May 2021: https://store.steampowered.com/hwsurvey/Steam-Hardware-Software-Survey-Welcome-to-Steam
I'm being very generous with wanting to support SSE2 and 32bit. I'm the 2% minority in both of these tables.
Here is another "the 100 most common CPUs" statistic (as of 27-Jun-2021) from https://cpu.userbenchmark.com/
# CPU Mkt. share (%) Age (months)
1 AMD Ryzen 5 3600 4.5 23
2 AMD Ryzen 7 3700X 2.93 23
3 AMD Ryzen 5 5600X 2.51 7
4 AMD Ryzen 7 5800X 1.88 7
5 AMD Ryzen 5 2600 1.78 38
6 Intel Core i7-10750H 1.45 16+
7 Intel Core i7-9700K 1.43 32
8 AMD Ryzen 9 5900X 1.34 7
9 Intel Core i9-9900K 1.3 32
10 Intel Core i7-8700K 1.23 44
11 Intel Core i5-9400F 1.19 28
12 Intel Core i7-10700K 1.13 13
13 AMD Ryzen 7 2700X 1.14 38
14 AMD Ryzen 9 3900X 1.05 23
15 Intel Core i7-9750H 1.01 26+
16 Intel Core i7-7700HQ 1.0 54+
17 AMD Ryzen 5 3600X 0.96 23
18 Intel Core i5-10400F 0.96 13
19 Intel Core i7-8750H 0.91 39+
20 Intel Core i7-7700K 0.93 53
21 Intel Core i7-6700K 0.92 70
22 Intel Core i5-9600K 0.83 32
23 Intel Core i7-8700 0.82 44
24 AMD Ryzen 5 2600X 0.71 38
25 Intel Core i5-8250U 0.65 45+
26 Intel Core i5-10300H 0.66 17+
27 Intel Core i7-3770 0.64 111
28 Intel Core i7-4790K 0.67 84
29 Intel Core i7-4790 0.64 85
30 Intel Core i5-7200U 0.62 59+
31 AMD Ryzen 7 3800X 0.62 23
32 Intel Core i9-10850K 0.62 10+
33 AMD Ryzen 9 5950X 0.62 7
34 Intel Core i5-8400 0.66 44
35 Intel Core i7-7700 0.6 53
36 Intel Core i7-6700HQ 0.58 69+
37 Intel Core i5-7400 0.55 53
38 Intel Core i7-6700 0.58 69
39 Intel Core i5-3470 0.54 110
40 Intel Core i9-10900K 0.56 13
41 AMD Ryzen 7 4800H 0.54 16+
42 AMD Ryzen 5 3400G 0.54 24+
43 AMD Ryzen 7 5800H 0.51 5+
44 Intel Core i5-6500 0.48 69
45 Intel Core i7-10700 0.49 13
46 Intel Core i5-4460 0.5 85
47 Intel Core i5-9300H 0.46 27+
48 AMD Ryzen 3 3200G 0.48 24+
49 Intel Core i5-1135G7 0.46 9+
50 AMD FX-8350 0.45 104
51 AMD Ryzen 5 1600 0.48 50
52 AMD Ryzen 5 4600H 0.45 14+
53 Intel Core i5-8300H 0.46 41+
54 Intel Core i7-1165G7 0.45 12+
55 Intel Core i7-11700K 0.45 3
56 Intel Core i5-6600K 0.43 70
57 Intel Core i5-10400 0.44 13
58 AMD Ryzen 3 2200G 0.41 40+
59 Intel Core i5-10600K 0.42 13
60 Intel Core i7-8550U 0.41 46+
61 Intel Core i7-2600 0.4 125
62 Intel Core i5-4590 0.36 83
63 Intel Core i3-9100F 0.41 26
64 AMD FX-6300 0.4 104
65 Intel Core i5-2400 0.4 125
66 Intel Core i5-8600K 0.4 44
67 Intel Core i7-4770 0.38 99
68 AMD Ryzen 5 3500U 0.4 26+
69 Intel Core i5-6200U 0.37 68+
70 Intel Core i7-10700F 0.37 12+
71 Intel Core i7-7500U 0.38 57+
72 Intel Core i5-4570 0.35 98
73 AMD Ryzen 7 2700 0.38 38
74 AMD Ryzen 5 1600AF 0.37 34+
75 Intel Core i5-6400 0.36 69
76 Intel Core i5-1035G1 0.35 21+
77 Intel Core i5-8265U 0.34 34+
78 AMD Ryzen 5 3500X 0.34 20
79 Intel Core i5-7300HQ 0.34 53+
80 AMD Ryzen 5 3550H 0.32 28+
81 Intel Core i5-6300U 0.3 73+
82 Intel Core i5-4690K 0.31 85
83 Intel Core i5-10210U 0.31 24+
84 Intel Core i7-9700 0.32 24+
85 Intel Core i5-7500 0.3 53
86 Intel Core i7-10700KF 0.29 12+
87 AMD Ryzen 9 5900HS 0.28 4+
88 Intel Core i5-5200U 0.27 77+
89 Intel Core i5-7600K 0.27 53
90 Intel Core i7-10870H 0.27 8+
91 Intel Core i7-10875H 0.28 14+
92 Intel Core i7-10510U 0.26 21+
93 AMD Ryzen 3 3100 0.27 13+
94 AMD Ryzen 5 2400G 0.27 40+
95 AMD Ryzen 7 1700 0.26 51
96 Intel Core i5-3570 0.26 90+
97 Intel Core i7-3770K 0.25 110
98 Intel Core i9-11900K 0.26 3
99 Intel Core i5-3570K 0.26 110
100 Intel Core i5-11600K 0.26 2
There is another problem with the current Stockfish binary in this context. It is absolutely static and crashes on me without any error message when I start, e.g., a BMI2 version on an Intel Core i5-3450 @ 3.10GHz (Ivy Bridge). Only in the debugger can you see what's going on: Unhandled exception at 0x0000013F1D32B5 in stockfish.exe: 0xC000001D: Illegal Instruction. A "normal user" would be completely overwhelmed by this: the program simply crashes without even displaying a message as to why. No software should behave like this in 2021, in my humble opinion. It's really not that hard to check the processor capabilities at startup, i.e. at runtime (and not at compile time); this very simple GitHub project shows how to do it: https://github.com/Mysticial/FeatureDetector
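A startup check of this kind could look roughly like the sketch below. It uses the GCC/Clang builtin __builtin_cpu_supports rather than the approach of the linked project, and the names CpuFeatures, detect_cpu and missing_features are hypothetical. Note that such a check only helps if the code performing it is itself compiled for the baseline instruction set.

```cpp
#include <string>
#include <vector>

// Features relevant to the builds discussed in this thread.
struct CpuFeatures {
    bool popcnt;
    bool ssse3;
    bool avx2;
    bool bmi2;
};

// Query the running CPU. On compilers or targets without the GCC/Clang x86
// builtins this sketch conservatively reports no features at all.
CpuFeatures detect_cpu() {
    CpuFeatures f{};  // all false
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    f.popcnt = __builtin_cpu_supports("popcnt") != 0;
    f.ssse3  = __builtin_cpu_supports("ssse3") != 0;
    f.avx2   = __builtin_cpu_supports("avx2") != 0;
    f.bmi2   = __builtin_cpu_supports("bmi2") != 0;
#endif
    return f;
}

// Names of the features a build requires but the CPU does not offer.
std::vector<std::string> missing_features(const CpuFeatures& required,
                                          const CpuFeatures& available) {
    std::vector<std::string> missing;
    if (required.popcnt && !available.popcnt) missing.push_back("POPCNT");
    if (required.ssse3  && !available.ssse3)  missing.push_back("SSSE3");
    if (required.avx2   && !available.avx2)   missing.push_back("AVX2");
    if (required.bmi2   && !available.bmi2)   missing.push_back("BMI2");
    return missing;
}
```

At the top of main, a binary could call missing_features(requirements_of_this_build, detect_cpu()) and, if the list is not empty, print the missing feature names and exit cleanly instead of hitting an illegal instruction later.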
So if I were to think this through and wish for something, this old-fashioned #ifdef #else NNUE C-programming style from the 80's should soon be replaced with real C++ concepts like Strategy Design Patterns as far as the different CPU configurations are concerned. It's already a mess, how else to maintain it in the future? I could make a PR proposal, but I'm not sure if it would be widely accepted and worth the effort. I'm new to this Stockfish community and can't really assess how it works right now...
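For readers unfamiliar with the pattern, a Strategy-style layout might look like the following sketch. Everything here is hypothetical (AccumulateStrategy, GenericStrategy, Avx2Strategy, make_strategy), and the "AVX2" strategy is a scalar placeholder so that the example runs on any machine; a real one would use intrinsics and would have to live in its own source file compiled with the matching flag.

```cpp
#include <cstdint>
#include <memory>
#include <string>
#include <vector>

// One interface per kernel family instead of #ifdef blocks at each call site.
struct AccumulateStrategy {
    virtual ~AccumulateStrategy() = default;
    virtual std::string name() const = 0;
    virtual std::int32_t sum(const std::vector<std::int16_t>& v) const = 0;
};

// Portable fallback.
struct GenericStrategy final : AccumulateStrategy {
    std::string name() const override { return "generic"; }
    std::int32_t sum(const std::vector<std::int16_t>& v) const override {
        std::int32_t s = 0;
        for (std::int16_t x : v) s += x;
        return s;
    }
};

// Placeholder only: scalar code standing in for an AVX2 implementation.
struct Avx2Strategy final : AccumulateStrategy {
    std::string name() const override { return "avx2"; }
    std::int32_t sum(const std::vector<std::int16_t>& v) const override {
        std::int32_t s = 0;
        for (std::int16_t x : v) s += x;
        return s;
    }
};

// Chosen once at startup from the detected CPU capabilities.
std::unique_ptr<AccumulateStrategy> make_strategy(bool cpu_has_avx2) {
    if (cpu_has_avx2) return std::make_unique<Avx2Strategy>();
    return std::make_unique<GenericStrategy>();
}
```

The design question is where the virtual call sits: dispatching once per large kernel keeps the overhead negligible, whereas dispatching inside an inner loop would likely cost more than the #ifdef approach it replaces.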
I think a single binary that is the fastest on all CPU configs would be very neat. It would certainly avoid confusion for people who just use it as a tool for human v human chess and probably don't know which CPU they have. If you make a PR, keep in mind that on Zen and Zen2 bmi2 is a slowdown, while on Zen3 it is natively implemented and is faster. Also, during compilation there should be an argument to restrict it to, say, bmi2 and 'lower', since most CPUs are incapable of doing a profile-guided (PGO) build for AVX512.
I think a single binary that doesn't crash on any CPU and unleashes the best capabilities of Stockfish without silly user queries would simply be in the spirit of the times and what to expect in 2021. I could suggest a PR for this binary, but also don't want to waste my time when someone who will end up deciding this is already signaling "nah, we don't really need that...".
Could someone who has to decide this please comment? My time is valuable to me... Thanks!
This would require the compiler to support all archs being compiled for, separate cpp files for things that need to be compiled with -mavx2, -mavx512 and so on, issues with neon, and I have no clue what the linker and LTO would do
I find it particularly funny, by the way, that it's the versions that don't use AVX2 and don't use SSE41 that are called "modern". After all, these are the CPUs that are over 10 years old.
AVX2 was introduced with Haswell in Q2 2013, so Sandy/Ivy Bridge (with only AVX) were the "modern" CPUs when fishtest started operation in Q1 2013
https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#CPUs_with_AVX2
EDIT_000: fishtest broke the 100 workers milestone only in Q4 2015, so at the time it made sense to speed up every single ARCH, see the "New CPU record!" thread: https://groups.google.com/g/fishcooking/c/lebEmG5vgng/
Absolutely! It still makes sense to accelerate and get the most out of every single ARCH, in my opinion. I can totally relate to how the name "modern" evolved. It's always the same with designations like "modern", "latest", "best", etc.: they are only current for a while and then you have to keep shifting them. Maybe it's time to adjust that, at least for the user versions offered on the official download page.
OK. I get the sentiment that most possible archs should be supported, even if just for a reference. I'll be working on a solution that will please everyone and maybe end this drama.
Does anyone know why, in the official Stockfish 14 release, the version https://stockfishchess.org/files/stockfish_14_win_x64_modern.zip is again being called "SSE4.1 + POPCNT" although it was compiled without the SSE41 flag? IMHO, it should correctly be named "SSSE3 + POPCNT", as already discussed above, shouldn't it? A quick look at the UCI console is enough to see that this is the case:
C:\Temp\stockfish_14_win_x64_modern\stockfish_14_win_x64_modern>stockfish_14_x64_modern.exe
Stockfish 14 by the Stockfish developers (see AUTHORS file)
compiler
Compiled by g++ (GNUC) 7.3.0 on MinGW64
Compilation settings include: 64bit SSSE3 SSE2 POPCNT
__VERSION__ macro expands to: 7.3-posix 20180312
In my opinion, the Web Page still needs to be corrected and say "SSSE3 + POPCNT" instead of "SSE4.1 + POPCNT". Or am I misunderstanding something?
Does anyone know a reason why in the official Stockfish 14 release the version https://stockfishchess.org/files/stockfish_14_win_x64_modern.zip
There are MacOSX files/folders in that archive.
Abrok file is different but compiled with wrong flags https://abrok.eu/stockfish/builds/773dff020968f7a6f590cfd53e8fd89f12e15e36/win64modern/stockfish_21070214_x64_modern.zip
Doesn't anyone notice that the garbage "__MACOSX" files are included, and so on? (all of the same size, 231 bytes) Looks like a broken Makefile or some build job step that archives the artifacts wrongly. I mean, it is "The Official Release"...
The binaries have the same SHA256: CDE329F56CD5EC67FCFD06BCE2984BA54C2F472DB40694654DCE65E4BA394B1F, so it's the abrok build.
The official archive was built on a MacOSX machine adding the folder with the source code.
In any case, since they are the same binaries, both are compiled with SSSE3 SSE2 POPCNT (see above). Why does the download webpage call this binary "SSE4.1+POPCNT"? I think there is a gap somewhere, and the people who create the webpage don't check/know exactly what they write. And nobody notices in QA that some garbage directories from a MacOSX machine are included, too? Am I the first to notice this? Or does everyone else just not care, and people have gotten used to the "__MACOSX*" garbage additions? It's always been there, so it's practically become "normal" in the official Stockfish Windows distribution. I find it quite confusing. But maybe it's just me...
Mac users are unaware of the problem :) https://perishablepress.com/remove-macosx-ds-store-zip-files-mac/
Well, that's nice for them. So there is virtually no problem at all. And there is also no problem with the version names SSE4.1 etc., because the Web Site People are unaware of the problem, too. ;-)
To summarise: The current Stockfish build structure and naming (unfortunately also in Stockfish 14) are, in my humble opinion, no longer up to date. Furthermore, some of the names on the web page are simply not correct. I propose the following streamlining:
Merge stockfish_win_x64_bmi2 and stockfish_win_x64_avx2 into one build configuration, because both instruction sets have been available since the Haswell CPU microarchitecture (06-2013): https://en.wikipedia.org/wiki/Bit_manipulation_instruction_set#Supporting_CPUs / https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#CPUs_with_AVX2. Why offer them separately in two different build configurations?
POPCNT is not a SIMD instruction. It has nothing to do with SSE4.1 per se. The name SSE4.1 + POPCNT on the Web Page is wrong because SSE4.1 instructions are not used in any of the Stockfish x64 builds, as explained in detail above.
The designation "modern" should be dispensed with, as it becomes obsolete faster than it can be updated.
The following 5 instead of 6 Windows builds would be more than sufficient at the moment, in my opinion:
BMI2 + AVX2: (for CPUs later than ~2013) = stockfish_14_win_x64_bmi2.zip https://en.wikipedia.org/wiki/Bit_manipulation_instruction_set#Supporting_CPUs https://en.wikipedia.org/wiki/Advanced_Vector_Extensions#CPUs_with_AVX2
POPCNT + SSSE3: (for CPUs later than ~2008) = stockfish_14_win_x64_modern.zip https://en.wikipedia.org/wiki/Nehalem_(microarchitecture)
SSSE3: (for CPUs later than ~2006) = stockfish_14_win_x64_ssse.zip https://en.wikipedia.org/wiki/SSSE3#CPUs_with_SSSE3
64-bit: (for CPUs later than ~2003) = stockfish_14_win_x64.zip https://en.wikipedia.org/wiki/SSE2#CPU_support
32-bit: (for CPUs later than ~1997) = stockfish_14_win_32bit.zip https://en.wikipedia.org/wiki/MMX_(instruction_set)
Thanks for the good discussion. Regarding these 3 points:
compiler argument so I've updated the website: SSE4.1 + POPCNT is now just POPCNT. Regarding the __MACOSX files, that's because I zipped the binaries on a Mac, but at some point I re-zipped them on a Windows virtual machine so they should no longer be present if you download them now.
Stepping back, it sure would be elegant if we could distribute a single binary that automatically executed the fastest codepath for the machine it's running on. Right now, the main download page (https://stockfishchess.org/download/) tries to guess with a "Faster" and a "More compatible" option that target ~86% and ~98.3% of users respectively (by my math), depending on whether the user is running Windows 10 or not (source).
Thanks @daylen, @Sopel97 and all for the improvement and comments. I appreciate that the AMD implementation of BMI2 before Zen 3 prevents the build configurations from being merged. I totally agree, a single binary would simplify a lot for developers and especially for Stockfish users. I have started working on this (see draft pull request https://github.com/official-stockfish/Stockfish/pull/3602), but as I said, I am currently on summer holiday. When I get back to it, I will push the next refinement. I'm also planning to batch build the current Stockfish x86 64-bit binaries (with embedded nnue) with the latest Intel C++ optimizing compiler and Microsoft C++ compiler to compare them with the GCC binaries performance-wise. I think Windows users don't care what compiler the Stockfish .exe was built with, and we should offer the fastest possible binaries on the official download page. It also has advantages for TCEC and CCRL, where a few extra percent can make a big difference.
According to the official Download Stockfish for Windows page https://stockfishchess.org/download/windows/ the Windows binary "stockfish_13_win_x64_modern.zip" is supposedly compiled as "SSE4.1 + POPCNT: Intel processors after ~2008, AMD processors after ~2011". However, the information inside the binary itself states that SSE41 support was not enabled in the Makefile during the build process; only the 64bit SSSE3 SSE2 POPCNT flags were used:
In my humble opinion, either the Web page should read "SSSE3 + POPCNT: Intel processors after ~2008, AMD processors after ~2011" or the build process has to be corrected to use the SSE41 flag as well (currently this flag is only set in BMI2 and AVX2 builds).