syzygy1 / Cfish

C port of Stockfish
GNU General Public License v3.0

Sparse and AVX2 #172

Open syzygy1 opened 4 years ago

syzygy1 commented 4 years ago

On my AVX2 laptop, sparse multiplication now turns out to be slower than the non-sparse multiplication. I suspect that this is not the case on some other AVX2 CPUs, in particular Zen 1.

I have therefore added a compilation option.

To compile with sparse multiplication:
make -j pgo sparse=yes

To compile without sparse multiplication:
make -j pgo sparse=no

The default is "sparse=yes", except for AVX2 targets (including BMI2, VNNI, AVX512), where the default is "sparse=no".

If it is clear that "sparse=yes" is still faster on Zen 1 or on other CPUs with AVX2, I can make it the default on those CPUs. I cannot test this myself, so if anyone is willing to try sparse=yes/no on Zen 1 or other CPUs, that would be very welcome.

It would also be interesting to know if sparse=no is faster on any non-AVX2 CPUs.
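For readers unfamiliar with the distinction, here is a minimal scalar sketch of the two approaches for one affine layer. It is not the Cfish code, which uses SIMD intrinsics and its own weight layout; the function names, the layout (`weights[i * n_out + o]`) and the test data are assumptions made for illustration only.

```c
#include <stddef.h>
#include <stdint.h>

/* Non-sparse (dense) version: every input is multiplied, zero or not.
 * The loop is branch-free and maps well onto wide SIMD. */
void affine_dense(const int8_t *weights, const int32_t *biases,
                  const uint8_t *input, int32_t *output,
                  size_t n_in, size_t n_out)
{
  for (size_t o = 0; o < n_out; o++) {
    int32_t sum = biases[o];
    for (size_t i = 0; i < n_in; i++)
      sum += (int32_t)weights[i * n_out + o] * input[i];
    output[o] = sum;
  }
}

/* Sparse version: inputs that are zero (common after a clipped ReLU)
 * are skipped, at the cost of a data-dependent branch per input. */
void affine_sparse(const int8_t *weights, const int32_t *biases,
                   const uint8_t *input, int32_t *output,
                   size_t n_in, size_t n_out)
{
  for (size_t o = 0; o < n_out; o++)
    output[o] = biases[o];
  for (size_t i = 0; i < n_in; i++) {
    if (input[i] == 0)
      continue; /* nothing to add for this input */
    for (size_t o = 0; o < n_out; o++)
      output[o] += (int32_t)weights[i * n_out + o] * input[i];
  }
}

/* Hypothetical self-check: returns 1 if both versions agree. */
int sparse_matches_dense(void)
{
  enum { N_IN = 8, N_OUT = 4 };
  int8_t weights[N_IN * N_OUT];
  int32_t biases[N_OUT] = { 1, -2, 3, -4 };
  uint8_t input[N_IN] = { 0, 5, 0, 0, 127, 0, 0, 9 };
  int32_t dense[N_OUT], sparse[N_OUT];

  for (int k = 0; k < N_IN * N_OUT; k++)
    weights[k] = (int8_t)(k * 7 % 31 - 15); /* arbitrary small weights */

  affine_dense(weights, biases, input, dense, N_IN, N_OUT);
  affine_sparse(weights, biases, input, sparse, N_IN, N_OUT);

  for (int o = 0; o < N_OUT; o++)
    if (dense[o] != sparse[o])
      return 0;
  return 1;
}
```

Both functions compute the same result. Which one is faster depends on how many inputs are zero and on how cheaply the CPU executes the dense loop, which is presumably why the answer differs between instruction sets and CPU generations.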

syzygy1 commented 4 years ago

The number of search threads might also have an impact on which is faster...

JavaMast commented 4 years ago

Screenshot_185

JavaMast commented 4 years ago

311020_1 = Correctly display castling rights for Chess960. 311020_2 = Improve non-sparse multiplication.

JavaMast commented 4 years ago

Screenshot_186

*Ryzen 3900X @3.8 GHz

JavaMast commented 4 years ago

Screenshot_187

syzygy1 commented 4 years ago

Thanks, so sparse AVX2 is still clearly better on AMD. Were these all tested on Ryzen 3900X?

JavaMast commented 4 years ago

Yes, all on my Ryzen 3900X.

I hope to get tests on other CPUs soon.

JavaMast commented 4 years ago

Intel i5 760 (Nehalem), 2,95 GHz

bench 16 1 13 default depth NNUE bench 16 1 13 default depth Pure

bench 16 3 13 default depth NNUE

JavaMast commented 4 years ago

Athlon_x4_870K Athlon_x4_870K

syzygy1 commented 4 years ago

Thanks again!

So on Nehalem, no_sparse is now better than sparse, which was the other way around before the improvement. On my Sandybridge PC, no_sparse is improved, but sparse is still better. So there is no clear Intel rule here.

The Athlon results have a pretty high variance, but seem to suggest sparse is better.

JavaMast commented 4 years ago

Intel Core i5-7600K Intel Core i5-7600K_NNUE Intel Core i5-7600K_Pure

JavaMast commented 4 years ago

Intel 6800k

Intel 6800k 1 Intel 6800k 2 Intel 6800k 3 Intel 6800k 4 Intel 6800k 5 Intel 6800k 6

JavaMast commented 4 years ago

i7-7700HQ @2.80GHz

i7-7700HQ

syzygy1 commented 4 years ago

Thanks. So sparse=no is now better on Intel AVX2. For SSE2, sparse=yes is better. (I have now improved non-sparse for SSE2, but it still doesn't get close to sparse.) For SSSE3/SSE41, there is no clear winner on Intel.

On AMD, sparse=yes seems better.

JavaMast commented 4 years ago

It looks like this.

I am very confused by the results on Athlon 870K - today more tests were carried out and the variance has become even greater.

Athlon_x4_870K 2

Was tested with network nn-cb26f10b1fd9.nnue

syzygy1 commented 4 years ago

Maybe the cpu is overheating and then throttles down?

AlexB123 commented 4 years ago

> It looks like this.
>
> I am very confused by the results on Athlon 870K - today more tests were carried out and the variance has become even greater.
>
> Athlon_x4_870K 2
>
> Was tested with network nn-cb26f10b1fd9.nnue

Hello guys! The test above was made on my PC, the same as the speed tests below. Recently my brother made a small update on my PC, and he didn't tell me that I now have Turbo Boost, so now I have to learn how to switch Turbo Boost off (lol). I've repeated the speed test with "Warm up CPU", and the speed looks more or less correct.

Speed Speed2

syzygy1 commented 4 years ago

@AlexB123 Which CPU is that? It seems non-sparse might be a little bit better with 1 thread (except for SSE2, which is expected) but loses to sparse with multiple threads. Non-sparse probably uses a bit more power and therefore increases CPU temps more.

JavaMast commented 4 years ago

@syzygy1 This is Athlon 870K

syzygy1 commented 4 years ago

Ah, I see now.

JavaMast commented 4 years ago

Looks like no_sparse is faster on new AMD CPUs.

AMD RYZEN 9 5950x
Screenshot 2020-11-16 12 40 12

I hope to see BMI2 builds in the speed test soon.

JavaMast commented 4 years ago

AMD RYZEN 9 5950x Screenshot 2020-11-16 15 08 36

Screenshot 2020-11-16 15 19 21

JavaMast commented 3 years ago

After the update to "AVX512, AVX2 and SSSE3 speedups", on the Ryzen 3900X:

Screenshot_232

syzygy1 commented 3 years ago

What is the difference between SSSE3.exe and SSSE3_popcnt_mingw_10.exe ?

syzygy1 commented 3 years ago

I think the fact that no_sparse now beats sparse on Zen 3 shows that AMD has improved their AVX2 implementation in Zen 3.

JavaMast commented 3 years ago

> What is the difference between SSSE3.exe and SSSE3_popcnt_mingw_10.exe ?

SSSE3 and SSSE3_sparse are 32-bit builds (compiled in MinGW i686-8.1.0-posix-dwarf-rt_v6-rev0).

syzygy1 commented 3 years ago

OK, so for 64-bit SSSE3 on Zen 2, sparse=yes is still faster than sparse=no.

But it seems sparse=no is now faster than sparse=yes for AVX2 on Zen 2. I thought sparse=yes was clearly faster before the AVX2 speed up. This suggests that sparse=no is now faster on all CPUs with AVX2.

syzygy1 commented 3 years ago

I just tested a Ryzen 4500U laptop and also found that sparse=yes was faster than sparse=no before the AVX2 speedup patch and is now slower.

JavaMast commented 3 years ago

Hello!

sparse=no is faster for all builds except SSE2 on the Core i5-11400f.

AVX512_VNNI is the fastest.

Screenshot_350

JavaMast commented 3 years ago

Just curious, on my i5 11400f Cfish is faster with Pure mode:

Screenshot_369 Screenshot_371

This holds only for AVX2 builds and higher, not for SSE builds. On the Ryzen 3900X, NNUE is still faster than Pure.

syzygy1 commented 3 years ago

Pure being faster is pretty nice. Is it also stronger?

JavaMast commented 3 years ago

No, Hybrid is still stronger.

BMI2, 10+0,1, concurrency 6

Score of Cfish_x64_120421_ELTO_BMI2 vs Cfish_x64_130421_ELTO_BMI2_Pure: 668 - 521 - 6564 [0.509]
... Cfish_x64_120421_ELTO_BMI2 playing White: 520 - 138 - 3219 [0.549] 3877
... Cfish_x64_120421_ELTO_BMI2 playing Black: 148 - 383 - 3345 [0.470] 3876
... White vs Black: 903 - 286 - 6564 [0.540] 7753
Elo difference: 6.6 +/- 3.0, LOS: 100.0 %, DrawRatio: 84.7 %
7758 of 20000 games finished.

AVX512_VNNI, 10+0,1, concurrency 5

Score of Cfish_x64_120421_ELTO_AVX512_VNNI vs Cfish_x64_130421_ELTO_AVX512_VNNI_Pure: 527 - 507 - 6038 [0.501]
... Cfish_x64_120421_ELTO_AVX512_VNNI playing White: 406 - 119 - 3011 [0.541] 3536
... Cfish_x64_120421_ELTO_AVX512_VNNI playing Black: 121 - 388 - 3027 [0.462] 3536
... White vs Black: 794 - 240 - 6038 [0.539] 7072
Elo difference: 1.0 +/- 3.1, LOS: 73.3 %, DrawRatio: 85.4 %
7076 of 20000 games finished.

JavaMast commented 3 years ago

@syzygy1 do you know how much faster Cfish is on old CPUs? My friend with a Phenom II x6 1100T (compatible with the SSE2 build) told me that Cfish is 2 times faster than Stockfish... On my i5-11400f it is "only" 50% faster.

Screenshot_136

Even the x32 build is faster.

Screenshot_137