official-stockfish / Stockfish

A free and strong UCI chess engine
https://stockfishchess.org/
GNU General Public License v3.0

New AMD Ryzen™ Threadripper™ PRO 3995WX (Windows and Multithreading Problem) #3440

Closed Zerbinati closed 3 years ago

Zerbinati commented 3 years ago

Whenever Windows sees more than 64 logical processors in a system, it separates them into processor groups. The way this is done is very rudimentary: of the enumerated cores and threads, the first 64 go into the first group, the second 64 go into the next group, and so on. As a result, I can only use the 64 threads of one group.

Would it be possible to add a "set affinity" call to the Stockfish code? I can compile and test any possible solution.

Thanks in advance, Marco
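The grouping rule described above can be sketched as plain arithmetic. This is a minimal, portable illustration and not the code used by Windows or Stockfish: `kGroupSize`, `groupOf`, `indexInGroup`, `cpuAt` and `groupCount` are hypothetical names, and the sketch assumes logical processors are simply enumerated in order and cut into groups of 64, as the post describes.

```cpp
#include <cassert>
#include <cstddef>

// Assumption taken from the description above: a group holds at most
// 64 logical processors, filled in enumeration order.
const std::size_t kGroupSize = 64;

// Group that logical processor `cpu` falls into.
std::size_t groupOf(std::size_t cpu) { return cpu / kGroupSize; }

// Position of logical processor `cpu` inside its group.
std::size_t indexInGroup(std::size_t cpu) { return cpu % kGroupSize; }

// Inverse mapping: global index of slot `index` in group `group`.
std::size_t cpuAt(std::size_t group, std::size_t index) {
    assert(index < kGroupSize);  // a slot must lie inside the group
    return group * kGroupSize + index;
}

// Number of groups needed for `total` logical processors (rounded up).
std::size_t groupCount(std::size_t total) {
    return (total + kGroupSize - 1) / kGroupSize;
}
```

Under this rule, 128 logical processors give `groupCount(128) == 2`, and logical processors 64 to 127 all land in group 1, so threads that stay in one group can reach only half of the machine.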

joergoster commented 3 years ago

@Zerbinati Can you give this a try, please? Not sure though, if it works ... (Talkchess seems to be down atm.)

Zerbinati commented 3 years ago

@joergoster I am getting this error when compiling:

    3 -mpopcnt -DUSE_POPCNT -DUSE_AVX2 -mavx2 -DUSE_SSE41 -msse4.1 -DUSE_SSSE3 -mssse3 -DUSE_SSE2 -msse2 -c -o half_kp.o nnue/features/half_kp.cpp
    misc.cpp: In function 'void Stockfish::WinProcGroup::bindThisThread(size_t)':
    misc.cpp:575:68: error: expected ';' before 'group'
      575 | sync_cout << "info string Binding thread " << idx << " to group " group << sync_endl;
          |                                                                  ^~~~~~
          |                                                                  ;
    mingw32-make[2]: *** [<builtin>: misc.o] Error 1
    mingw32-make[2]: *** Waiting for unfinished jobs....

joergoster commented 3 years ago

@Zerbinati Sorry, one << operator was missing. Try again, please.
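For reference, the repaired statement would have the shape below. This is only a sketch: `std::ostringstream` stands in for Stockfish's `sync_cout` so the snippet compiles on its own, `bindingMessage` is a hypothetical helper name, and the `int` type of `group` is an assumption, since the thread does not show its declaration.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>

// Builds the log line from the failing statement, with the missing `<<`
// restored between the string literal " to group " and `group`.
std::string bindingMessage(std::size_t idx, int group) {
    assert(group >= 0);  // assumption: group indices are non-negative
    std::ostringstream out;  // stand-in for sync_cout
    out << "info string Binding thread " << idx << " to group " << group;
    return out.str();
}
```

Without the operator, the compiler sees a string literal directly followed by an identifier, which is why it reports that it expected `;` before `group`.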

Zerbinati commented 3 years ago

@joergoster Compilation is perfect, but the RAM problem is still there:

Hash size = 128 MB: 64 threads 77000 kN/s, 128 threads 112000 kN/s

Hash size = 1024 MB: 64 threads 77000 kN/s, 128 threads 88000 kN/s

Above 1024 MB there is no increase from 64 to 128 threads:

Hash size = 2048 MB: 64 threads 73000 kN/s, 128 threads 73000 kN/s
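To put the reported figures side by side, the gain from 64 to 128 threads can be expressed as a simple ratio. The helper below is a hypothetical sketch (`speedup` is not part of any engine); it only divides the two node rates quoted above.

```cpp
#include <cassert>

// Ratio of the 128-thread node rate to the 64-thread node rate.
// A value of 2.0 would mean perfect scaling, 1.0 means no gain at all.
double speedup(double knps64, double knps128) {
    assert(knps64 > 0.0);  // avoid dividing by zero
    return knps128 / knps64;
}
```

With the numbers from the post, `speedup(77000, 112000)` is about 1.45 at 128 MB hash, `speedup(77000, 88000)` is about 1.14 at 1024 MB, and `speedup(73000, 73000)` is exactly 1.0 at 2048 MB, so the gain shrinks as the hash table grows.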

joergoster commented 3 years ago

Strange. Can you also try with 96 threads? It's really weird that you should get no benefit as soon as you increase the hash size.

I guess going from 32 to 64 threads, you get the expected gain even with say 16 or 32 GB Hash. Right?

Further question: are all RAM slots in use?

Zerbinati commented 3 years ago

The boost from 32 to 64 threads is OK: 130%.

With 64 or 96 threads I get the same kN/s.

RAM: 2 slots in use, 2x 64 GB ECC DDR4-3200, dual channel.

Sopel97 commented 3 years ago

  1. 2 channels is very little for that many cores.
  2. Are you using large pages?

joergoster commented 3 years ago

> boost 32-64 Threads is ok 130%

This is already less than I would expect. What version do you use? avx2? Do you get the same or better performance with "Use NNUE" set to false?

Zerbinati commented 3 years ago

@Sopel97 I use only 2 channels because I intend to add more RAM in the future, so it didn't make sense to occupy more slots with smaller modules. Large pages are enabled. In trying to resolve this, we must not forget that no other engine has any problem whatsoever.

Zerbinati commented 3 years ago

@joergoster I use avx2. With NNUE = false performance is better, but there is still no increase from 64 to 128 threads. Stockfish 10 and 11 have the same problem.

I also tried to reproduce the TCEC conditions with Komodo Dragon, Ethereal and SlowChess: same time, same analysis depth reached. Their node counts are perfectly in line with those of the quad-socket machine used by TCEC, with the 3995WX slightly higher, on the order of 3-5%.

joergoster commented 3 years ago

I can't say much about the internal architecture of this CPU, but is it possible that 2 Threads have to share the same SIMD units for the AVX2 calculations?

Please note, ipman's listed benchmarks are done with asmFish (no NNUE)! Have you tried this special asmFish version with your machine?

Zerbinati commented 3 years ago

Yes, they are on the list:

157.136.270 | AMD Ryzen Threadripper 3 3995WX Pro | 128 threads | pop + LP | Marco Zerbinati

The bmi2 and modern builds also show the same problem.

Sopel97 commented 3 years ago

There was a similar issue in CCC and it was resolved by filling all memory channels.
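A rough back-of-the-envelope calculation shows why the number of populated channels matters here. The sketch below computes theoretical peak bandwidth only, assuming a 64-bit (8-byte) DDR4 channel and the DDR4-3200 modules mentioned earlier in the thread; the eight-channel figure assumes every channel of the platform is populated. `peakBandwidthGBs` and `perThreadGBs` are hypothetical helper names, and real sustained bandwidth will be lower.

```cpp
#include <cassert>

// Theoretical peak in GB/s: mega-transfers per second, times 8 bytes per
// transfer on a 64-bit channel, times the number of populated channels.
double peakBandwidthGBs(double megaTransfersPerSec, int channels) {
    assert(channels > 0);
    return megaTransfersPerSec * 8.0 * channels / 1000.0;
}

// Even share of that peak for each search thread.
double perThreadGBs(double totalGBs, int threads) {
    assert(threads > 0);
    return totalGBs / threads;
}
```

For DDR4-3200 this gives about 51.2 GB/s with 2 channels and about 204.8 GB/s with 8 channels, or roughly 0.4 GB/s versus 1.6 GB/s per thread when 128 threads are searching. That is consistent with the observation that larger hash tables, which cause more cache misses, scale worse on the two-channel setup.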

Zerbinati commented 3 years ago

I'm sorry, and I say this with a lot of humility, but I don't think that's an acceptable solution. If the problem does not arise with any of the other engines, then surely it lies in the code, which I would try to fix myself if I had the right skills.

Fanael commented 3 years ago

> I can't say much about the internal architecture of this CPU, but is it possible that 2 Threads have to share the same SIMD units for the AVX2 calculations?

Only SMT hyperthreads of the same core share execution resources (as that's what SMT is), actual cores are completely independent from one another. Only the L3 cache — 16 MB per each core complex of 4 cores — and external interfaces are shared.

joergoster commented 3 years ago

@Fanael Thank you!

Zerbinati commented 3 years ago

@Fanael thank you.

joergoster commented 3 years ago

@Zerbinati Regarding thread-binding, Ethereal does exactly the same as Stockfish.

But when it comes to probing the Transposition Table, Ethereal doesn't keep pointers to an entry but makes a local copy of the entry. In this branch I tried to do the same in Stockfish. I don't have high hopes it will help with your issue, but if you want to give it a try ... who knows. ;-)
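The difference between the two probing styles can be illustrated with a toy table. This is a minimal sketch and neither Stockfish's nor Ethereal's actual code: `Entry`, `ToyTable`, `probePointer` and `probeCopy` are hypothetical names, the entry layout is simplified, and no thread synchronization is shown.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy entry; real engines pack more fields (move, bound, age, ...).
struct Entry {
    std::uint64_t key;
    int value;
    int depth;
};

class ToyTable {
public:
    explicit ToyTable(std::size_t size) : slots_(size) {
        assert(size > 0);  // the index computation below needs a non-empty table
    }

    // Always-replace store, indexed by key modulo table size.
    void store(std::uint64_t key, int value, int depth) {
        slots_[key % slots_.size()] = Entry{key, value, depth};
    }

    // Pointer style: the caller keeps reading shared memory, which a
    // later store may overwrite.
    const Entry* probePointer(std::uint64_t key) const {
        const Entry& e = slots_[key % slots_.size()];
        return e.key == key ? &e : nullptr;
    }

    // Copy style: the entry is read once into `out`; every later read
    // uses the caller's local copy instead of the shared table.
    bool probeCopy(std::uint64_t key, Entry& out) const {
        const Entry& e = slots_[key % slots_.size()];
        if (e.key != key) return false;
        out = e;
        return true;
    }

private:
    std::vector<Entry> slots_;
};
```

With the copy style, a later `store` into the same slot does not change what the caller already holds, whereas a pointer obtained earlier now refers to the new entry. Whether this affects speed on a given machine is a separate question that this sketch does not answer.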

Zerbinati commented 3 years ago

Very kind of you, Joerg. After the tournament I will compile it and let you know.

Zerbinati commented 3 years ago

@joergoster I tried to compile and test but nothing changes.

joergoster commented 3 years ago

Too bad! Now I have no ideas left.

Zerbinati commented 3 years ago

@joergoster no problem.. thanks anyway for all your help.

MichaelB7 commented 3 years ago

Is switching to Ubuntu an option? Even before doing that, you could install Ubuntu for Windows and see whether the issue persists, while also checking whether Ubuntu is for you. Stockfish runs 10% faster on Ubuntu.

Zerbinati commented 3 years ago

@MichaelB7 Yes, I have already tested Ubuntu and it works very well. Unfortunately, it is not compatible with the programs I use.

mstembera commented 3 years ago

@Zerbinati If this is an issue of low memory bandwidth relative to high NPS, you may want to see if the change in this PR fixes it. https://github.com/official-stockfish/Stockfish/pull/3288

Zerbinati commented 3 years ago

@mstembera Thanks for your input. I applied the patch and compiled, but unfortunately nothing has changed.

Zerbinati commented 3 years ago

@Sopel97 In the end, as a last resort, I tried what you suggested and it solved the problem! Thx! Marco