Zerbinati closed this issue 3 years ago.
@Zerbinati Can you give this a try, please? I'm not sure it works, though... (Talkchess seems to be down at the moment.)
@joergoster I am getting this error when compiling:
3 -mpopcnt -DUSE_POPCNT -DUSE_AVX2 -mavx2 -DUSE_SSE41 -msse4.1 -DUSE_SSSE3 -mssse3 -DUSE_SSE2 -msse2 -c -o half_kp.o nnue/features/half_kp.cpp
misc.cpp: In function 'void Stockfish::WinProcGroup::bindThisThread(size_t)':
misc.cpp:575:68: error: expected ';' before 'group'
  575 |   sync_cout << "info string Binding thread " << idx << " to group " group << sync_endl;
      |                                                                    ^~~~~~
      |                                                                    ;
mingw32-make[2]: *** [<builtin>: misc.o] Error 1
mingw32-make[2]: *** Waiting for unfinished jobs....
@Zerbinati Sorry, one << operator was missing. Try again, please.
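The fix is a single missing `<<` in the log line in misc.cpp. The sketch below shows the corrected expression on its own. Stockfish writes to its own `sync_cout`/`sync_endl`; here a `std::ostringstream` stands in for them so the line compiles outside the engine, and `bindingMessage` is a hypothetical wrapper name, not Stockfish code.

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical wrapper reproducing the corrected log line from misc.cpp,
// with std::ostringstream standing in for Stockfish's sync_cout/sync_endl.
std::string bindingMessage(std::size_t idx, int group) {
    std::ostringstream os;
    // The broken version was `<< " to group " group`: two operands with no
    // operator between them, hence "expected ';' before 'group'".
    os << "info string Binding thread " << idx << " to group " << group;
    return os.str();
}
```

For example, `bindingMessage(3, 1)` produces `info string Binding thread 3 to group 1`.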
@joergoster It compiles perfectly, but the RAM problem is still there:
64 threads: 77000 kN/s, hash size = 128 MB
128 threads: 112000 kN/s, hash size = 128 MB
64 threads: 77000 kN/s, hash size = 1024 MB
128 threads: 88000 kN/s, hash size = 1024 MB
With hash sizes above 1024 MB there is no increase from 64 to 128 threads:
64 threads: 73000 kN/s, hash size = 2048 MB
128 threads: 73000 kN/s, hash size = 2048 MB
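As plain ratios of the kN/s figures reported for going from 64 to 128 threads, the gain at each hash size is:

```latex
\begin{aligned}
\text{Hash } 128\,\text{MB}:  &\quad \frac{112\,000}{77\,000} \approx 1.45 \\
\text{Hash } 1024\,\text{MB}: &\quad \frac{88\,000}{77\,000} \approx 1.14 \\
\text{Hash } 2048\,\text{MB}: &\quad \frac{73\,000}{73\,000} = 1.00
\end{aligned}
```

Linear scaling would give 2.00. The behavior under discussion is that the ratio shrinks as the hash grows.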
Strange. Can you also try with 96 threads? It's really weird that you should get no benefit as soon as you increase the hash size.
I guess going from 32 to 64 threads, you get the expected gain even with, say, 16 or 32 GB of hash. Right?
Further question: are all RAM slots in use?
The boost from 32 to 64 threads is OK: 130%.
64 or 96 threads: same kN/s.
RAM: 2 slots, 2x 64 GB ECC DDR4-3200, dual channel.
Regarding the 130% boost from 32 to 64 threads: this is already less than I would expect. Which build do you use? AVX2? Do you get the same or better performance with "Use NNUE" set to false?
@sopel Only two channels are populated because I intend to increase the RAM in the future, so it didn't make sense to occupy more slots with smaller modules. Large pages are enabled. While trying to resolve this, we must not forget that no other engine has this problem.
@joergoster I use AVX2. With NNUE = false, performance is better, but there is still no increase from 64 to 128 threads. Stockfish 10 and 11 have the same problem. I tried to reproduce TCEC's conditions with Komodo Dragon, Ethereal and SlowChess: with the same time they reach the same analysis depth, and their node counts are perfectly in line with those of the quad-socket machine used by TCEC, with the 3995WX slightly higher, on the order of 3-5%.
I can't say much about the internal architecture of this CPU, but is it possible that two threads have to share the same SIMD units for the AVX2 calculations?
Please note that ipman's listed benchmarks were made with asmFish (no NNUE)! Have you tried that special asmFish version on your machine?
Yes, they are on the list: 157.136.270 | AMD Ryzen Threadripper 3 3995WX Pro | 128 threads | pop + LP | Marco Zerbinati
The BMI2 and modern builds have the same problem too.
There was a similar issue in CCC and it was resolved by filling all memory channels.
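A rough calculation of theoretical peak memory bandwidth shows why the number of populated channels matters. DDR4-3200 performs 3200 million transfers per second over a 64-bit (8-byte) channel. The Threadripper PRO 3995WX supports eight DDR4 channels, while the machine in this thread uses two. These are theoretical peaks, not measurements:

```latex
\begin{aligned}
\text{per channel} &= 3200 \times 10^{6}\,\tfrac{\text{transfers}}{\text{s}} \times 8\,\tfrac{\text{B}}{\text{transfer}} = 25.6\,\text{GB/s} \\
\text{2 channels}  &= 2 \times 25.6 = 51.2\,\text{GB/s} \\
\text{8 channels}  &= 8 \times 25.6 = 204.8\,\text{GB/s}
\end{aligned}
```

With two of eight channels populated, all cores share a quarter of the platform's peak bandwidth.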
I'm sorry, and with all humility, I don't think that's an acceptable solution. If the problem does not arise with any other engine, it must surely be in the code, and I would try to fix it myself if I had the right skills.
Regarding whether two threads might have to share the same SIMD units for the AVX2 calculations: only SMT hyperthreads of the same core share execution resources (that is what SMT is); actual cores are completely independent of one another. Only the L3 cache (16 MB per core complex of 4 cores) and the external interfaces are shared.
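For the 3995WX, which has 64 cores (the 128 threads in the benchmark listing, at two SMT threads per core), the L3 figures quoted for Zen 2 core complexes add up as follows:

```latex
\begin{aligned}
\text{core complexes} &= 64 / 4 = 16 \\
\text{total L3}       &= 16 \times 16\,\text{MB} = 256\,\text{MB}
\end{aligned}
```

The chip has 256 MB of L3 in total, but each core's own L3 is the 16 MB slice of its complex.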
@Fanael Thank you!
@Fanael thanks.
@Zerbinati Regarding thread-binding, Ethereal does exactly the same as Stockfish.
But when it comes to probing the Transposition Table, Ethereal doesn't keep pointers to an entry but makes a local copy of the entry. In this branch I tried to do the same in Stockfish. I don't have high hopes it will help with your issue, but if you want to give it a try ... who knows. ;-)
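The difference between keeping a pointer into the table and taking a local copy can be sketched as below. This is a simplified, single-threaded, hypothetical table. `TinyEntry`, `TinyTT`, `probe_ptr` and `probe_copy` are illustrative names, not Stockfish's or Ethereal's actual code, and the layout is not either engine's real entry format.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified entry (not the real layout of either engine).
struct TinyEntry {
    std::uint16_t key16;  // upper 16 bits of the position key, used to verify a hit
    std::int16_t value;
    std::uint8_t depth;
};

class TinyTT {
public:
    explicit TinyTT(std::size_t slots) : table_(slots) { assert(slots > 0); }

    void store(std::uint64_t key, std::int16_t value, std::uint8_t depth) {
        table_[index(key)] = TinyEntry{check(key), value, depth};
    }

    // Pointer style: the caller keeps reading fields through the pointer, so in
    // a multi-threaded search another thread may overwrite the slot between the
    // key check and the later reads of value/depth.
    TinyEntry* probe_ptr(std::uint64_t key, bool& found) {
        TinyEntry* e = &table_[index(key)];
        found = e->key16 == check(key);
        return e;
    }

    // Copy style: read the slot once into a local value, so the key check and
    // all later uses refer to the same snapshot. The copy itself is not atomic;
    // it only keeps the caller's view consistent after that single read.
    bool probe_copy(std::uint64_t key, TinyEntry& out) const {
        out = table_[index(key)];
        return out.key16 == check(key);
    }

private:
    std::size_t index(std::uint64_t key) const { return key % table_.size(); }
    static std::uint16_t check(std::uint64_t key) {
        return static_cast<std::uint16_t>(key >> 48);
    }
    std::vector<TinyEntry> table_;
};
```

Ethereal's approach, as described here, corresponds to `probe_copy`. The search then works only with its local `TinyEntry` and never rereads the shared slot.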
Very kind of you, Joerg. After the tournament I'll compile it and let you know.
@joergoster I tried to compile and test but nothing changes.
Too bad! Now I have no ideas left.
@joergoster No problem, thanks anyway for all your help.
Is switching to Ubuntu an option? Even before doing that, install Ubuntu for Windows (WSL) and see whether the issue shows up there, while also checking whether Ubuntu is for you. Stockfish runs 10% faster on Ubuntu.
@MichaelB7 Yes, I have already tested Ubuntu and it works very well; unfortunately, it is not compatible with the programs I use.
@Zerbinati If this is an issue of low memory bandwidth relative to high NPS, you may want to check whether the change in this PR fixes it: https://github.com/official-stockfish/Stockfish/pull/3288
@mstembera Thanks for your help. I have applied the patch and compiled it; unfortunately, nothing has changed.
@Sopel97 In the end, as a last resort, I tried what you suggested and it solved the problem! Thanks! Marco
Whenever a system has more than 64 hardware threads, Windows separates them into processor groups. The way this is done is very rudimentary: of the enumerated cores and threads, the first 64 go into the first group, the second 64 go into the next group, and so on. So I can only use the 64 threads of one group. Would it be possible to add a "set affinity" call to the Stockfish code? I can compile and test any possible solution. Thanks in advance, Marco
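The request can be sketched as two steps: choosing which processor group a search thread should use, and then pinning the thread to that group. This is a hypothetical sketch. `groupForThread` and `bindThisThreadToGroup` are illustrative names, not Stockfish's code; Stockfish already has its own binding logic in `WinProcGroup::bindThisThread`. The group-selection part is portable. The binding part uses the Windows 7+ APIs `GetActiveProcessorCount` and `SetThreadGroupAffinity` and is compiled only on Windows.

```cpp
#ifdef _WIN32
#if !defined(_WIN32_WINNT) || _WIN32_WINNT < 0x0601
#undef _WIN32_WINNT
#define _WIN32_WINNT 0x0601  // expose the processor-group APIs (Windows 7+)
#endif
#include <windows.h>
#endif

#include <cstddef>
#include <vector>

// Hypothetical helper: choose a processor group for search thread `idx`.
// Groups are filled in order (group 0 up to its size, then group 1, ...);
// with more threads than logical processors, the assignment wraps around.
int groupForThread(std::size_t idx, const std::vector<unsigned>& groupSizes) {
    std::size_t total = 0;
    for (unsigned n : groupSizes) total += n;
    if (total == 0) return 0;  // nothing reported: fall back to group 0
    std::size_t slot = idx % total;
    for (std::size_t g = 0; g < groupSizes.size(); ++g) {
        if (slot < groupSizes[g]) return static_cast<int>(g);
        slot -= groupSizes[g];
    }
    return 0;  // not reached when total > 0
}

#ifdef _WIN32
// Hypothetical helper: pin the calling thread to processor group `group`,
// letting it run on any of that group's active logical processors.
// Assumes those processors occupy the low-order bits of the affinity mask.
bool bindThisThreadToGroup(WORD group) {
    const DWORD count = GetActiveProcessorCount(group);
    if (count == 0) return false;  // invalid group number or query failed
    const DWORD maskBits = sizeof(KAFFINITY) * 8;
    GROUP_AFFINITY affinity = {};
    affinity.Group = group;
    affinity.Mask = count >= maskBits ? ~KAFFINITY(0)
                                      : (KAFFINITY(1) << count) - 1;
    return SetThreadGroupAffinity(GetCurrentThread(), &affinity, nullptr) != 0;
}
#endif
```

On Windows, the group sizes would be collected with `GetActiveProcessorGroupCount()` and `GetActiveProcessorCount(g)` for each group `g`. Each search thread would then call `bindThisThreadToGroup(static_cast<WORD>(groupForThread(idx, sizes)))` when it starts. Filling groups in order mirrors the description above; spreading threads round-robin across groups would be an equally plausible policy.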