hero2017 commented 4 years ago

I'm testing SF on a Win'10 Pro PC (x64) for the first time and, checking Resource Monitor while it runs, I only get about 80,000 kN/s (using the modern build from abrok, since popcnt is supposed to be faster). I have set the engine to 128 threads, as this is a dual EPYC 7742 (retail, not ES or QS; 2 × 64 cores). Hash is set to 8192 MB for testing purposes, SMT is disabled, and the machine has 32GB of RAM.

NOTE: I've also tried enabling SMT in bios and SF is running at 100% cpu with 256 threads. But the speed only goes to about 90,000 kN/s. According to benchmarks on ipman chess amd I should be getting about 190,000 kN/s using 128 threads or 270,000 kN/s using 256 threads (with SMT enabled).

vondele commented 4 years ago

Almost certainly related to the fact you run on Win'10 Pro.

hero2017 commented 4 years ago

Possibly yes.

I thought it might be Win'10 too so I just tried Win'2012R2 and there's almost no difference in speed.

Vondele got 186,000 kN/s with 128 cores although that was using Linux. But I'm quite sure that Win'2012R2 wouldn't degrade SF this much. I was expecting ~250,000 kN/s with 256 threads, not 80,000 kN/s.

Could performance suffer so much because I'm testing with 2166MHz RAM and only 4 DIMMs, instead of all 16 DIMM slots populated at 3200MHz? I doubt it, but I currently don't have 3200MHz RAM, so I can't say for sure.

vondele commented 4 years ago

no, I'm sure it is not the memory frequency.

vondele commented 4 years ago

(or almost sure.... 4dimms, I don't know).

vondele commented 4 years ago

I guess it is a bit of a stretch, but can you try with Linux installed?

hero2017 commented 4 years ago

Thanks, I may try Linux. But I'd be more curious if you also get low performance when using SF in Windows? Do you have Windows that you can quickly boot to? For the purposes of buying this hardware Linux won't work for me. So if you get 240 instead of 270 in Windows then I know 100% it's something with my system.

Btw, what motherboard do you have, and do you use all 16 dimms with 3200MHz?

vondele commented 4 years ago

no windows is not possible for me ...

hero2017 commented 4 years ago

Ok that's fine. I'll try Linux.

Could you please tell me your motherboard and if you use all 16 dimms with 3200MHz ram? I'm just trying to figure out why I have much slower speeds.

vondele commented 4 years ago

I don't know hardware details, just had remote access. The Linux test will tell a lot, I assume.

joergoster commented 4 years ago

@hero2017 Before installing and trying Linux, can you answer the following questions: Are you using a GUI? If yes, which? If not, what are the exact commands you're issuing?

It would also help to know the kN/s numbers for 16T/32T/48T/64T/72T/96T/128T, so you/we would know when the expected gain of speed starts to fail. Edit: If everything is fine at 16Threads/32Threads it very likely is a NUMA/processor groups issue ... Edit 2: If everything is fine up to 64 Threads ...
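For context on why 64 threads is a natural boundary to test: Windows places logical processors into processor groups of at most 64 each, and on the Windows versions discussed here a process that does not explicitly spread its threads across groups can end up confined to one group. A minimal sketch of that arithmetic follows. The helper name is made up for illustration, and it only gives a lower bound on the group count, not how Windows actually lays out groups on a particular machine.

```python
import math

# Windows processor groups hold at most 64 logical processors each.
MAX_LOGICAL_PER_GROUP = 64


def min_processor_groups(logical_processors: int) -> int:
    """Lower bound on the number of processor groups (hypothetical helper)."""
    if logical_processors < 1:
        raise ValueError("need at least one logical processor")
    return math.ceil(logical_processors / MAX_LOGICAL_PER_GROUP)


# Thread counts relevant to a dual 64-core machine, SMT off and on.
for n in (64, 128, 256):
    print(f"{n} logical processors -> at least {min_processor_groups(n)} group(s)")
```

If throughput stops scaling right at 64 threads, that would point toward group handling rather than memory or clock speed.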

hero2017 commented 4 years ago

This system has 4 numa nodes. Here are the results from Windows 10 Enterprise you requested. Bit better results but still far from the expected 250,000 kN/s range. Using Aquarium (64bit) gui though I always get about 75-80,000 kN/s at various positions, regardless if SMT is on/off:

bench 1024 128 26
Total time (ms) : 60808
Nodes searched  : 8002973623
Nodes/second    : 131610538

bench 1024 96 26
Total time (ms) : 68484
Nodes searched  : 7202818279
Nodes/second    : 105175198

bench 1024 72 26
Total time (ms) : 75230
Nodes searched  : 6643786337
Nodes/second    : 88312991

bench 1024 64 26
Total time (ms) : 68490
Nodes searched  : 6158029504
Nodes/second    : 89911366

bench 1024 48 26
Total time (ms) : 69673
Nodes searched  : 4711725834
Nodes/second    : 67626280

bench 1024 32 26
Total time (ms) : 72692
Nodes searched  : 3750927581
Nodes/second    : 51600280

bench 1024 16 26
Total time (ms) : 92690
Nodes searched  : 2419194534
Nodes/second    : 26099843

256 cores (SMT/HT on):
Total time (ms) : 119256
Nodes searched  : 17602389891
Nodes/second    : 147601713

In the GUI, running infinite analysis with 256 cores (SMT/HT on) in several positions, I still only get about 75000000, which is very disappointing. I don't have this problem on my other system running dual Xeon 2696v3 (64 cores).
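To see where the gain from extra threads starts to fall off, the bench figures above can be turned into per-thread throughput. This is only a rough sketch: the numbers are copied from the 16T to 128T runs in this comment, and the choice of the 16-thread run as the baseline is an assumption made for illustration.

```python
# Nodes/second reported by the bench runs above, keyed by thread count.
bench_nps = {
    16: 26099843,
    32: 51600280,
    48: 67626280,
    64: 89911366,
    72: 88312991,
    96: 105175198,
    128: 131610538,
}


def scaling_efficiency(results, baseline_threads=16):
    """Per-thread nps at each thread count, relative to the baseline run."""
    base_per_thread = results[baseline_threads] / baseline_threads
    return {
        threads: (nps / threads) / base_per_thread
        for threads, nps in sorted(results.items())
    }


efficiency = scaling_efficiency(bench_nps)
for threads, eff in efficiency.items():
    per_thread_mnps = bench_nps[threads] / threads / 1e6
    print(f"{threads:>4} threads: {per_thread_mnps:.2f} Mnps/thread "
          f"({eff:.0%} of the 16-thread per-thread speed)")
```

With these figures, per-thread speed holds up at 32 threads and then declines steadily, with no gain at all between 64 and 72 threads.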

vondele commented 4 years ago

So that suggests there are probably two independent effects. One is the GUI, which might be doing something wrong (giving you 75000000), and the other shows up outside of the GUI. I suggest focusing on the latter first.

Outside of the GUI, the numbers are somewhat better (131610538 and 147601713), but not quite as good as the numbers posted in #2448 (186631199 and 259440481). Not sure if the latter difference is OS or hardware related (not all DIMM slots occupied).

hero2017 commented 4 years ago

Thanks. I'm focusing on outside the gui. I even compiled SF manually using ARCH=x86-64-modern and added large pages but issue remains. As for the dimms, I just can't see SF slowing down so much just because not all dimms are being used or that they're 2166MHz. That's why I'm not ready to buy 16 x 3200MHz ram yet as that's an expensive upgrade and if it doesn't improve the speeds then it's a total waste.

However at this point I'm not sure what else to try other than Win'2019 Server but that should not make a difference.

hero2017 commented 4 years ago

I've now installed Win'2019 Std and with SMT/HT enabled I got:

bench 1024 256 26

Total time (ms) : 283712
Nodes searched  : 23761859393
Nodes/second    : 83753452

And this is with LP enabled. This has to be a Numa/Processor Group issue in Windows only. If it helps you to know, Win'2019 is reporting 4 processors even though I only have 2. This is likely because SMT is on and I don't think SF is taking advantage of it all.

MichaelB7 commented 4 years ago

Your numbers are worse than a 3970x running Windows 10 Pro. I’m curious as to how much a system like that would cost - if you are in a position to share. Not the exact amount - just a ball park number. Linux is faster than Windows on the 3970x.

MichaelB7 commented 4 years ago

Windows 10 Pro on a 3970X with Large Pages enabled, cur-dev-Stockfish (forked, large pages code added):

Total time (ms) : 57605
Nodes searched  : 5609968834
Nodes/second    : 97386838

 Stockfish-dev-040820 b  1024 64 26 >/dev/null
info string Hash LargePages 256 Mb
info string Hash LargePages 1024 Mb

Position: 1/47

profound value: 0

Nodes/Second: 85071k

Position: 2/47

profound value: 0

Nodes/Second: 92142k

Position: 3/47

profound value: 0

Nodes/Second: 140113k

Position: 4/47

profound value: 0

Nodes/Second: 101810k

Position: 5/47

profound value: 0

Nodes/Second: 90468k

Position: 6/47

profound value: 0

Nodes/Second: 91478k

Position: 7/47

profound value: 0

Nodes/Second: 96036k

Position: 8/47

profound value: 0

Nodes/Second: 98936k

Position: 9/47

profound value: 0

Nodes/Second: 89310k

Position: 10/47

profound value: 0

Nodes/Second: 104962k

Position: 11/47

profound value: 0

Nodes/Second: 86187k

Position: 12/47

profound value: 0

Nodes/Second: 86914k

Position: 13/47

profound value: 0

Nodes/Second: 97420k

Position: 14/47

profound value: 0

Nodes/Second: 89064k

Position: 15/47

profound value: 0

Nodes/Second: 104003k

Position: 16/47

profound value: 0

Nodes/Second: 104655k

Position: 17/47

profound value: 0

Nodes/Second: 130053k

Position: 18/47

profound value: 0

Nodes/Second: 154036k

Position: 19/47

profound value: 0

Nodes/Second: 158077k

Position: 20/47

profound value: 0

Nodes/Second: 143797k

Position: 21/47

profound value: 0

Nodes/Second: 175781k

Position: 22/47

profound value: 0

Nodes/Second: 176506k

Position: 23/47

profound value: 0

Nodes/Second: 192224k

Position: 24/47

profound value: 0

Nodes/Second: 130770k

Position: 25/47

profound value: 0

Nodes/Second: 180774k

Position: 26/47

profound value: 0

Nodes/Second: 123887k

Position: 27/47

profound value: 0

Nodes/Second: 135375k

Position: 28/47

profound value: 0

Nodes/Second: 118885k

Position: 29/47

profound value: 0

Nodes/Second: 110459k

Position: 30/47

profound value: 0

Nodes/Second: 133671k

Position: 31/47

profound value: 0

Nodes/Second: 105537k

Position: 32/47

profound value: 0

Nodes/Second: 94443k

Position: 33/47

profound value: 0

Nodes/Second: 90062k

Position: 34/47

profound value: 0

Nodes/Second: 91027k

Position: 35/47

profound value: 0

Nodes/Second: 119750k

Position: 36/47

profound value: 0

Nodes/Second: 156672k

Position: 37/47

profound value: 0

Nodes/Second: 142474k

Position: 38/47

profound value: 0

Nodes/Second: 140494k

Position: 39/47

profound value: 0

Nodes/Second: 140607k

Position: 40/47

profound value: 0

Nodes/Second: 140212k

Position: 41/47

profound value: 0

Nodes/Second: 149069k

Position: 42/47

profound value: 0

Nodes/Second: 128384k

Position: 43/47

profound value: 0

Nodes/Second: 66967k

Position: 44/47

profound value: 0

Nodes/Second: 73393k

Position: 45/47
Nodes/Second: 0k

Position: 46/47
Nodes/Second: 0k

Position: 47/47

profound value: 0

Nodes/Second: 81817k

===========================
Total time (ms) : 57605
Nodes searched  : 5609968834
Nodes/second    : 97386838

hero2017 commented 4 years ago

> Your numbers are worse than a 3970x running Windows 10 Pro. I’m curious as to how much a system like that would cost - if you are in a position to share. Not the exact amount - just a ball park number. Linux is faster than Windows on the 3970x.

Yes no kidding. I'm in Canada so the price was even worse but after duties, taxes, and all accessories to build this system it's about $10,000 cad, not including ram yet. Obviously I didn't spend this much to get 171000 kN/s or even worse, 80000 kN/s in Windows which is the primary purpose of this build. So as you can see without the help of the stockfish team I'm screwed with this system. If it helps I'd be willing to provide access to it so the team can add/update any code for numa/processor groups and do testing.

I installed Linux CENTOS (8.1). Downloaded Stockfish Linux Modern (abrok):

./stockfish_20040717_x64_modern bench 1024 256 26
Total time (ms) : 74173
Nodes searched  : 12695836303
Nodes/second    : 171165198

An improvement but still far from 250000 kN/s which is about what I should get with SMT enabled.

vondele commented 4 years ago

on the linux install can you cat /proc/cpuinfo ?

hero2017 commented 4 years ago

Sure, it's quite large so I've attached it here: info.txt This is with SMT/HT enabled (256 threads)

hero2017 commented 4 years ago

Below is Linux with SMT/HT disabled. There is basically no difference between 128 and 256 threads in terms of nps, BUT total time is much lower than with HT enabled (48626 ms with HT off compared to 74173 ms with HT on):

./stockfish_20040717_x64_modern bench 1024 128 26
Total time (ms) : 48626
Nodes searched  : 7971090143
Nodes/second    : 163926503
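The lower total time does not contradict the similar nps: bench runs to a fixed depth, and the two runs searched very different numbers of nodes. A quick check in Python, assuming Nodes/second is simply nodes searched divided by elapsed time (an assumption that reproduces the reported figures, not something confirmed from the source code here):

```python
def nodes_per_second(nodes: int, elapsed_ms: int) -> int:
    """Nodes searched per second, using integer division on milliseconds."""
    return nodes * 1000 // elapsed_ms


# (nodes searched, total time in ms) from the two CentOS runs in this thread.
runs = {
    "256 threads, SMT on": (12695836303, 74173),
    "128 threads, SMT off": (7971090143, 48626),
}

for label, (nodes, elapsed_ms) in runs.items():
    nps = nodes_per_second(nodes, elapsed_ms)
    print(f"{label}: {nodes / 1e9:.2f}G nodes in {elapsed_ms / 1000:.1f}s "
          f"-> {nps} nps")
```

The SMT-on run searched roughly 60% more nodes, which accounts for the longer wall time at nearly the same node rate.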

MichaelB7 commented 4 years ago

@hero2017 I hope it gets resolved to your satisfaction. I believe 240M/250M nps should be within reach under Linux.

hero2017 commented 4 years ago

I don't think there's anything more for me to try. My only hope is that it's because I'm using 2166MHz ram. I spent so much money I might as well fork out another $2K for 16*3200MHz ram and pray this makes a big difference.

Other than that, I take it you don't see me doing anything wrong with my testing under Linux. Curious, how do I get the latest SF dev with LP so that I can try testing that, perhaps without me compiling under Linux?

noobpwnftw commented 4 years ago

@hero2017 If you do not disable THP(transparent huge pages) under Linux, then it should transparently make the applications use large pages if possible. Might need to let the engine run for a while before the page migrations take place.
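One way to confirm the THP setting on a Linux box is to read the kernel's sysfs entry, where the active mode is the one shown in square brackets. A small sketch, assuming the standard path `/sys/kernel/mm/transparent_hugepage/enabled`; the helper name is hypothetical:

```python
from pathlib import Path

# Standard sysfs location of the THP mode on Linux.
THP_ENABLED = Path("/sys/kernel/mm/transparent_hugepage/enabled")


def active_thp_mode(text: str) -> str:
    """Return the bracketed (active) mode, e.g. 'madvise' from 'always [madvise] never'."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    raise ValueError("no active mode found")


if THP_ENABLED.exists():
    print("THP mode:", active_thp_mode(THP_ENABLED.read_text()))
else:
    print("THP sysfs entry not found (not Linux, or THP not available)")
```

A result of `always` means the kernel may back any suitable mapping with huge pages, while `madvise` limits it to regions the application explicitly requests and `never` turns it off.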

vondele commented 4 years ago

So, I asked around: populating only 4 DIMM slots out of 16 will be bad for bandwidth, probably giving only 1/4 (the dual-socket EPYC 7742 should have 16 memory channels, and you might be using only 4, IIUC). Additionally, the frequency is a further factor of 2/3. So you might be at 1/6 of the memory bandwidth of the system I tested. I don't know exactly how that impacts the performance of SF, but it could be significant.
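The 1/6 estimate can be reproduced with a couple of lines of Python. This is a back-of-the-envelope sketch that assumes bandwidth scales linearly with the number of populated channels and with memory clock, and that each installed DIMM sits on its own channel; the function name is made up for illustration.

```python
def relative_bandwidth(channels_used, channels_total, mhz_used, mhz_reference):
    """Fraction of reference memory bandwidth, assuming linear scaling."""
    channel_factor = channels_used / channels_total
    frequency_factor = mhz_used / mhz_reference
    return channel_factor * frequency_factor


# 4 of 16 channels populated, 2166MHz instead of 3200MHz.
estimate = relative_bandwidth(4, 16, 2166, 3200)
print(f"channel factor:   {4 / 16:.2f}")
print(f"frequency factor: {2166 / 3200:.2f}")
print(f"combined:         {estimate:.3f} (1/6 is {1 / 6:.3f})")
```

Real-world bandwidth also depends on ranks, timings and NUMA placement, so this is a rough upper-level estimate rather than a measurement.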

hero2017 commented 4 years ago

@noobpwnftw I didn't disable anything. New CentOS, downloaded Linux engine and ran benchmark. That's it.

@vondele, @noobpwnftw But here's what I've learned since. I've installed Win'10 Enterprise again. Now I get about 180,000 kN/s with LP enabled, which still isn't 250,000 kN/s, but that could be because I'm only using 8 of 16 2166MHz DIMMs (4 channel) instead of 16 DIMMs @ 3200MHz (8 channel). However, I get this speed only in the Aquarium GUI. When I quit the GUI, double-click the SF exe and run bench 1024 256 26, I always get only about 50,000 kN/s. Could there be something wrong with the bench section of the code, maybe not using the threads properly, or NUMA, or something?

hero2017 commented 4 years ago

Guys, in Aquarium I'm now getting about 180,000 kN/s with HT and LP enabled (although I don't see any difference with LP disabled) with latest SF dev using Win'10 Ent for many positions.

The reason why my bench results were so poor was because I was running them with 1024 MB hash. Check this out, with LP enabled and 32GB hash:

===========================
Total time (ms) : 54464
Nodes searched  : 11764651729
Nodes/second    : 216007853

Now with asmFish:

asmFishW_2017-05-22_popcnt
setoption name largepages value true
bench 32768 256 26
bench hash 32768 threads 256 depth 26 realtime 0
info string hash set to 32768 MB page size 2048 KB
 1: nodes: 1099654648 256210 knps
 2: nodes: 3541528719 281722 knps
 3: nodes: 52490671 293243 knps
 4: nodes: 821501556 296892 knps
 5: nodes: 1108391488 274354 knps
 6: nodes: 931360048 270037 knps
 7: nodes: 639670632 280311 knps
 8: nodes: 3933248779 289379 knps
 9: nodes: 1844261779 274117 knps
10: nodes: 443375274 300186 knps
11: nodes: 1943369726 276086 knps
12: nodes: 3372899847 266569 knps
13: nodes: 515207197 295416 knps
14: nodes: 4767216586 275243 knps
15: nodes: 585526534 294086 knps
16: nodes: 654540305 339668 knps
17: nodes: 84681061 320761 knps
18: nodes: 226750881 284505 knps
19: nodes: 108675223 275127 knps
20: nodes: 1349868025 329638 knps
21: nodes: 68275741 254760 knps
22: nodes: 133301813 224413 knps
23: nodes: 513561672 281866 knps
24: nodes: 736776192 297446 knps
25: nodes: 3757645 163375 knps
26: nodes: 23393845 311917 knps
27: nodes: 48054093 233272 knps
28: nodes: 583293177 331228 knps
29: nodes: 369188513 285528 knps
30: nodes: 83286534 302860 knps
31: nodes: 20867790 198740 knps
32: nodes: 15393663 181101 knps
33: nodes: 8694542 189011 knps
34: nodes: 24335423 236266 knps
35: nodes: 14902997 186287 knps
36: nodes: 13943550 217867 knps
37: nodes: 14931262 261951 knps

Total time (ms) : 108936
Nodes searched  : 30700177431
Nodes/second    : 281818475

Quite a big difference! That's about 30% faster. Could it be a NUMA/processor groups bug in SF with this particular system? Btw, I also noticed that with asmFishW I get 3.25 GHz per node and with SF I only get 2.70 GHz per node.

Lastly, if HT is enabled SF reports 5% faster speeds than without HT. However, with HT enabled asmFish reports 30% faster speeds than without HT.
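The "about 30% faster" figure follows directly from the two Nodes/second totals above. A short sketch of the calculation; it compares node rates only, since the two engines searched different node counts and their total times are not directly comparable. The function name is hypothetical.

```python
def percent_faster(new_nps: float, old_nps: float) -> float:
    """How much faster new_nps is than old_nps, in percent."""
    return (new_nps / old_nps - 1.0) * 100.0


stockfish_nps = 216007853  # SF, LP enabled, 32GB hash, 256 threads
asmfish_nps = 281818475    # asmFish, large pages, 32GB hash, 256 threads

print(f"asmFish vs SF: {percent_faster(asmfish_nps, stockfish_nps):.1f}% faster")
```

Note that nps from different engines is not a measure of playing strength, only of how many nodes each visits per second on this hardware.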

hero2017 commented 4 years ago

Was asked for Linux results earlier so here they are from Ubuntu 18.04 on this machine using today's SF which I compiled on this machine:

(SMT enabled in bios)

./stockfish bench 1024 256 26
===========================
Total time (ms) : 94497
Nodes searched  : 11660727471
Nodes/second    : 123397858

This was from the 2nd run. The 1st run gave me only 98000000.

# uname -r
5.3.0-46-generic

# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 18.04.4 LTS
Release:        18.04
Codename:       bionic

# cat /proc/cpuinfo
processor       : 255
vendor_id       : AuthenticAMD
cpu family      : 23
model           : 49
model name      : AMD EPYC 7742 64-Core Processor
stepping        : 0
microcode       : 0x8301034
cpu MHz         : 1500.058
cache size      : 512 KB
physical id     : 1
siblings        : 128
core id         : 63
cpu cores       : 64
apicid          : 255
initial apicid  : 255
fpu             : yes
fpu_exception   : yes
cpuid level     : 16
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate sme ssbd mba sev ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif umip rdpid overflow_recov succor smca
bugs            : sysret_ss_attrs spectre_v1 spectre_v2 spec_store_bypass
bogomips        : 4493.31
TLB size        : 3072 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 43 bits physical, 48 bits virtual
power management: ts ttp tm hwpstate cpb eff_freq_ro [13] [14]

If I disable SMT (HT) in bios I get about 171,000,000 nps so it seems that with HT SF really suffers on this machine.

Let me know what else I can do to help resolve the slow speed issue on this machine, especially with HT enabled, even under Ubuntu. I have Win'10 Ent and Ubuntu 18.04 dual booting so I can run tests in either OS easily.

vondele commented 4 years ago

is the speed issue now resolved? If so, please, close this issue. I'll close #2448 as a duplicate

vondele commented 4 years ago

I'll consider this fixed as #2656 gets merged.

hero2017 commented 4 years ago

Wow. I have waited years for Large Pages support in SF. I ran a quick test between this version with LP and the previous version without LP, and the difference is night and day on this dual EPYC 7742 system.

On my previous hardware in the years past, LP usually provided a nice 10-15% speed boost. On my current hardware, with 4 NUMA nodes and 256 threads I'm getting an awesome boost:

stockfish_20051319_x64_modern.exe (no LP): bench 32768 256 26

Total time (ms) : 120004
Nodes searched  : 13459948950
Nodes/second    : 112162502

stockfish_20051320_x64_modern.exe (LP): bench 32768 256 26

Total time (ms) : 44930
Nodes searched  : 9304262065
Nodes/second    : 207083509

Are you seeing this? That's an 85% speedup on this puppy. Awesome job SF team, especially vondele, whom I could kiss right now :-) I hope LP is never removed from SF, regardless of any obstacles it may cause for some.

With BuildTester, which is a better bench tool, using only 256 MB hash, 256 threads, and depth 18, I got a cool 69% speedup. With 32GB hash and 256 threads, but depth 18 instead of 26, I got a 50% speedup. This is unheard of compared to any hardware I've used SF on before, including the dual Xeon E5-2696v3, where about 12% was the usual speedup. And this is with 64GB of 2133MHz DDR4 RAM, so I suspect and hope it will be even faster whenever I get 3200MHz RAM.

samer707 commented 4 years ago

Hello, I am samer707. This version does not work in the chesspartner software.

samer707 commented 4 years ago

The version with Author: Sami Kiminki, Date: Wed May 13 20:57:47 2020 +0200, Timestamp: 1589396267 does not work in the chesspartner software.

vondele commented 4 years ago

@samer707 we'll track that in the other issue. Just keep all info there.

official-stockfish / Stockfish

Stockfish much slower speeds than expected #2619
