Closed gvreuls closed 3 years ago
This commit https://github.com/glinscott/fishtest/commit/136f7e01f48f43988e9b4cd53e390a958cfbc73e dropped the avx512 target. Please note the commit message: "Detect vnni256 targets Adjust to master Makefile with vnni256, skip avx512 and vnni512 as found to be slower in hyperthreaded execution on hardware up to cascade lake."
The new vnni256 target requires all these enabled flags from g++ -Q -march=native --help=target: -mavx512vnni -mavx512dq -mavx512f -mavx512bw -mavx512vl
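The flag check described here can be sketched in Python. This is only an illustration, not the fishtest worker's actual code: the names `enabled_target_flags`, `supports_vnni256` and `query_compiler` are hypothetical, and the required-flag set is simply copied from the list above.

```python
import subprocess

# Flags listed above as required for the vnni256 target.
VNNI256_FLAGS = {
    "-mavx512vnni",
    "-mavx512dq",
    "-mavx512f",
    "-mavx512bw",
    "-mavx512vl",
}


def enabled_target_flags(help_output: str) -> set:
    """Collect the flags marked [enabled] in `--help=target` output."""
    flags = set()
    for line in help_output.splitlines():
        parts = line.split()
        # Lines of interest look like: "-mavx512f    [enabled]"
        if len(parts) == 2 and parts[1] == "[enabled]":
            flags.add(parts[0])
    return flags


def supports_vnni256(help_output: str) -> bool:
    """True only if every required flag is reported as enabled."""
    return VNNI256_FLAGS <= enabled_target_flags(help_output)


def query_compiler(compiler: str = "g++") -> str:
    """Run the compiler query quoted above and return its stdout."""
    result = subprocess.run(
        [compiler, "-Q", "-march=native", "--help=target"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

On a machine that reports `-mavx512vnni [disabled]`, `supports_vnni256(query_compiler())` would return `False` under this sketch, even if the other four flags are enabled.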
@gvreuls This is intentional. When all threads are running, AVX2 gives more performance than AVX512 due to downclocking. See https://github.com/official-stockfish/Stockfish/pull/3038
@mstembera Well I don't have VNNI, just plain AVX512 (SkyLake-X), and performance dropped for me by a few percent while CPU temperature rose about 10C since switching to AVX2. If we don't go back to AVX512 I'll have to downclock my system manually and lose still more performance because my CPU is running too hot now (and it's the only AVX512 box on fishtest).
If you don't want to re-enable AVX512 then please re-enable custom_make.txt because as things are right now you're forcing me to use sub-optimal options for my box.
Can you post performance numbers for a bench with avx2 and avx512: single threaded, and multithreaded with threads=cores and threads=hyperthreads? If we additionally have a way to enable it based on what gcc detects (output of -march=native --help -Q, uniquely identifying the architecture), we can easily enable it. So far we have seen no architecture where it is faster, only slower.
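One way to identify the architecture from what gcc detects is to read the `-march=` line of that output. A minimal sketch, assuming the output has one option and one value per line; `detected_march` is a hypothetical helper, not an existing fishtest function:

```python
from typing import Optional


def detected_march(help_output: str) -> Optional[str]:
    """Return the value gcc prints next to `-march=`, or None if absent."""
    for line in help_output.splitlines():
        parts = line.split()
        # The line of interest looks like: "-march=    skylake-avx512"
        if len(parts) == 2 and parts[0] == "-march=":
            return parts[1]
    return None
```

A worker could then compare the returned string against a list of architectures on which avx512 is known to be faster. Which names belong on such a list is a question this thread leaves open.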
cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 85
model name : Intel(R) Core(TM) i7-7820X CPU @ 3.60GHz
stepping : 4
microcode : 0x2006906
cpu MHz : 1199.996
cache size : 11264 KB
physical id : 0
siblings : 16
core id : 0
cpu cores : 8
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 22
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req md_clear flush_l1d
vmx flags : vnmi preemption_timer posted_intr invvpid ept_x_only ept_ad ept_1gb flexpriority apicv tsc_offset vtpr mtf vapic ept vpid unrestricted_guest vapic_reg vid ple ept_mode_based_exec tsc_scaling
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa itlb_multihit
bogomips : 7200.00
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
gcc -march=native -Q --help=target
The following options are target specific:
-m128bit-long-double [disabled]
-m16 [disabled]
-m32 [disabled]
-m3dnow [disabled]
-m3dnowa [disabled]
-m64 [enabled]
-m80387 [enabled]
-m8bit-idiv [disabled]
-m96bit-long-double [enabled]
-mabi= sysv
-mabm [enabled]
-maccumulate-outgoing-args [disabled]
-maddress-mode= short
-madx [enabled]
-maes [enabled]
-malign-data= compat
-malign-double [disabled]
-malign-functions= 0
-malign-jumps= 0
-malign-loops= 0
-malign-stringops [enabled]
-mandroid [disabled]
-march= skylake-avx512
-masm= att
-mavx [enabled]
-mavx2 [enabled]
-mavx256-split-unaligned-load [disabled]
-mavx256-split-unaligned-store [disabled]
-mavx5124fmaps [disabled]
-mavx5124vnniw [disabled]
-mavx512bf16 [disabled]
-mavx512bitalg [disabled]
-mavx512bw [enabled]
-mavx512cd [enabled]
-mavx512dq [enabled]
-mavx512er [disabled]
-mavx512f [enabled]
-mavx512ifma [disabled]
-mavx512pf [disabled]
-mavx512vbmi [disabled]
-mavx512vbmi2 [disabled]
-mavx512vl [enabled]
-mavx512vnni [disabled]
-mavx512vp2intersect [disabled]
-mavx512vpopcntdq [disabled]
-mbionic [disabled]
-mbmi [enabled]
-mbmi2 [enabled]
-mbranch-cost=<0,5> 0
-mcall-ms2sysv-xlogues [disabled]
-mcet-switch [disabled]
-mcld [disabled]
-mcldemote [disabled]
-mclflushopt [enabled]
-mclwb [enabled]
-mclzero [disabled]
-mcmodel= 32
-mcpu=
-mcrc32 [disabled]
-mcx16 [enabled]
-mdispatch-scheduler [disabled]
-mdump-tune-features [disabled]
-menqcmd [disabled]
-mf16c [enabled]
-mfancy-math-387 [enabled]
-mfentry [disabled]
-mfentry-name=
-mfentry-section=
-mfma [enabled]
-mfma4 [disabled]
-mforce-drap [disabled]
-mforce-indirect-call [disabled]
-mfp-ret-in-387 [enabled]
-mfpmath= 387
-mfsgsbase [enabled]
-mfunction-return= keep
-mfused-madd -ffp-contract=fast
-mfxsr [enabled]
-mgeneral-regs-only [disabled]
-mgfni [disabled]
-mglibc [enabled]
-mhard-float [enabled]
-mhle [enabled]
-miamcu [disabled]
-mieee-fp [enabled]
-mincoming-stack-boundary= 0
-mindirect-branch-register [disabled]
-mindirect-branch= keep
-minline-all-stringops [disabled]
-minline-stringops-dynamically [disabled]
-minstrument-return= none
-mintel-syntax -masm=intel
-mlarge-data-threshold=<number> 65536
-mlong-double-128 [disabled]
-mlong-double-64 [disabled]
-mlong-double-80 [enabled]
-mlwp [disabled]
-mlzcnt [enabled]
-mmanual-endbr [disabled]
-mmemcpy-strategy=
-mmemset-strategy=
-mmitigate-rop [disabled]
-mmmx [enabled]
-mmovbe [enabled]
-mmovdir64b [disabled]
-mmovdiri [disabled]
-mmpx [disabled]
-mms-bitfields [disabled]
-mmusl [disabled]
-mmwaitx [disabled]
-mno-align-stringops [disabled]
-mno-default [disabled]
-mno-fancy-math-387 [disabled]
-mno-push-args [disabled]
-mno-red-zone [disabled]
-mno-sse4 [disabled]
-mnop-mcount [disabled]
-momit-leaf-frame-pointer [disabled]
-mpc32 [disabled]
-mpc64 [disabled]
-mpc80 [disabled]
-mpclmul [enabled]
-mpcommit [disabled]
-mpconfig [disabled]
-mpku [disabled]
-mpopcnt [enabled]
-mprefer-avx128 -mprefer-vector-width=128
-mprefer-vector-width= none
-mpreferred-stack-boundary= 0
-mprefetchwt1 [disabled]
-mprfchw [enabled]
-mptwrite [disabled]
-mpush-args [enabled]
-mrdpid [disabled]
-mrdrnd [enabled]
-mrdseed [enabled]
-mrecip [disabled]
-mrecip=
-mrecord-mcount [disabled]
-mrecord-return [disabled]
-mred-zone [enabled]
-mregparm= 0
-mrtd [disabled]
-mrtm [enabled]
-msahf [enabled]
-msgx [disabled]
-msha [disabled]
-mshstk [disabled]
-mskip-rax-setup [disabled]
-msoft-float [disabled]
-msse [enabled]
-msse2 [enabled]
-msse2avx [disabled]
-msse3 [enabled]
-msse4 [enabled]
-msse4.1 [enabled]
-msse4.2 [enabled]
-msse4a [disabled]
-msse5 -mavx
-msseregparm [disabled]
-mssse3 [enabled]
-mstack-arg-probe [disabled]
-mstack-protector-guard-offset=
-mstack-protector-guard-reg=
-mstack-protector-guard-symbol=
-mstack-protector-guard= tls
-mstackrealign [disabled]
-mstringop-strategy= [default]
-mstv [disabled]
-mtbm [disabled]
-mtls-dialect= gnu
-mtls-direct-seg-refs [enabled]
-mtune-ctrl=
-mtune= skylake-avx512
-muclibc [disabled]
-mvaes [disabled]
-mveclibabi= [default]
-mvect8-ret-in-mem [disabled]
-mvpclmulqdq [disabled]
-mvzeroupper [disabled]
-mwaitpkg [disabled]
-mwbnoinvd [disabled]
-mx32 [disabled]
-mxop [disabled]
-mxsave [enabled]
-mxsavec [enabled]
-mxsaveopt [enabled]
-mxsaves [enabled]
sudo perf stat -r 10 -a -B -e cycles:u,instructions:u ./stockfish-avx512 bench 16 1 > /dev/null
Performance counter stats for 'system wide' (10 runs):
7.984.399.248 cycles:u ( +- 2,88% )
13.462.954.663 instructions:u # 1,69 insn per cycle ( +- 4,38% )
1,8676 +- 0,0111 seconds time elapsed ( +- 0,59% )
sudo perf stat -r 10 -a -B -e cycles:u,instructions:u ./stockfish-bmi2 bench 16 1 > /dev/null
Performance counter stats for 'system wide' (10 runs):
7.953.320.884 cycles:u ( +- 0,22% )
13.670.898.836 instructions:u # 1,72 insn per cycle ( +- 0,08% )
1,87500 +- 0,00787 seconds time elapsed ( +- 0,42% )
sudo perf stat -r 10 -a -B -e cycles:u,instructions:u ./stockfish-avx512 bench 16 8 > /dev/null
Performance counter stats for 'system wide' (10 runs):
28.078.867.819 cycles:u ( +- 1,05% )
41.849.523.695 instructions:u # 1,49 insn per cycle ( +- 1,19% )
0,99885 +- 0,00802 seconds time elapsed ( +- 0,80% )
sudo perf stat -r 10 -a -B -e cycles:u,instructions:u ./stockfish-bmi2 bench 16 8 > /dev/null
Performance counter stats for 'system wide' (10 runs):
28.146.907.933 cycles:u ( +- 1,28% )
44.624.621.772 instructions:u # 1,59 insn per cycle ( +- 1,35% )
1,0039 +- 0,0112 seconds time elapsed ( +- 1,12% )
sudo perf stat -r 10 -a -B -e cycles:u,instructions:u ./stockfish-avx512 bench 16 16 > /dev/null
Performance counter stats for 'system wide' (10 runs):
60.870.155.365 cycles:u ( +- 1,83% )
61.932.104.319 instructions:u # 1,02 insn per cycle ( +- 1,79% )
1,1273 +- 0,0173 seconds time elapsed ( +- 1,54% )
sudo perf stat -r 10 -a -B -e cycles:u,instructions:u ./stockfish-bmi2 bench 16 16 > /dev/null
Performance counter stats for 'system wide' (10 runs):
65.112.791.024 cycles:u ( +- 1,75% )
68.054.836.804 instructions:u # 1,05 insn per cycle ( +- 1,72% )
1,1835 +- 0,0172 seconds time elapsed ( +- 1,45% )
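To make the six runs above easier to compare, here is a small sketch that turns the mean elapsed times into relative differences. The numbers are copied from the perf output above. The sketch assumes the second bench argument (1, 8, 16) is the thread count, and `speedup_percent` is a hypothetical helper name.

```python
# Mean elapsed seconds from the perf stat runs above,
# keyed by thread count: (avx512 build, bmi2 build).
ELAPSED = {
    1: (1.8676, 1.87500),
    8: (0.99885, 1.0039),
    16: (1.1273, 1.1835),
}


def speedup_percent(avx512_s: float, bmi2_s: float) -> float:
    """How much faster (in %) the avx512 build finished than the bmi2 build."""
    return (bmi2_s / avx512_s - 1.0) * 100.0


for threads, (avx512_s, bmi2_s) in ELAPSED.items():
    diff = speedup_percent(avx512_s, bmi2_s)
    print(f"{threads:2d} threads: avx512 faster by {diff:.1f}%")
```

This gives roughly 0.4% at 1 thread, 0.5% at 8 threads and 5.0% at 16 threads. The first two differences are about the size of the run-to-run variation perf reports (0.4% to 1.1%), so only the 16-thread gap clearly exceeds the noise.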
I suspect that processors with very large core counts suffer from downclocking more severely than mainstream ones. https://stackoverflow.com/questions/56852812/simd-instructions-lowering-cpu-frequency shows a sample chart for a 14-core CPU. Unfortunately the link to more CPUs doesn't have everything.
One thing I forgot to mention: my system downclocks AVX2 as well as AVX512 by the same 300 MHz. Strangely enough it runs stockfish much cooler in AVX512 mode than in AVX2 mode (the aforementioned 10C).
So it's 300MHz regardless of the number of threads being used? Is that something you manually set in the BIOS?
Are you adding -march=native to the compiler options?
@mstembera It's an ASUS TUF board. When I run the automatic optimizer, the only thing it changes is that it sets the CPU Core Ratio to "by Specific Core". This adds 300 MHz to the clock and sets the AVX512 downclock from 500 MHz to 300 MHz. This isn't considered overclocking BTW, the clock speed is still 200 MHz below the spec maximum.
@vondele I compiled them exactly as they would on fishtest. I can repeat the perf runs with -march=native if you want me to, but I doubt it will bring much.
No need to add '-march=native'. It would have explained why an avx2 compile downclocks, if the compiler were adding some avx512 code elsewhere.
@mstembera Sorry if this annoys you, I ran that board optimizer quite some time ago when I last updated the BIOS and didn't bother to check what it changed exactly until I rebooted, set back things to default and ran the optimizer again just now.
I'm neither annoyed nor criticizing your setup, just trying to understand it better. Even though avx512 is really only relevant to Skylake (older CPUs don't support it and newer ones have vnni), we may have to accept that, depending on other factors, it may or may not be faster than avx2. Not sure how best to decide for fishtest.
avx512 enabled by #861
Since the upgrade to version 85 my worker is incorrectly reported as an AVX2/BMI2 machine; before the upgrade it was correctly detected as an AVX512 box.
In this test you can see how the worker switches from AVX512 before the update to AVX2/BMI2 afterwards: https://tests.stockfishchess.org/tests/view/5f494bea3def640786115336 (Look for the gvreuls worker; it used to be the only AVX512 worker on fishtest as far as I'm aware.)