official-stockfish / fishtest

The Stockfish testing framework
https://tests.stockfishchess.org/tests

Server doesn't detect AVX512 workers correctly anymore #782

Closed: gvreuls closed this issue 3 years ago

gvreuls commented 4 years ago

Since the upgrade to version 85, my worker is incorrectly reported as an AVX2/BMI2 machine; before the upgrade it was correctly detected as an AVX512 box.

In this test you can see how the worker switches from AVX512 before the update to AVX2/BMI2 afterwards: https://tests.stockfishchess.org/tests/view/5f494bea3def640786115336 (Look for the gvreuls worker; it used to be the only AVX512 worker on fishtest, as far as I'm aware.)

ppigazzini commented 4 years ago

This commit https://github.com/glinscott/fishtest/commit/136f7e01f48f43988e9b4cd53e390a958cfbc73e dropped the avx512 target. Please mind the commit message: "Detect vnni256 targets Adjust to master Makefile with vnni256, skip avx512 and vnni512 as found to be slower in hyperthreaded execution on hardware up to cascade lake."

The new vnni256 target requires all of these flags to be reported as enabled by g++ -Q -march=native --help=target: -mavx512vnni -mavx512dq -mavx512f -mavx512bw -mavx512vl
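The flag check described above could be sketched roughly as follows. This is not the fishtest worker's actual code: the function names `enabled_flags` and `supports_vnni256` are hypothetical, and the sketch assumes the compiler output has already been captured as text in the `option [enabled]` layout that gcc prints.

```python
import re

# Flags the comment above lists as required for the vnni256 target.
VNNI256_FLAGS = (
    "-mavx512vnni",
    "-mavx512dq",
    "-mavx512f",
    "-mavx512bw",
    "-mavx512vl",
)


def enabled_flags(help_target_output):
    """Collect the options marked [enabled] in `--help=target` output."""
    flags = set()
    for line in help_target_output.splitlines():
        match = re.match(r"\s*(-m[\w.\-]+)\s+\[enabled\]", line)
        if match:
            flags.add(match.group(1))
    return flags


def supports_vnni256(help_target_output):
    """True only if every required flag is reported as enabled."""
    return set(VNNI256_FLAGS) <= enabled_flags(help_target_output)


# Hypothetical excerpt in the style of a Skylake-X machine: AVX512 without VNNI.
SAMPLE = """\
  -mavx512bw                        [enabled]
  -mavx512dq                        [enabled]
  -mavx512f                         [enabled]
  -mavx512vl                        [enabled]
  -mavx512vnni                      [disabled]
"""

print(supports_vnni256(SAMPLE))
```

With a check like this, a machine that has AVX512 but lacks VNNI fails the vnni256 test and falls through to whatever the next-best target is.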

mstembera commented 4 years ago

@gvreuls This is intentional. When all threads are running, AVX2 gives more performance than AVX512 due to downclocking. See https://github.com/official-stockfish/Stockfish/pull/3038

gvreuls commented 4 years ago

@mstembera Well, I don't have VNNI, just plain AVX512 (Skylake-X), and performance dropped for me by a few percent while CPU temperature rose about 10C since switching to AVX2. If we don't go back to AVX512, I'll have to downclock my system manually and lose still more performance because my CPU is running too hot now (and it's the only AVX512 box on fishtest).

gvreuls commented 4 years ago

If you don't want to re-enable AVX512, then please re-enable custom_make.txt, because as things are right now you're forcing me to use sub-optimal options for my box.

vondele commented 4 years ago

Can you post performance numbers for a bench with avx2 and avx512: single-threaded, multithreaded with threads=cores, and multithreaded with threads=hyperthreads? If we additionally have a way to enable it based on what gcc detects (output of -march=native --help -Q, uniquely identifying the architecture), we can easily enable it. So far we have seen no architecture where it is faster, only slower.
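As a rough illustration of the three configurations requested here, the thread counts can be read from /proc/cpuinfo on Linux. The helper names below are hypothetical, the sketch assumes a single-socket machine (where "siblings" equals the number of hardware threads), and the binary name and hash size of 16 are only example arguments.

```python
def bench_thread_counts(cpuinfo_text):
    """Return [1, physical cores, hardware threads] from /proc/cpuinfo text."""
    fields = {}
    for line in cpuinfo_text.splitlines():
        if ":" not in line:
            continue
        key, _, value = line.partition(":")
        # Keep the first occurrence only, i.e. the first processor entry.
        fields.setdefault(key.strip(), value.strip())
    cores = int(fields["cpu cores"])
    hyperthreads = int(fields["siblings"])
    return [1, cores, hyperthreads]


def bench_commands(binary, cpuinfo_text, hash_mb=16):
    """Build one `bench <hash> <threads>` command line per configuration."""
    return [
        f"./{binary} bench {hash_mb} {threads}"
        for threads in bench_thread_counts(cpuinfo_text)
    ]


# Hypothetical two-line excerpt; a real run would read /proc/cpuinfo instead.
SAMPLE = "siblings    : 16\ncpu cores   : 8\n"

for command in bench_commands("stockfish-avx512", SAMPLE):
    print(command)
```

Running the same three commands against both the avx2/bmi2 and the avx512 build gives the side-by-side comparison asked for.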

gvreuls commented 4 years ago

cat /proc/cpuinfo

processor   : 0
vendor_id   : GenuineIntel
cpu family  : 6
model       : 85
model name  : Intel(R) Core(TM) i7-7820X CPU @ 3.60GHz
stepping    : 4
microcode   : 0x2006906
cpu MHz     : 1199.996
cache size  : 11264 KB
physical id : 0
siblings    : 16
core id     : 0
cpu cores   : 8
apicid      : 0
initial apicid  : 0
fpu     : yes
fpu_exception   : yes
cpuid level : 22
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req md_clear flush_l1d
vmx flags   : vnmi preemption_timer posted_intr invvpid ept_x_only ept_ad ept_1gb flexpriority apicv tsc_offset vtpr mtf vapic ept vpid unrestricted_guest vapic_reg vid ple ept_mode_based_exec tsc_scaling
bugs        : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa itlb_multihit
bogomips    : 7200.00
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:

gcc -march=native -Q --help=target

The following options are target specific:
  -m128bit-long-double              [disabled]
  -m16                              [disabled]
  -m32                              [disabled]
  -m3dnow                           [disabled]
  -m3dnowa                          [disabled]
  -m64                              [enabled]
  -m80387                           [enabled]
  -m8bit-idiv                       [disabled]
  -m96bit-long-double               [enabled]
  -mabi=                            sysv
  -mabm                             [enabled]
  -maccumulate-outgoing-args        [disabled]
  -maddress-mode=                   short
  -madx                             [enabled]
  -maes                             [enabled]
  -malign-data=                     compat
  -malign-double                    [disabled]
  -malign-functions=                0
  -malign-jumps=                    0
  -malign-loops=                    0
  -malign-stringops                 [enabled]
  -mandroid                         [disabled]
  -march=                           skylake-avx512
  -masm=                            att
  -mavx                             [enabled]
  -mavx2                            [enabled]
  -mavx256-split-unaligned-load     [disabled]
  -mavx256-split-unaligned-store    [disabled]
  -mavx5124fmaps                    [disabled]
  -mavx5124vnniw                    [disabled]
  -mavx512bf16                      [disabled]
  -mavx512bitalg                    [disabled]
  -mavx512bw                        [enabled]
  -mavx512cd                        [enabled]
  -mavx512dq                        [enabled]
  -mavx512er                        [disabled]
  -mavx512f                         [enabled]
  -mavx512ifma                      [disabled]
  -mavx512pf                        [disabled]
  -mavx512vbmi                      [disabled]
  -mavx512vbmi2                     [disabled]
  -mavx512vl                        [enabled]
  -mavx512vnni                      [disabled]
  -mavx512vp2intersect              [disabled]
  -mavx512vpopcntdq                 [disabled]
  -mbionic                          [disabled]
  -mbmi                             [enabled]
  -mbmi2                            [enabled]
  -mbranch-cost=<0,5>               0
  -mcall-ms2sysv-xlogues            [disabled]
  -mcet-switch                      [disabled]
  -mcld                             [disabled]
  -mcldemote                        [disabled]
  -mclflushopt                      [enabled]
  -mclwb                            [enabled]
  -mclzero                          [disabled]
  -mcmodel=                         32
  -mcpu=                            
  -mcrc32                           [disabled]
  -mcx16                            [enabled]
  -mdispatch-scheduler              [disabled]
  -mdump-tune-features              [disabled]
  -menqcmd                          [disabled]
  -mf16c                            [enabled]
  -mfancy-math-387                  [enabled]
  -mfentry                          [disabled]
  -mfentry-name=                    
  -mfentry-section=                 
  -mfma                             [enabled]
  -mfma4                            [disabled]
  -mforce-drap                      [disabled]
  -mforce-indirect-call             [disabled]
  -mfp-ret-in-387                   [enabled]
  -mfpmath=                         387
  -mfsgsbase                        [enabled]
  -mfunction-return=                keep
  -mfused-madd                      -ffp-contract=fast
  -mfxsr                            [enabled]
  -mgeneral-regs-only               [disabled]
  -mgfni                            [disabled]
  -mglibc                           [enabled]
  -mhard-float                      [enabled]
  -mhle                             [enabled]
  -miamcu                           [disabled]
  -mieee-fp                         [enabled]
  -mincoming-stack-boundary=        0
  -mindirect-branch-register        [disabled]
  -mindirect-branch=                keep
  -minline-all-stringops            [disabled]
  -minline-stringops-dynamically    [disabled]
  -minstrument-return=              none
  -mintel-syntax                    -masm=intel
  -mlarge-data-threshold=<number>   65536
  -mlong-double-128                 [disabled]
  -mlong-double-64                  [disabled]
  -mlong-double-80                  [enabled]
  -mlwp                             [disabled]
  -mlzcnt                           [enabled]
  -mmanual-endbr                    [disabled]
  -mmemcpy-strategy=                
  -mmemset-strategy=                
  -mmitigate-rop                    [disabled]
  -mmmx                             [enabled]
  -mmovbe                           [enabled]
  -mmovdir64b                       [disabled]
  -mmovdiri                         [disabled]
  -mmpx                             [disabled]
  -mms-bitfields                    [disabled]
  -mmusl                            [disabled]
  -mmwaitx                          [disabled]
  -mno-align-stringops              [disabled]
  -mno-default                      [disabled]
  -mno-fancy-math-387               [disabled]
  -mno-push-args                    [disabled]
  -mno-red-zone                     [disabled]
  -mno-sse4                         [disabled]
  -mnop-mcount                      [disabled]
  -momit-leaf-frame-pointer         [disabled]
  -mpc32                            [disabled]
  -mpc64                            [disabled]
  -mpc80                            [disabled]
  -mpclmul                          [enabled]
  -mpcommit                         [disabled]
  -mpconfig                         [disabled]
  -mpku                             [disabled]
  -mpopcnt                          [enabled]
  -mprefer-avx128                   -mprefer-vector-width=128
  -mprefer-vector-width=            none
  -mpreferred-stack-boundary=       0
  -mprefetchwt1                     [disabled]
  -mprfchw                          [enabled]
  -mptwrite                         [disabled]
  -mpush-args                       [enabled]
  -mrdpid                           [disabled]
  -mrdrnd                           [enabled]
  -mrdseed                          [enabled]
  -mrecip                           [disabled]
  -mrecip=                          
  -mrecord-mcount                   [disabled]
  -mrecord-return                   [disabled]
  -mred-zone                        [enabled]
  -mregparm=                        0
  -mrtd                             [disabled]
  -mrtm                             [enabled]
  -msahf                            [enabled]
  -msgx                             [disabled]
  -msha                             [disabled]
  -mshstk                           [disabled]
  -mskip-rax-setup                  [disabled]
  -msoft-float                      [disabled]
  -msse                             [enabled]
  -msse2                            [enabled]
  -msse2avx                         [disabled]
  -msse3                            [enabled]
  -msse4                            [enabled]
  -msse4.1                          [enabled]
  -msse4.2                          [enabled]
  -msse4a                           [disabled]
  -msse5                            -mavx
  -msseregparm                      [disabled]
  -mssse3                           [enabled]
  -mstack-arg-probe                 [disabled]
  -mstack-protector-guard-offset=   
  -mstack-protector-guard-reg=      
  -mstack-protector-guard-symbol=   
  -mstack-protector-guard=          tls
  -mstackrealign                    [disabled]
  -mstringop-strategy=              [default]
  -mstv                             [disabled]
  -mtbm                             [disabled]
  -mtls-dialect=                    gnu
  -mtls-direct-seg-refs             [enabled]
  -mtune-ctrl=                      
  -mtune=                           skylake-avx512
  -muclibc                          [disabled]
  -mvaes                            [disabled]
  -mveclibabi=                      [default]
  -mvect8-ret-in-mem                [disabled]
  -mvpclmulqdq                      [disabled]
  -mvzeroupper                      [disabled]
  -mwaitpkg                         [disabled]
  -mwbnoinvd                        [disabled]
  -mx32                             [disabled]
  -mxop                             [disabled]
  -mxsave                           [enabled]
  -mxsavec                          [enabled]
  -mxsaveopt                        [enabled]
  -mxsaves                          [enabled]

sudo perf stat -r 10 -a -B -e cycles:u,instructions:u ./stockfish-avx512 bench 16 1 > /dev/null

 Performance counter stats for 'system wide' (10 runs):

     7.984.399.248      cycles:u                                                      ( +-  2,88% )
    13.462.954.663      instructions:u            #    1,69  insn per cycle           ( +-  4,38% )

            1,8676 +- 0,0111 seconds time elapsed  ( +-  0,59% )

sudo perf stat -r 10 -a -B -e cycles:u,instructions:u ./stockfish-bmi2 bench 16 1 > /dev/null

 Performance counter stats for 'system wide' (10 runs):

     7.953.320.884      cycles:u                                                      ( +-  0,22% )
    13.670.898.836      instructions:u            #    1,72  insn per cycle           ( +-  0,08% )

           1,87500 +- 0,00787 seconds time elapsed  ( +-  0,42% )

sudo perf stat -r 10 -a -B -e cycles:u,instructions:u ./stockfish-avx512 bench 16 8 > /dev/null

 Performance counter stats for 'system wide' (10 runs):

    28.078.867.819      cycles:u                                                      ( +-  1,05% )
    41.849.523.695      instructions:u            #    1,49  insn per cycle           ( +-  1,19% )

           0,99885 +- 0,00802 seconds time elapsed  ( +-  0,80% )

sudo perf stat -r 10 -a -B -e cycles:u,instructions:u ./stockfish-bmi2 bench 16 8 > /dev/null

 Performance counter stats for 'system wide' (10 runs):

    28.146.907.933      cycles:u                                                      ( +-  1,28% )
    44.624.621.772      instructions:u            #    1,59  insn per cycle           ( +-  1,35% )

            1,0039 +- 0,0112 seconds time elapsed  ( +-  1,12% )

sudo perf stat -r 10 -a -B -e cycles:u,instructions:u ./stockfish-avx512 bench 16 16 > /dev/null

 Performance counter stats for 'system wide' (10 runs):

    60.870.155.365      cycles:u                                                      ( +-  1,83% )
    61.932.104.319      instructions:u            #    1,02  insn per cycle           ( +-  1,79% )

            1,1273 +- 0,0173 seconds time elapsed  ( +-  1,54% )

sudo perf stat -r 10 -a -B -e cycles:u,instructions:u ./stockfish-bmi2 bench 16 16 > /dev/null

 Performance counter stats for 'system wide' (10 runs):

    65.112.791.024      cycles:u                                                      ( +-  1,75% )
    68.054.836.804      instructions:u            #    1,05  insn per cycle           ( +-  1,72% )

            1,1835 +- 0,0172 seconds time elapsed  ( +-  1,45% )
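
To make the six runs above easier to compare, here is a small sketch that parses the "seconds time elapsed" lines (which use decimal commas) and reports how much less wall time the avx512 build needed. The function names are hypothetical, and elapsed time includes process startup, so this is only a coarse indicator.

```python
import re


def elapsed_seconds(perf_line):
    """Parse the mean from a perf 'seconds time elapsed' line with decimal commas."""
    match = re.search(r"([\d.,]+)\s+\+-\s+[\d.,]+\s+seconds time elapsed", perf_line)
    if match is None:
        raise ValueError("not a 'seconds time elapsed' line")
    # In this locale '.' groups thousands and ',' is the decimal separator.
    return float(match.group(1).replace(".", "").replace(",", "."))


def time_saving_percent(avx512, bmi2):
    """Percentage of the bmi2 wall time saved by the avx512 build."""
    return 100.0 * (bmi2 - avx512) / bmi2


# Elapsed-time lines copied from the perf output above: (avx512, bmi2) per thread count.
RESULTS = {
    1: ("1,8676 +- 0,0111 seconds time elapsed", "1,87500 +- 0,00787 seconds time elapsed"),
    8: ("0,99885 +- 0,00802 seconds time elapsed", "1,0039 +- 0,0112 seconds time elapsed"),
    16: ("1,1273 +- 0,0173 seconds time elapsed", "1,1835 +- 0,0172 seconds time elapsed"),
}

for threads, (avx512_line, bmi2_line) in RESULTS.items():
    avx512 = elapsed_seconds(avx512_line)
    bmi2 = elapsed_seconds(bmi2_line)
    saving = time_saving_percent(avx512, bmi2)
    print(f"{threads:2d} threads: avx512 {avx512:.4f}s, bmi2 {bmi2:.4f}s, saving {saving:.1f}%")
```

On these numbers the saving is roughly 0.4% with 1 thread, 0.5% with 8 threads, and 4.7% with 16 threads. The first two are smaller than the run-to-run spread perf reports, so only the 16-thread difference stands out.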

mstembera commented 4 years ago

I suspect that processors with very large core counts suffer from downclocking more severely than mainstream ones. https://stackoverflow.com/questions/56852812/simd-instructions-lowering-cpu-frequency shows a sample chart for a 14-core CPU. Unfortunately, the link to more CPUs doesn't have everything.

gvreuls commented 4 years ago

One thing I forgot to mention: my system downclocks AVX2 as well as AVX512 by the same 300 MHz. Strangely enough it runs stockfish much cooler in AVX512 mode than in AVX2 mode (the aforementioned 10C).

mstembera commented 4 years ago

So it's 300MHz regardless of the number of threads being used? Is that something you manually set in the BIOS?

vondele commented 4 years ago

Are you adding -march=native to the compiler options?

gvreuls commented 4 years ago

@mstembera It's an ASUS TUF board. When I run the automatic optimizer, the only thing it changes is the CPU Core Ratio, which it sets to "by Specific Core". This adds 300 MHz to the clock and reduces the AVX512 downclock from 500 MHz to 300 MHz. This isn't considered overclocking, BTW; the clock speed is still 200 MHz below the spec maximum.

@vondele I compiled them exactly as they would be compiled on fishtest. I can repeat the perf runs with -march=native if you want me to, but I doubt it will change much.

vondele commented 4 years ago

No need to add '-march=native'. It would have explained why an avx2 compile downclocks, if the compiler were adding some avx512 code elsewhere.

gvreuls commented 4 years ago

@mstembera Sorry if this annoys you. I ran that board optimizer quite some time ago, when I last updated the BIOS, and didn't bother to check exactly what it changed until just now, when I rebooted, set things back to default, and ran the optimizer again.

mstembera commented 4 years ago

I'm neither annoyed nor criticizing your setup, just trying to understand it better. Even though avx512 is really only relevant to Skylake (older CPUs don't support it and newer ones have vnni), we may have to accept that, depending on other factors, it may or may not be faster than avx2. I'm not sure how best to decide for fishtest.

ppigazzini commented 3 years ago

avx512 enabled by #861