AVX-512 Broken? - Githubissues

kmcdermo commented 6 years ago

For some reason, even with all the latest fixes protecting against NaNs, we see a serious loss of hits/track on phi2 (KNL) with AVX-512 enabled (at least with max nThreads): https://kmcdermo.web.cern.ch/kmcdermo/catching-nans-4-pt2/PlotsFromDump/CMSSW_TTbar_PU70_CE_nHits.png

This is true for BH, STD, CE. This most likely explains the enormous vectorization speedup on phi2: https://kmcdermo.web.cern.ch/kmcdermo/catching-nans-4-pt2/Benchmarks/KNL_CMSSW_TTbar_PU70_VU_speedup.png

A first test would be to make the same plot with nTh=1, to isolate multithreading from AVX-512, and perhaps making the same plot for nVU=2,4,8 nTh=1.

If I remember correctly, @slava77 reported seeing lots of new NaNs with max vectorization previously that @osschar did not see, but perhaps this was because we focused efforts on phiphi and not phi2.

kmcdermo commented 6 years ago

So, I ran some tests on SKL-Ag (Cornell Silver, lnx4108), SKL-Au (UCSD Gold, phi3), and KNL (phi2), and it appears it is truly that running at full vector width is broken across the board.

I ran the following tests on each platform:

nVU=1, nTH=1
nVU=2, nTH=1
nVU=4, nTH=1
nVU=8, nTH=1
nVU=16, nTH=1
nVU=16+intrinsics, nTH=1
nVU=16+intrinsics, nTH=$max_th

Where $max_th is:

SKL-Ag: 48
SKL-Au: 64
KNL: 256

And in all three nVU=16 tests... the number of hits is drastically different compared to the other vector widths for all platforms.

knl_cmssw_ttbar_pu70_ce_nhits skl-ag_cmssw_ttbar_pu70_ce_nhits skl-au_cmssw_ttbar_pu70_ce_nhits

IHateLinus commented 6 years ago

Hi Kevin, I think you said in a mtg that this is a “new development” and that the timing and performance numbers we have been talking about (e.g. ~80Hz for ttbar+70PU on fully loaded KNL), were derived when this hit distribution was “healthy”

Can you please confirm and attach those plots so we establish a baseline and also make sure wr are not fulling ourselves (and others) by those earlier values…

Thanks, Avi.

On May 7, 2018, at 9:34 PM, Kevin McDermott notifications@github.com wrote:

So, I ran some tests on SKL-Ag (Cornell Silver, lnx4108), SKL-Au (UCSD Gold, phi3), and KNL (phi2), and it appears it is truly that running at full vector width is broken across the board.

I ran the following tests on each platform:

nVU=1, nTH=1 nVU=1, nTH=1 https://user-images.githubusercontent.com/7645989/39738149-7be8b5d4-5257-11e8-9207-42603fb950fd.png https://user-images.githubusercontent.com/7645989/39738150-7bfae18c-5257-11e8-88de-f72ae0c268d3.png https://user-images.githubusercontent.com/7645989/39738151-7c1331d8-5257-11e8-9a56-90938b8aee3f.png — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/cerati/mictest/issues/139#issuecomment-387281590, or mute the thread https://github.com/notifications/unsubscribe-auth/AEdT8UGCaJjttyHyYk19Na8GMpAnajz1ks5twSBugaJpZM4T1sEj.

kmcdermo commented 6 years ago

@IHateLinus , yes, to be sure, this does not affect the numbers @slava77 reported in his email from March 19 (which was using PR135).

The plots for pr135 are here: https://kmcdermo.web.cern.ch/kmcdermo/pr135/

With the relevant plot here:

So as you can see, we are "safe" for Slava's numbers at 80Hz. We can rest a bit easy in this regard.

kmcdermo commented 6 years ago

(and because I was tired of looking for this email I am posting it here as a pdf dump) fully_loaded_knl.pdf

osschar commented 6 years ago

Thanks Kevin! Good to know it's not only intrinsics that are broken. I'm looking into this now.

osschar commented 6 years ago

OK, see my notes below, the problem is xCOMMON-AVX512, using xCORE solves the problem.

I told y'all that I haven't changed anything that could warrant changes in vectorization performance :)

I'll leave it to original committer to figure out the correct fix ;)

Debugging VU=16 performance problem

Working on phi3.

mkFit/mkFit --cmssw-n2seeds --build-ce --geom CMS-2017 --start-event 1 --num-events 1 --quality-val --input-file /data2/slava77/samples/2017/pass-4874f28/initialStep/PU70/10224.0_TTbar_13+TTbar_13TeV_TuneCUETP8M1_2017PU_GenSimFullINPUT+DigiFullPU_2017PU+RecoFullPU_2017PU+HARVESTFullPU_2017PU/memoryFile.fv3.clean.writeAll.recT.072617.bin

gcc-7, default opts:

Read complete, 9362 simtracks on file. Building tracks with 'runBuildingTestPlexCloneEngine', total simtracks=9362 found tracks=728 in pT 10%=680 in pT 20%=723 no_mc_assoc=470 nH >= 80% =707 in pT 10%=659 in pT 20%=702

icc, default opts (-O2 for ConformalFitter); or -O2 for all

Read complete, 9362 simtracks on file. Building tracks with 'runBuildingTestPlexCloneEngine', total simtracks=9362 found tracks=24 in pT 10%=19 in pT 20%=22 no_mc_assoc=1174 nH >= 80% =3 in pT 10%=0 in pT 20%=2

icc with -xCORE-AVX512 (instead of xCOMMON)

Read complete, 9362 simtracks on file. Building tracks with 'runBuildingTestPlexCloneEngine', total simtracks=9362 found tracks=728 in pT 10%=680 in pT 20%=723 no_mc_assoc=470 nH >= 80% =707 in pT 10%=659 in pT 20%=702

srlantz commented 6 years ago

Cool discovery--or rather, cool clue. Obviously we wouldn't want the choice of instruction set to change the results like this.

It may simply be the case that the CORE option is letting us avoid a compiler bug that is only present in the COMMON path. Did you try giving the COMMON option to different compiler versions, perhaps on other machines? Because if that works, then for sure we're looking at a compiler bug.

Otherwise... I'm not sure whether it helps, but in case it does, the difference between -xCOMMON-AVX512 and -xCORE-AVX512 is that the latter adds the following groups of instructions: *

· AVX512BW - Byte and word (8 and 16-bit) operations that enhance integer operations. (They make use of all 64 bits in the mask registers, too.)

· AVX512DQ - Doubleword and quadword (32 and 64-bit) operations that enhance integer and floating-point operations.

· AVX512VL - Vector Length Extensions. They provide for most AVX-512 instructions to operate on 128 or 256 bits, instead of only 512. The use of Vector Length Extensions extends most AVX-512 operations to also operate on XMM (128-bit, SSE) registers and YMM (256-bit, AVX) registers. It also permits access to registers 16..31, to be applied to XMM and YMM registers instead of only to ZMM registers.

Thus with the CORE flag the compiler will vectorize loops that involve more types such as int and long long; more operations such as bit manipulations; and shorter vectors.

Steve

*Abridged descriptions from https://software.intel.com/en-us/blogs/additional-avx-512-instructions

From: Matevž Tadel [mailto:notifications@github.com] Sent: Tuesday, May 08, 2018 12:38 PM To: cerati/mictest Cc: Subscribed Subject: Re: [cerati/mictest] AVX-512 Broken? (#139)

OK, see my notes below, the problem is xCOMMON-AVX512, using xCORE solves the problem.

I told y'all that I haven't changed anything that could warrant changes in vectorization performance :)

I'll leave it to original committer to figure out the correct fix ;)

Debugging VU=16 performance problem

Working on phi3.

mkFit/mkFit --cmssw-n2seeds --build-ce --geom CMS-2017 --start-event 1 --num-events 1 --quality-val --input-file /data2/slava77/samples/2017/pass-4874f28/initialStep/PU70/10224.0_TTbar_13+TTbar_13TeV_TuneCUETP8M1_2017PU_GenSimFullINPUT+DigiFullPU_2017PU+RecoFullPU_2017PU+HARVESTFullPU_2017PU/memoryFile.fv3.clean.writeAll.recT.072617.bin

gcc-7, default opts:

Read complete, 9362 simtracks on file. Building tracks with 'runBuildingTestPlexCloneEngine', total simtracks=9362 found tracks=728 in pT 10%=680 in pT 20%=723 no_mc_assoc=470 nH >= 80% =707 in pT 10%=659 in pT 20%=702

icc, default opts (-O2 for ConformalFitter); or -O2 for all

Read complete, 9362 simtracks on file. Building tracks with 'runBuildingTestPlexCloneEngine', total simtracks=9362 found tracks=24 in pT 10%=19 in pT 20%=22 no_mc_assoc=1174 nH >= 80% =3 in pT 10%=0 in pT 20%=2

icc with -xCORE-AVX512 (instead of xCOMMON)

Read complete, 9362 simtracks on file. Building tracks with 'runBuildingTestPlexCloneEngine', total simtracks=9362 found tracks=728 in pT 10%=680 in pT 20%=723 no_mc_assoc=470 nH >= 80% =707 in pT 10%=659 in pT 20%=702

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHubhttps://github.com/cerati/mictest/issues/139#issuecomment-387464786, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AHooysOpfvVKLtB2A6u2JOGIDbKbhjCEks5twcoEgaJpZM4T1sEj.

osschar commented 6 years ago

Kevin ran on phi2 (18.0.1), phi3 (18.0.2) and Cornell SKL (???) - all had the same issue.

I remember getting a mail that a new intel compiler version is available, let me check. Oh, it says 2017, update 7.

I think you're right and we're looking at a compiler bug/issue here.

dan131riley commented 6 years ago

Weird. '-xCOMMON-AVX512' was fine before your changes, so it's the combination of our changes. Appears to be something in MkFinder.

dan131riley commented 6 years ago

I guess the expedient thing to do is '-xHost'. The compiler does produce a lot of different instruction sequences with common vs. core, especially for the copy routines...consider this, which looks like an example of what Steve was talking about...common first, not vectorized:

        movl      (%rax,%r8,8), %r11d                           #98.24
        movl      %r11d, (%r10,%r9)                             #98.10
        movl      4(%rax,%r8,8), %r11d                          #98.24
        incq      %r8                                           #96.7
        movl      %r11d, 64(%r10,%r9)                           #98.10
        addq      $128, %r10                                    #96.7
        cmpq      %rdi, %r8                                     #96.7
        jb        ..B1.10       # Prob 64%                      #96.7

vs. core:

        vpcmpeqb  %xmm0, %xmm0, %k1                             #98.10
        lea       (%rdi,%r11), %r14                             #98.10
        vpcmpeqb  %xmm0, %xmm0, %k2                             #98.10
        vmovups   (%rax,%r9,4), %ymm1                           #98.24
        vmovups   32(%rax,%r9,4), %ymm2                         #98.24
        vscatterdps %ymm1, (%r14,%ymm0,8){%k1}                  #98.10
        vscatterdps %ymm2, 512(%r14,%ymm0,8){%k2}               #98.10
        addq      $16, %r9                                      #96.7
        addq      $1024, %r11                                   #96.7
        cmpq      %r10, %r9                                     #96.7
        jb        ..B1.10       # Prob 82%                      #96.7

osschar commented 6 years ago

VEC_ICC := -xHost works on phi3. What do we do? Make the change and ask Kevin nicely to run the tests again?

dan131riley commented 6 years ago

It'll be in my pull request adding phi3 to the benchmarks, coming soon.

srlantz commented 6 years ago

Don't the benchmark scripts do all their compiling on one machine and ship binaries around to be run on other machines. Or did that change? If I'm right, -xHost wouldn't be compatible with how our benchmarks get run.

From: Dan Riley [mailto:notifications@github.com] Sent: Tuesday, May 08, 2018 3:25 PM To: cerati/mictest Cc: Steve Lantz; Comment Subject: Re: [cerati/mictest] AVX-512 Broken? (#139)

It'll be in my pull request adding phi3 to the benchmarks, coming soon.

— You are receiving this because you commented. Reply to this email directly, view it on GitHubhttps://github.com/cerati/mictest/issues/139#issuecomment-387514460, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AHooytsuFqiUS3vcdyjRv7s5-MyUPiSlks5twfESgaJpZM4T1sEj.

osschar commented 6 years ago

I think Kevin changed that to build on each machine. And we dropped KNC.

pwittich commented 6 years ago

we need to find someone besides Kevin to run these tests -- Kevin will graduate not too far in the future.

IHateLinus commented 6 years ago

that is very sad. Cant you pit an iron ball on his leg?

On May 8, 2018, at 12:32 PM, Peter Wittich notifications@github.com wrote:

we need to find someone besides Kevin to run these tests -- Kevin will graduate not too far in the future.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/cerati/mictest/issues/139#issuecomment-387516592, or mute the thread https://github.com/notifications/unsubscribe-auth/AEdT8U6O_VslRh-27HSE01IhDIjEqksAks5twfLqgaJpZM4T1sEj.

kmcdermo commented 6 years ago

VEC_ICC := -xHost works on phi3. What do we do? Make the change and ask Kevin nicely to run the tests again?

I can do this for sure, but let's wait to see what Dan comes back with for the new benchmarking scripts :), since he is addressing this issue. The plots above were a one-off and overly pedantic for general benchmarking, but can be made easily again in case of need for debugging -- maybe I will add them to a dedicated script...

Don't the benchmark scripts do all their compiling on one machine and ship binaries around to be run on other machines.

Matevz is correct: once we dropped KNC, we tar up the working directory and ship it to the native platform and compile it there.

This is the case for phiphi (SNB), where the benchmarks are launched, and also shipped to phi2 (KNL). With phi3/lnx4108, we could tar and compile on each machine, since they both have the intel compiler. Let's see what Dan has in the new benchmarking scripts for the skylakes :).

Cant you pit an iron ball on his leg?

Don't know if I can get through airport security with it though!

dan131riley commented 6 years ago

-xHost works fine for phi3, but phi2 still hits the bug!

cmssw_ttbar_pu70_ce_nhits

dan131riley commented 6 years ago

This seems to work around the problem:

diff --git a/mkFit/MkFinder.cc b/mkFit/MkFinder.cc
index 29e0ae7..7376935 100644
--- a/mkFit/MkFinder.cc
+++ b/mkFit/MkFinder.cc
@@ -228,7 +228,7 @@ void MkFinder::SelectHitIndices(const LayerOfHits &layer_of_
     }

     dphi = std::min(std::abs(dphi), L.max_dphi());
-    dq   = std::min(std::max(dq, L.min_dq()), L.max_dq());
+    dq   = clamp(dq, L.min_dq(), L.max_dq());

     qv[itrack] = q;
     phiv[itrack] = phi;

Apparently with common-avx512 or mic-avx512 the nested min/max gets vectorized incorrectly. I looked at the assembly, but there were enough differences that I couldn't identify where it went wrong.

dan131riley commented 6 years ago

kmcdermo commented 6 years ago

Are there other places that use min/max together that we should use std::clamp instead?

Also, should we replace clamp with std::clamp?

osschar commented 6 years ago

Yay, good catch! Thanks! :)

dan131riley commented 6 years ago

@kmcdermo When I added clamp() I looked for places it could be used, dunno if I got all of them. When I reviewed #137 I had even thought of flagging this case, but put it off! std::clamp is answered in #140

trackreco / mkFit

AVX-512 Broken? #139

Debugging VU=16 performance problem