llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org

Instruction scheduling model for the KNL #26792

Open hfinkel opened 8 years ago

hfinkel commented 8 years ago
Bugzilla Link 26418
Version trunk
OS All
Blocks llvm/llvm-project#31672
CC @legrosbuffle,@topperc,@davezarzycki,@delena,@gchatelet,@igor-breger,@RKSimon,@MattPD,@naromero77,@phoebewang,@vchuravy

Extended Description

We need to have a real instruction scheduling model for the KNL. Aside from anything else, it seems reasonable to switch the default to be more like Silvermont than like Haswell, plus there is a significant amount of information here (under the "MICROARCHITECTURE" heading):

https://software.intel.com/en-us/articles/what-disclosures-has-intel-made-about-knights-landing

plus from Avinash's HotChips presentation:

http://www.hotchips.org/wp-content/uploads/hc_archives/hc27/HC27.25-Tuesday-Epub/HC27.25.70-Processors-Epub/HC27.25.710-Knights-Landing-Sodani-Intel.pdf

legrosbuffle commented 2 years ago

mentioned in issue llvm/llvm-bugzilla-archive#48918

topperc commented 2 years ago

mentioned in issue llvm/llvm-bugzilla-archive#48001

RKSimon commented 2 years ago

mentioned in issue llvm/llvm-project#31672

RKSimon commented 3 years ago

CC'ing Nichols, who admins the flang-x86_64-knl-linux buildbot and might be another person who'd be able to get the llvm-exegesis runs done to help us get a model set up.

legrosbuffle commented 3 years ago

The floating-point issues have been fixed in https://reviews.llvm.org/D90592 (floating-point state needed saving/restoring).

phoebewang commented 3 years ago

"inexact floating point result" is a very common exception for floating point operation since the intermediate result always has higher precision that it can be stored. That's why we always mask it even when we want to handle floating point exceptions. I think you should clear mxcsr bit 12 to unmask this exception.

davezarzycki commented 3 years ago

I'm running this on Linux. Now that I know that FPE is handled too, I manually ignored the JIT FPE errors until the real FPE crash happened. I've never seen an "inexact floating point result" exception. Perhaps some benchmarking division is underflowing into a denormal?

(lldb) c
Process 6354 resuming
Process 6354 stopped and restarted: thread 1 received signal: SIGSEGV
Process 6354 stopped

Floating Point Registers:
fctrl = 0x0040
fstat = 0x0000
ftag = 0x0000
fop = 0x0000
fiseg = 0x00000000
fioff = 0x00000000
foseg = 0x00000000
fooff = 0x00000000
mxcsr = 0x00000020
mxcsrmask = 0x0000ffff
xmm0 = {0x00 0x00 0x00 0x00 0x00 0x88 0xc3 0x40 0x00 0x00 0x00 0x00 0x46 0x66 0x31 0x41}
xmm1 = {0x00 0x00 0x00 0x00 0x00 0x00 0x18 0x40 0x00 0x00 0x00 0x00 0x46 0x66 0x31 0x41}

legrosbuffle commented 3 years ago

I cannot reproduce. If I run:

llvm-exegesis -mode=uops -opcode-name=DIV16m

I don't get a crash, but the expected output (in the sense that we recover from the crash):


mode: uops
key:
  instructions:

I'm on Linux though. Are you by any chance on Windows? I don't know how signals work there; maybe the CrashRecoveryContext does not recover from FPE?

legrosbuffle commented 3 years ago

I don't think I've seen that crash before, and the CrashRecoveryContext is supposed to catch SIGFPE too:

static const int Signals[] =
    { SIGABRT, SIGBUS, SIGFPE, SIGILL, SIGSEGV, SIGTRAP };

I'll have a look.

davezarzycki commented 3 years ago

It's crashing with a SIGFPE, which is why I ran it under the debugger. Given your comments, I told LLDB to ignore SIGILL and SIGSEGV; and here is the first SIGFPE crash:

(lldb)

legrosbuffle commented 3 years ago

Internal SIGILLs are normal, as we just run every single instruction in a sandbox and catch the signals, so running in a debugger will likely show a lot of false positives. The tool itself should not crash, though.

davezarzycki commented 3 years ago

Thanks. As of fd1c064845e598387b33ad4f548fde141f44728e the uops test is still crashing part way through:

[dave@phi ~]$ lldb -- /p/llvm/bin//llvm-exegesis -mode=uops -opcode-index=-1
(lldb) target create "/p/llvm/bin//llvm-exegesis"
Current executable set to '/p/llvm/bin/llvm-exegesis' (x86_64).
(lldb) settings set -- target.run-args "-mode=uops" "-opcode-index=-1"
(lldb) process launch --stdout /dev/null --stderr /dev/null
Process 9283 launched: '/p/llvm/bin/llvm-exegesis' (x86_64)
Process 9283 stopped

(lldb) disassemble -s 0x00007ffff7fc7016
error: Failed to disassemble memory at 0x7ffff7fc7016.
(lldb) p/x (unsigned long)0x7ffff7fc7016
(unsigned long) $15 = 0x3737373737373737
(lldb) disassemble
JIT(0x6156870)`foo:
  0x7ffff7fc7000 <+0>:  movb $0x0, %al
  0x7ffff7fc7002 <+2>:  subq $0x8, %rsp
  0x7ffff7fc7006 <+6>:  movl $0x0, (%rsp)
  0x7ffff7fc700d <+13>: movl $0x0, 0x4(%rsp)
  0x7ffff7fc7015 <+21>: popfq

legrosbuffle commented 3 years ago

OK, the assert is fixed by https://reviews.llvm.org/rG24bf8faabd625c213e6275c7cd77d4883f564489

legrosbuffle commented 3 years ago

I'll have a look at the assert as soon as I get a chance.

davezarzycki commented 3 years ago

Back to the original topic, here is the KNL data after I forced -mcpu=knl. The latency test might be incomplete due to a crash in the tool:

https://znu.io/llvm-#26418/latency-stdout.txt
https://znu.io/llvm-#26418/latency-stderr.txt
https://znu.io/llvm-#26418/uops-stdout.txt
https://znu.io/llvm-#26418/uops-stderr.txt

I can rerun the tests as needed if there is missing data.

davezarzycki commented 3 years ago

I was already independently fixing/testing a fix like that. Yes, that fixes the auto-detection logic, and the uops run of llvm-exegesis now works without forcing the CPU type. The latency test is crashing up front, but that seems unrelated:

llvm-exegesis: /home/dave/s/l/llvm/tools/llvm-exegesis/lib/SnippetGenerator.cpp:198: void llvm::exegesis::setRegisterOperandValue(const llvm::exegesis::RegisterOperandAssignment &, llvm::exegesis::InstructionTemplate &): Assertion `AssignedValue.isReg() && AssignedValue.getReg() == ROV.Reg' failed.

RKSimon commented 3 years ago

That should look like this:

case 0x86:
  CPU = "tremont";
  *Type = X86::INTEL_TREMONT;
  break;

// Xeon Phi (Knights Landing + Knights Mill):
case 0x57:
  CPU = "knl";
  *Type = X86::INTEL_KNL;
  break;
case 0x85:
  CPU = "knm";
  *Type = X86::INTEL_KNM;
  break;

Please can you confirm that this fixes the issue and I'll push the change.

davezarzycki commented 3 years ago

Looks like "tremont" being reported might just be a typo made in ea84dc9500df383b4fe07199134033f358411e59 by Craig that hasn't been reported/fixed yet:

@@ -773,193 +793,140 @@ getIntelProcessorTypeAndSubtype(unsigned Family, unsigned Model,
       break;

     case 0x57:
-      *Type = X86::INTEL_KNL; // knl
+      CPU = "tremont";
+      *Type = X86::INTEL_KNL;
       break;
davezarzycki commented 3 years ago

Indeed. KNL being misidentified has happened before (see llvm/llvm-bugzilla-archive#36619 ). Last time, Craig was able to provide a quick fix.

legrosbuffle commented 3 years ago

This is the code I'm referring to:

def SLMPfmCounters : ProcPfmCounters {
  let CycleCounter = UnhaltedCoreCyclesPfmCounter;
  let UopsCounter = PfmCounter<"uops_retired:any">;
}
...
def : PfmCountersBinding<"tremont", SLMPfmCounters>;

def KnightPfmCounters : ProcPfmCounters {
  let CycleCounter = UnhaltedCoreCyclesPfmCounter;
  let UopsCounter = PfmCounter<"uops_retired:all">;
}
def : PfmCountersBinding<"knl", KnightPfmCounters>;
def : PfmCountersBinding<"knm", KnightPfmCounters>;

It looks like LLVM is detecting your CPU as "tremont" rather than "knl", so it's using SLMPfmCounters instead of KnightPfmCounters and labeling the data with "tremont".

legrosbuffle commented 3 years ago

Wait, actually the definition for KNL looks OK, but it looks like LLVM is detecting your CPU as "tremont".

legrosbuffle commented 3 years ago

Looks like a bad pfm event mapping. In libpfm, I see "uops_retired:all", but llvm/lib/Target/X86/X86PfmCounters.td has "uops_retired:any" for slm, so you might want to try to fix that (I don't have access to a KNL to test it).

davezarzycki commented 3 years ago

And the latency pass is weirdly labeling KNL as "tremont". For example:


mode: latency
key:
  instructions:

davezarzycki commented 3 years ago

Is libpfm-4.10.1 not new enough? The uops pass of llvm-exegesis is failing completely (with kernel 5.8.16 if it matters):

invalid event attribute - cannot create event uops_retired:any llvm-exegesis error: Unable to create counter with name 'uops_retired:any'

RKSimon commented 3 years ago

You will need to build/install a recent copy of perfmon:

https://sourceforge.net/p/perfmon2/libpfm4/ci/master/tree/

And then rebuild llvm-exegesis from scratch to ensure CMake sees that it's installed.

legrosbuffle commented 3 years ago

No, you should be good. Note that you might want to fork the new model from SLM though, to minimize diffs.

davezarzycki commented 3 years ago

Will the fact that LLVM considers KNL/KNM to be Haswell derived create any problems when using llvm-exegesis? I ask because KNL/KNM is actually Silvermont derived.

legrosbuffle commented 3 years ago

Instructions are here: https://llvm.org/docs/CommandGuide/llvm-exegesis.html

the tl;dr is that you can generate the data with:

llvm-exegesis -mode=latency -opcode-index=-1
llvm-exegesis -mode=uops -opcode-index=-1

Make sure to sync past 7e2ffe7a6358820c0f1511f3405d3fa8db4c46f4, as there was an issue with recent instructions.

Then, creating a full model from scratch is a significant piece of work. You can make this easier by starting with an existing model and using llvm-exegesis's analysis mode to fix the detected issues.

davezarzycki commented 3 years ago

In theory, sure. How much work does this involve? Is there a command or script I can run from a recent LLVM build to generate all of the required data?

RKSimon commented 3 years ago

@davezarzycki You mentioned on D89952 that you have access to a KNL box, would you be in a position to get llvm-exegesis running on it to see whether we'll be able to create a true scheduler model instead of just using the haswell model?

I'm not sure if libPFM has access to suitable pipe counters but uops/latency reports might already work.

llvmbot commented 5 years ago

Does anyone have access to KNL hardware that they could run llvm-exegesis uops/latency tests on?

This will require some hacking, as KNL still uses the Haswell model at the moment - maybe moving to the SLM model for the tests, along with the necessary perf counter mappings (see https://sourceforge.net/p/perfmon2/libpfm4/ci/master/tree/lib/events/intel_knl_events.h).

Unfortunately, the compiler explorer link I posted in the other bug was set to use KNL because I was also testing gcc 5.5 in the same window at some point, and it was the only "arch" option that 5.5 supported that had AVX-512 (I wanted to see vectorization with AVX-512 instructions because I have gcc 5.5 on a skylake-avx512 platform). Sorry

RKSimon commented 5 years ago

Does anyone have access to KNL hardware that they could run llvm-exegesis uops/latency tests on?

This will require some hacking as KNL still uses the Haswell model at the moment - maybe moving to the SLM model for the tests, along with the necessary perf counter mappings (see https://sourceforge.net/p/perfmon2/libpfm4/ci/master/tree/lib/events/intel_knl_events.h).

The perf counter mappings are now in place, so we now need someone with KNL hardware to run llvm-exegesis (with libpfm4) to get the raw schedule data.

RKSimon commented 6 years ago

Does anyone have access to KNL hardware that they could run llvm-exegesis uops/latency tests on?

This will require some hacking as KNL still uses the Haswell model at the moment - maybe moving to the SLM model for the tests, along with the necessary perf counter mappings (see https://sourceforge.net/p/perfmon2/libpfm4/ci/master/tree/lib/events/intel_knl_events.h).

hfinkel commented 8 years ago

The KNL is now covered in Intel's Architecture Optimization Manual (http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html), specifically chapter 16.

hfinkel commented 8 years ago

There's now a significant amount of published information:

Knights Landing: Second-Generation Intel Xeon Phi Product Avinash Sodani, et al. IEEE Micro. Vol. 36 (2). 2016

http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=7453080

Here are quotes from this article (and some of my commentary) relevant to instruction scheduling and some related issues...

"Front-end unit. The core’s FEU comprises a 32-Kbyte instruction cache (IL1) ... In case of a hit, the instruction cache can deliver up to 16 bytes per cycle. These bytes are then sent to a two-wide decoder."

It is important to note in this context that AVX-512 instructions can be up to 12 bytes long (if you have a memory operand with a non-compressed displacement). We'll need to take care not to generate displacements that require 4 bytes, and, I presume, otherwise try not to schedule long instructions next to each other.

"Most instructions are decoded into a single micro-op, ... Decoded micro-ops are placed into a 32-entry instruction queue."

"The allocation unit reads two micro-ops per cycle from the instruction queue. It assigns the necessary pipeline resources required by the micro-ops, such as reorder buffer (ROB) entries (72), rename buffer entries (72), store data buffers (16), gather-scatter table entries (4), and reservation station entries."

"After the allocation unit, the micro-ops are sent to one of three execution units—IEU, MEU, or VPU—depending on their opcode types. Some micro-ops could get sent to more than one execution unit. For example, an Add instruction that has a memory source will be sent to the MEU to read the memory source, and then to the IEU to execute the Add operation."

"Integer execution unit. The IEU executes integer micro-ops ... There are two IEUs in the core. Each IEU contains one 12-entry reservation station that issues one micro-op per cycle... Most operations have one-cycle latency and are supported by both IEUs. But a few have three- or five-cycle latency (for example, "multiplies") and are only supported by one of the IEUs."

"Memory execution unit. Up to two memory operations, either load or store, can be executed in the MEU in a given cycle. Memory operations are issued ... from the 12-entry memory reservation station. While stores are kept in the store buffer after address translation, they can forward data to dependent loads. Stores are committed to memory in the program order, one per cycle."

"L1 data cache (DL1) supports two simultaneous 512-bit reads and one 512- bit write, with a load-to-use latency of four cycles for integer loads and five cycles for floating-point loads."

"The MEU supports unaligned memory accesses without any penalties and supports accesses that split into two cache lines with a two-cycle penalty"

"Vector processing unit... provides support for x87, MMX, Streaming SIMD Extensions (SSE), AVX, and AVX-512 instructions, as well as integer divides. Two VPUs are connected to the core... the allocation unit dispatching instructions directly into the VPUs... The VPUs are mostly symmetrical, and each can provide a steadystate throughput of one AVX-512 instruction per cycle,... One of the VPUs is extended to provide support for the legacy floating-point instructions, such as x87, MMX, and a subset of byte and word SSE instructions."

"Each VPU contains a 20-entry floating-point reservation station that issues one micro-op per cycle out of order. The floating-point reservation station differs from the IEU and MEU reservation stations in that it does not hold source data ...; the floating-point micro-ops read their source data from the floating-point rename buffer and the floating-point register file after they issue from the floating-point reservation station, spending an extra cycle between the reservation station and execution compared to integer and memory micro-ops. Most floating-point arithmetic operations have a latency of six cycles, whereas the rest of the operations have a latency of two or three cycles, depending on the operation type."

"On a KNL tile, the two cores share a... L2 cache. The BIU also contains an L2 hardware prefetcher that is trained on requests coming from the cores. It supports up to 48 independent prefetch streams... A KNL core supports up to four hardware contexts"

This fact is relevant for cost modeling for higher-level loop transformations (e.g. loop fusion and fission). We want to keep relevant streaming operations below the hardware limit of (48 / 2 (cores per tile) / 4 (threads per core) == 6 per thread).

jwake commented 1 year ago

I'm interested in picking up on this - I've got access to a significant quantity of KNLs through work and should have some time available to poke at them.

I'd actually made a bit of a start before spotting this issue - cloning the Silvermont scheduler model, adding in basic guesses at the pipeline elements and AVX512 opcodes from Agner Fog's instruction tables and the Intel optimisation manuals.

I've completed latency/uops runs of llvm-exegesis and I'll start resolving the inconsistencies between the guesses and the measurements soon.

RKSimon commented 1 year ago

@jwake That's awesome - is there any chance that you could make your exegesis captures available? I'm assuming you hit the same issues as with the SLM model that KNL doesn't have good PMCs to track pipe usage?

jwake commented 1 year ago

Sure - I've tarred them up here: https://drive.google.com/file/d/1OfPWpR9Jt6ZcBJsH0KM1LGfHZrqy4J_M/view?usp=share_link

I hadn't gotten quite as far as examining PMCs yet. I've not spent much time hacking on LLVM (barely any, really), so I've been using the KNL codegen issues I've been seeing compared to now-unsupported versions of ICC (amongst others, the loop vectoriser at -O3 sometimes makes decisions that the KNL does not like, at all) as an excuse to start learning it.

RKSimon commented 1 year ago

Cheers - getting the uops counts to match exegesis is usually the best first stage, then throughput (although there are always some instructions that are poor matches) - latency is always the hardest to match.

A lot of KNL codegen issues I imagine won't be addressed through the model, but better tuning flags - for instance KNL hates SSE style shift-by-scalar instructions.

RKSimon commented 1 year ago

@jwake Have you made any more progress on modeling KNL from exegesis captures, or is there anything I can do to help?

RKSimon commented 1 year ago

@jwake Would it be possible to get hold of your WIP model and exegesis captures? I'll see if I can get this finished.