Open lukego opened 9 years ago
your link to pmu-tools doesn't work
@nickdesaulniers good catch! fixed.
Holy shit! The CPU is actually executing the entire loop - all five instructions - in only one cycle. I am impressed.
Fun fact: the Mill processor is capable of executing over 30 instructions per cycle, each cycle, in general purpose workloads. Sure, it's not a shipping product just yet, but an interesting architecture for the future indeed.
+1 thanks for this, very interesting stuff
Hi there!
Here is a simple exercise to connect the theory and practice of tracing JITs and modern Intel microarchitectures. I write a small example program, see how LuaJIT compiles it to a trace, and then see how a Haswell CPU executes it. This follows on from #5 and #3 respectively.
Tracing JIT
The program is trivially simple: it uses the Snabb Switch `counter` module to create a counter object and then increment that one billion times. Snabb Switch counters are represented as binary files on disk that each contain one 64-bit number (each file is 8 bytes). The reason we allocate counters on the file system is to make them directly available to diagnostic programs that are tracking network packets processed, packets dropped, and so on. The way we actually access them in Lua code is by mapping them into memory with `mmap()` and then accessing them directly as FFI `uint64_t *` values. (See the `shm` module for our cute little API to allocate arbitrary C data types as named shared memory objects.)

Here is the code:
I run this using snsh (snabb shell, a LuaJIT frontend) with JIT trace dumping enabled:
which outputs a full dump (bytecode, intermediate representation, and x86 machine code) from which we can look at the machine code for the loop that will execute one billion times:
There we see that LuaJIT has compiled the loop body down to five instructions:
This seems pretty nice actually: according to the semantics of Lua the call to `counter.add()` is actually a hashtable lookup and a function call but LuaJIT has been able to optimize this away and inline the call into two instructions. (Hat tip to Mike Pall and his very impressive brain.)

So that is what the tracing JIT does!
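As an aside, the file-backed counter idea described earlier can be sketched outside of Snabb Switch. This is a minimal illustration in Python, not the Snabb Switch `counter` or `shm` implementation: the function names `open_counter`, `add`, and `read` are hypothetical, and the little-endian layout is an assumption that matches x86. It only shows the general technique of mapping an 8-byte file into memory and updating the 64-bit value in place so that another process mapping the same file can observe it.

```python
import mmap
import os
import struct

COUNTER_SIZE = 8                 # one unsigned 64-bit value per file
U64_MASK = 0xFFFFFFFFFFFFFFFF    # emulate uint64_t wraparound


def open_counter(path):
    """Create the 8-byte file if needed and map it into memory (hypothetical helper)."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        os.ftruncate(fd, COUNTER_SIZE)      # keeps existing contents, pads with zeros
        return mmap.mmap(fd, COUNTER_SIZE)  # shared, read/write mapping by default
    finally:
        os.close(fd)                        # the mapping stays valid after close


def add(counter, n=1):
    """Add n to the mapped counter in place (assumes little-endian, as on x86)."""
    value = struct.unpack_from("<Q", counter, 0)[0]
    struct.pack_into("<Q", counter, 0, (value + n) & U64_MASK)


def read(counter):
    """Return the current counter value from the mapping."""
    return struct.unpack_from("<Q", counter, 0)[0]
```

A diagnostic program could call `open_counter()` on the same path and poll `read()` while the main program keeps calling `add()`. In Python every `add()` goes through the interpreter, so this says nothing about the machine code LuaJIT generates for the real thing.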
Haswell CPU
Now what does the Haswell CPU do with this?
First the theory: we can refer to the excellent AnandTech article to see how each Haswell CPU core works:
The CPU takes in a large number of x86 instructions, JITs them all into internal Haswell micro-instructions, figures out their interdependencies, and schedules them for parallel execution across eight independent execution units. (This is a sophisticated piece of technology.)
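To make the scheduling idea concrete, here is a toy model in Python. It is a sketch under strong simplifying assumptions, not a description of how Haswell is implemented: every operation takes exactly one cycle, any unit can run any operation, and the operation names are hypothetical placeholders rather than the instructions from the trace. It only illustrates the general principle that operations whose inputs are ready can be issued together, up to the number of execution units.

```python
def schedule(ops, units=8):
    """Greedy dataflow scheduling over a dependency graph.

    ops maps an operation name to the list of names it depends on.
    Each cycle, up to `units` operations whose dependencies completed
    in an earlier cycle are issued. Returns {name: cycle issued}.
    """
    done = {}
    pending = dict(ops)
    cycle = 0
    while pending:
        # Ready set is computed before issuing, so operations issued in
        # the same cycle cannot satisfy each other's dependencies.
        ready = [name for name, deps in pending.items()
                 if all(dep in done for dep in deps)]
        if not ready:
            raise ValueError("dependency cycle or unknown dependency")
        for name in ready[:units]:
            done[name] = cycle
            del pending[name]
        cycle += 1
    return done


def cycles_used(ops, units=8):
    """Number of cycles the toy scheduler needs for all operations."""
    return max(schedule(ops, units).values()) + 1


# Hypothetical five-operation example: c needs a, d needs a and b.
EXAMPLE_OPS = {"a": [], "b": [], "c": ["a"], "d": ["a", "b"], "e": []}
```

With eight units the example needs two cycles, because `c` and `d` must wait for their inputs. With a single unit it needs five, one per operation. Real hardware also overlaps successive loop iterations, which this toy does not model.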
To connect this with practice we will use the `ocperf.py` program from pmu-tools to access some CPU performance counters. Performance counters give us visibility into the internal workings of the CPU: a modern Xeon exports a lot of diagnostic information and is very far from a black box.

I test with a Xeon E5-2620 v3 and this command:
So what does this mean?
- `instructions`. This makes sense because we counted five instructions in the loop body and we chose an iteration count of one billion.
- `cycles`. Holy shit! The CPU is actually executing the entire loop - all five instructions - in only one cycle. I am impressed.

Cool stuff!
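The arithmetic behind those two readings can be spelled out in a few lines of Python. The totals used here are the idealized round figures implied by the prose (five instructions per iteration, one billion iterations, one cycle per iteration), not measured counter values, and the function names are made up for illustration.

```python
ITERATIONS = 1_000_000_000   # loop count chosen in the example program
LOOP_INSTRUCTIONS = 5        # instructions counted in the compiled loop body


def per_iteration(total, iterations=ITERATIONS):
    """Average count per loop iteration."""
    return total / iterations


def instructions_per_cycle(instructions, cycles):
    """IPC: how many instructions retire per clock cycle on average."""
    return instructions / cycles


# Idealized totals implied by the text (assumed, not measured).
ideal_instructions = LOOP_INSTRUCTIONS * ITERATIONS
ideal_cycles = 1 * ITERATIONS

print(per_iteration(ideal_instructions))                          # instructions per iteration
print(per_iteration(ideal_cycles))                                # cycles per iteration
print(instructions_per_cycle(ideal_instructions, ideal_cycles))   # IPC
```

Feeding real `instructions` and `cycles` totals into the same functions gives slightly different ratios, since a real run also counts the instructions and cycles spent outside the hot loop.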
The end
This is the level of visibility that I want to have into the programs I am working on. I am quite satisfied with this example. Now what I want to do is make it easy for Snabb Switch hackers to get this level of visibility into the practical code that they are working on.