Luke Gorrie's blog

Tracing JITs and modern CPUs: Part 2 #6

Open lukego opened 9 years ago

lukego commented 9 years ago

Hi there!

Here is a simple exercise to connect the theory and practice of tracing JITs and modern Intel microarchitectures. I write a small example program, see how LuaJIT compiles it to a trace, and then see how a Haswell CPU executes it. This follows on from #5 and #3 respectively.

Tracing JIT

The program is trivially simple: it uses the Snabb Switch counter module to create a counter object and then increment it one billion times. Snabb Switch counters are represented as binary files on disk that each contain one 64-bit number (each file is 8 bytes). The reason we allocate counters on the file system is to make them directly available to diagnostic programs that are tracking network packets processed, packets dropped, and so on. In Lua code we access them by mapping them into memory with mmap() and then manipulating them directly as FFI uint64_t * values. (See the shm module for our cute little API to allocate arbitrary C data types as named shared memory objects.)

Here is the code:

local counter = require("core.counter")

local n = 1e9
local c = counter.open("test")
for i = 1,n do
   counter.add(c, 1)
end
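The file-backed counter mechanism described above can be sketched in other languages too. Here is a minimal Python version (the file path and `add` helper are my own invention, not Snabb's shm API): an 8-byte file mapped into memory and bumped in place as a little-endian uint64:

```python
import mmap
import os
import struct
import tempfile

# Back one 64-bit counter with an 8-byte file, mirroring the idea of
# Snabb counters as files on disk. The path here is hypothetical.
path = os.path.join(tempfile.mkdtemp(), "test.counter")
with open(path, "wb") as f:
    f.write(b"\x00" * 8)

f = open(path, "r+b")
view = mmap.mmap(f.fileno(), 8)  # shared mapping: other processes can read it

def add(view, n=1):
    """Increment the mapped counter in place, like counter.add(c, 1)."""
    (value,) = struct.unpack_from("<Q", view, 0)
    struct.pack_into("<Q", view, 0, value + n)

for _ in range(1000):
    add(view)

(value,) = struct.unpack_from("<Q", view, 0)
print(value)  # → 1000
```

Because the mapping is shared, a diagnostic process that maps the same file sees every increment immediately, which is the whole point of putting counters on the file system.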

I run this using snsh (snabb shell, a LuaJIT frontend) with JIT trace dumping enabled:

# ./snabb snsh -jdump script.lua

which outputs a full dump (bytecode, intermediate representation, and x86 machine code) from which we can look at the machine code for the loop that will execute one billion times:

->LOOP:
0bcaf790  add rbp, +0x01
0bcaf794  mov [rcx], rbp
0bcaf797  add ebx, +0x01
0bcaf79a  cmp ebx, eax
0bcaf79c  jle 0x0bcaf790    ->LOOP

There we see that LuaJIT has compiled the loop body down to five instructions:

  1. Bump counter value in register.
  2. Store counter value to memory.
  3. Bump loop iteration counter.
  4. Check for loop termination.
  5. Branch back to start of loop.

This seems pretty nice, actually: according to the semantics of Lua, the call to counter.add() is a hashtable lookup followed by a function call, but LuaJIT has been able to optimize all of that away and inline the operation into two instructions. (Hat tip to Mike Pall and his very impressive brain.)

So that is what the tracing JIT does!

Haswell CPU

Now what does the Haswell CPU do with this?

First the theory: we can refer to the excellent AnandTech article to see how each Haswell CPU core works:

haswell microarchitecture

The CPU takes in a large number of x86 instructions, JITs them all into internal micro-operations (µops), figures out their interdependencies, and schedules them for parallel execution across eight independent execution ports. (This is a sophisticated piece of technology.)
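To connect this port-scheduling idea with the five-instruction loop above, here is a toy sketch. The port capabilities (ALU on ports 0/1/5/6, store-address on 2/3/7, store-data only on 4, the fused compare-and-branch on 6) follow commonly published Haswell port tables, but the µop breakdown and the greedy scheduler are my own simplification, not the real hardware algorithm:

```python
# Toy model: each µop of one loop iteration names the Haswell ports
# that can execute it. The store splits into store-data + store-address.
PORTS = {
    "add rbp, 1":  {0, 1, 5, 6},  # simple ALU op
    "store [rcx]": {4},           # store-data runs only on port 4
    "store-addr":  {2, 3, 7},     # address generation for the store
    "add ebx, 1":  {0, 1, 5, 6},  # loop counter bump
    "cmp+jle":     {6},           # fused compare-and-branch
}

def schedule_one_cycle(uops):
    """Greedily assign each µop to a free port; None if they don't all fit."""
    free = set(range(8))
    assignment = {}
    # Most-constrained µops first, so single-port µops get their port.
    for name, ports in sorted(uops.items(), key=lambda kv: len(kv[1])):
        usable = ports & free
        if not usable:
            return None  # this µop would have to wait for the next cycle
        port = min(usable)
        assignment[name] = port
        free.remove(port)
    return assignment

print(schedule_one_cycle(PORTS))
```

If every µop of an iteration finds a free port, one whole iteration can issue per cycle, which is what makes the measurement further down plausible.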

To connect this with practice we will use the ocperf.py program from pmu-tools to access some CPU performance counters. Performance counters give us visibility into the internal workings of the CPU: a modern Xeon exports a lot of diagnostic information and is very far from a black box.

I test with a Xeon E5-2620 v3 and this command:

# events=instructions,cycles,branches,branch-misses,L1-dcache-stores,L1-dcache-store-misses,uops_executed_port.port_0,uops_executed_port.port_1,uops_executed_port.port_2,uops_executed_port.port_3,uops_executed_port.port_4,uops_executed_port.port_5,uops_executed_port.port_6,uops_executed_port.port_7

# ocperf.py stat -e $events ./snabb snsh script.lua

 Performance counter stats for './snabb snsh script.lua':

     4,943,623,915      instructions              #    4.93  insns per cycle         [36.34%]
     1,002,132,062      cycles                    [36.89%]
       972,186,713      branches                                                     [37.44%]
           465,760      branch-misses             #    0.05% of all branches         [38.01%]
       964,577,367      L1-dcache-stores                                             [38.09%]
           193,800      L1-dcache-store-misses                                       [29.38%]
       667,517,657      uops_executed_port_port_0                                    [28.52%]
       672,651,709      uops_executed_port_port_1                                    [27.66%]
       434,145,210      uops_executed_port_port_2                                    [27.56%]
       367,160,760      uops_executed_port_port_3                                    [29.60%]
       962,831,791      uops_executed_port_port_4                                    [29.33%]
       651,043,422      uops_executed_port_port_5                                    [29.07%]
       989,614,752      uops_executed_port_port_6                                    [28.77%]
       215,443,552      uops_executed_port_port_7                                    [28.52%]

       0.465658029 seconds time elapsed

So what does this mean?

  1. The loop executed around 5 billion instructions. This makes sense because we counted five instructions in the loop body and we chose an iteration count of one billion.
  2. The loop executed in 1 billion cycles. Holy shit! The CPU is actually executing the entire loop - all five instructions - in only one cycle. I am impressed.
  3. There were a billion branches but the CPU predicted them all correctly.
  4. There were a billion memory stores but the CPU made them all hit the L1 cache.
  5. The Haswell execution ports 4 and 6 were used continuously and the CPU spread the rest of the load across ports 0, 1, 2, 3, 5, and 7. I can see why port 4 would need to be used continuously, because it is the only execution port capable of Store Data, but that is the limit of my interpretation.
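The headline ratios in these points can be double-checked with quick arithmetic on the raw counters. A sanity-check sketch, with values copied verbatim from the perf output above (perf multiplexes the counters, shown by the bracketed percentages, so the ratios are approximate):

```python
# Figures from the ocperf.py output above.
instructions = 4_943_623_915
cycles       = 1_002_132_062
branches     =   972_186_713
stores       =   964_577_367
iterations   = 1_000_000_000  # loop trip count from the script

print(round(instructions / cycles, 2))   # → 4.93 instructions per cycle
print(round(instructions / iterations))  # → 5 instructions per iteration
print(round(branches / iterations, 2))   # → 0.97, i.e. ~1 branch per iteration
print(round(stores / iterations, 2))     # → 0.96, i.e. ~1 store per iteration
```

The per-iteration figures land right on the five instructions, one store, and one branch visible in the trace dump, which is reassuring.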

Cool stuff!

The end

This is the level of visibility that I want to have into the programs I am working on. I am quite satisfied with this example. Now what I want to do is make it easy for Snabb Switch hackers to get this level of visibility into the practical code that they are working on.

nickdesaulniers commented 9 years ago

your link to pmu-tools doesn't work

lukego commented 9 years ago

@nickdesaulniers good catch! fixed.

radex commented 9 years ago

Holy shit! The CPU is actually executing the entire loop - all five instructions - in only one cycle. I am impressed.

Fun fact: the Mill processor is capable of executing over 30 instructions per cycle, each cycle, in general purpose workloads. Sure, it's not a shipping product just yet, but an interesting architecture for the future indeed.

ytti commented 8 years ago

+1 thanks for this, very interesting stuff