
Tracing JITs 4: Zooming in on a simple trace #9


lukego commented 8 years ago

Today I want to do a simple experiment to improve my mental model of how Snabb Switch code is just-in-time compiled into traces by LuaJIT. I am continuing with simple artificial examples like the ones in #6 and #8.

This is the code I want to look at today:

local counter = require("core.counter")

local n = 1e9
local c = counter.open("test")
local t = { [0] = 1, [1] = 10 }  -- this table is added since #8
for i = 1,n do
   counter.add(c, t[i%2])
end

which is functionally equivalent to the code in #8: it loops one billion times and increments a counter alternately by 1 and 10. The difference is that the decision of how much to increment the counter now depends on data (a table lookup) rather than control (an if statement). This is an optimization I have made to help the tracing JIT: I moved the variability from control (branching) into data (table lookup).
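
For contrast, here is a hypothetical reconstruction of the branchy version from #8 based on that description (not the exact code from that issue):

local counter = require("core.counter")

local n = 1e9
local c = counter.open("test")
for i = 1,n do
   -- the decision lives in control flow: each side of the branch
   -- takes a different path through the trace
   if i % 2 == 0 then
      counter.add(c, 1)
   else
      counter.add(c, 10)
   end
end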

Here is how it runs:

$ perf stat -e instructions,cycles ./snabb snsh -jdump=+r,dump.txt script3.lua
    15,023,389,554 instructions              #    3.74  insns per cycle
     4,021,039,287 cycles

Each iteration takes 4 cycles (and executes 15 instructions). This is indeed in between the version from #6 that runs in one cycle (5 instructions) and the version in #8 that runs in 15 cycles (36 instructions).

Here is what I am going to do now:

  1. Examine the machine code together with the LuaJIT Intermediate Representation (IR) for the loop.
  2. Make some simple tweaks and see what happens.

IR and mcode

Let me present three exhibits: the Intermediate Representation for the loop, the corresponding machine code, and a hand-interleaved combination of the two with some comments.

Here is the Intermediate Representation of the loop. As I read the dump format, the columns are: IR instruction number, allocated register, a flag (">" marks a guard that can exit the trace; "+" marks a value carried across loop iterations via a PHI), result type, opcode, and operands:

0030 ------------ LOOP ------------
0031 r15      int BAND   0028  +1
0032       >  int ABC    0013  0031
0033          p32 AREF   0015  0031
0034 xmm7  >  num ALOAD  0033
0035 r15      u64 CONV   0034  u64.num
0036 rbp    + u64 ADD    0035  0025
0037          u64 XSTORE 0021  0036
0038 rbx    + int ADD    0028  +1
0039       >  int LE     0038  0001
0040 rbx      int PHI    0028  0038
0041 rbp      u64 PHI    0025  0036

Here is the machine code that was generated for that IR code:

0bcafa70  mov r15d, ebx
0bcafa73  and r15d, +0x01
0bcafa77  cmp r15d, esi
0bcafa7a  jnb 0x0bca0018        ->2
0bcafa80  cmp dword [rdx+r15*8+0x4], 0xfffeffff
0bcafa89  jnb 0x0bca0018        ->2
0bcafa8f  movsd xmm7, [rdx+r15*8]
0bcafa95  cvttsd2si r15, xmm7
0bcafa9a  test r15, r15
0bcafa9d  jns 0x0bcafaad
0bcafa9f  addsd xmm7, [0x41937b60]
0bcafaa8  cvttsd2si r15, xmm7
0bcafaad  add rbp, r15
0bcafab0  mov [rcx+0x8], rbp
0bcafab4  add ebx, +0x01
0bcafab7  cmp ebx, eax
0bcafab9  jle 0x0bcafa70        ->LOOP

Here is an interleaved combination of the two where I have added comments explaining my interpretation of what is going on:

0030 ------------ LOOP ------------

;; Calculate i%2
0031 r15      int BAND   0028  +1
  0bcafa70  mov r15d, ebx
  0bcafa73  and r15d, +0x01

;; Array Bounds Check (ABC) of t[...]
0032       >  int ABC    0013  0031
  0bcafa77  cmp r15d, esi
  0bcafa7a  jnb 0x0bca0018        ->2
  0bcafa80  cmp dword [rdx+r15*8+0x4], 0xfffeffff
  0bcafa89  jnb 0x0bca0018        ->2

;; Lookup array element location (AREF) and value (ALOAD)
0033          p32 AREF   0015  0031
0034 xmm7  >  num ALOAD  0033
  0bcafa8f  movsd xmm7, [rdx+r15*8]

;; Convert array value from double float (Lua's native number format) into a uint64_t.
0035 r15      u64 CONV   0034  u64.num
  0bcafa95  cvttsd2si r15, xmm7
  0bcafa9a  test r15, r15
  0bcafa9d  jns 0x0bcafaad

;; Add the value from the table (xmm7) to the counter value (rbp).
;;
;; XXX Is there duplicate work here with a second 'cvttsd2si r15,xmm7'?
;;     (that converts the double float in xmm7 to an integer in r15)
0036 rbp    + u64 ADD    0035  0025
  0bcafa9f  addsd xmm7, [0x41937b60]
  0bcafaa8  cvttsd2si r15, xmm7
  0bcafaad  add rbp, r15

;; Store the updated counter value to memory.
0037          u64 XSTORE 0021  0036
  0bcafab0  mov [rcx+0x8], rbp

;; Increment the loop index.
0038 rbx    + int ADD    0028  +1
  0bcafab4  add ebx, +0x01

;; Check for loop termination.
0039       >  int LE     0038  0001
  0bcafab7  cmp ebx, eax
  0bcafab9  jle 0x0bcafa70        ->LOOP

0040 rbx      int PHI    0028  0038
0041 rbp      u64 PHI    0025  0036

I may well have made a mistake in this interpretation. I am also not certain whether the machine code strictly matches the IR code or to what extent it can be merged and shuffled around. I would like to understand this better because I have a fantasy that LuaJIT could automatically generate the interleaved view and that this might make traces easier for me to read.

So what jumps out from this?

  1. The IR and the machine code do seem to match up pretty neatly. I am glad to have both because I would have had a hard time recognizing the ABC (Array Bounds Check) just from looking at the machine code.
  2. The loop is quite long: 15 instructions. That is triple the length of the simpler loop from #6.
  3. The loop executes in four cycles. That is an average of 3.75 instructions executed per cycle. I am still really impressed with Intel CPUs.
  4. There seems to be some busy-work that could be optimized away: the array bounds check and the float/integer conversions.

Tweak 1: LUAJIT_NUMMODE

First I take the opportunity to try a little bit of voodoo. LuaJIT supports several number modes that can be chosen at compile time. What is a number mode? I don't really know. Mike Pall has commented that on x86_64 there are several options and some may be faster than others depending on the mix of integer and floating point operations.
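
For anyone reproducing this: a number mode is chosen with a compile-time define when building LuaJIT, along these lines (a sketch; the define appears among the optional XCFLAGS in LuaJIT's src/Makefile):

$ make clean && make XCFLAGS="-DLUAJIT_NUMMODE=2"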

Just for fun I tried them all. Turned out that compiling LuaJIT with -DLUAJIT_NUMMODE=2 improved this example significantly:

    12,022,432,242 instructions              #    3.98  insns per cycle
     3,020,610,875 cycles

Now we are down to 3 cycles per iteration (for 12 instructions).

Here is the IR:

0030 ------------ LOOP ------------
0031 r15      int BAND   0028  +1
0032       >  int ABC    0013  0031
0033          p32 AREF   0015  0031
0034       >  int ALOAD  0033
0035 r15      u64 CONV   0034  u64.int sext
0036 rbp    + u64 ADD    0035  0025
0037          u64 XSTORE 0021  0036
0038 rbx    + int ADD    0028  +1
0039       >  int LE     0038  0001
0040 rbx      int PHI    0028  0038
0041 rbp      u64 PHI    0025  0036

Here is the mcode:

->LOOP:
0bcafaa0  mov r15d, ebx
0bcafaa3  and r15d, +0x01
0bcafaa7  cmp r15d, esi
0bcafaaa  jnb 0x0bca0018        ->2
0bcafab0  cmp dword [rdx+r15*8+0x4], 0xfffeffff
0bcafab9  jnz 0x0bca0018        ->2
0bcafabf  movsxd r15, dword [rdx+r15*8]
0bcafac3  add rbp, r15
0bcafac6  mov [rcx+0x8], rbp
0bcafaca  add ebx, +0x01
0bcafacd  cmp ebx, eax
0bcafacf  jle 0x0bcafaa0        ->LOOP

Interesting. I am tempted to submit a Pull Request to Snabb Switch that enables -DLUAJIT_NUMMODE=2 and see what impact that has on the performance tests that our CI runs. However, I am generally reluctant to apply optimizations that I don't understand reasonably well.

Tweak 2: FFI table

This time I will try a more straightforward change.

The problem I see is that we are doing a bunch of work to check array bounds and convert the table values from floats to ints. Let us try to avoid this by replacing the high-level Lua table with a low-level FFI array of integers.

local counter = require("core.counter")
local ffi = require("ffi")

local n = 1e9
local c = counter.open("test")
local t = ffi.new("int[2]", 1, 10)  -- allocate table as FFI object
for i = 1,n do
   counter.add(c, t[i%2])
end

This actually works pretty well:

     9,022,173,328 instructions              #    4.46  insns per cycle
     2,022,321,563 cycles

Now we are down to two cycles per iteration (for 9 instructions).

Here is the IR:

0032 ------------ LOOP ------------
0033 r15      int BAND   0030  +1
0034 r15      i64 CONV   0033  i64.int sext
0035          i64 BSHL   0034  +2
0036          p64 ADD    0035  0012
0037          p64 ADD    0036  +8
0038          int XLOAD  0037
0039 r15      u64 CONV   0038  u64.int sext
0040 rbp    + u64 ADD    0039  0027
0041          u64 XSTORE 0023  0040
0042 rbx    + int ADD    0030  +1
0043       >  int LE     0042  0001
0044 rbx      int PHI    0030  0042
0045 rbp      u64 PHI    0027  0040

Here is the mcode:

->LOOP:
0bcafaa0  mov r15d, ebx
0bcafaa3  and r15d, +0x01
0bcafaa7  movsxd r15, r15d
0bcafaaa  movsxd r15, dword [rdx+r15*4+0x8]
0bcafaaf  add rbp, r15
0bcafab2  mov [rcx+0x8], rbp
0bcafab6  add ebx, +0x01
0bcafab9  cmp ebx, eax
0bcafabb  jle 0x0bcafaa0        ->LOOP
0bcafabd  jmp 0x0bca001c        ->3

This experiment feels more satisfying. I was able to identify redundant code, eliminate it in a sensible way, and verify that performance improved.

The end

Morals of this story:

  1. Trace compilers and CPUs are still fun and interesting.
  2. Getting the right information is really important: what is actually running (mcode) and why (IR). The process of piecing this together is quite interesting and may lead to automated shortcuts in the future.
  3. Relatively naive optimization techniques can be effective. In this case we gained 7.5x performance simply by looking at what happens and asking ourselves: "can we massage this to run inside one trace?", "can we get rid of the Array Bounds Check (ABC)?", "can we get rid of the float to integer conversion (cvttsd2si)?".
wingo commented 8 years ago

Have you tried combining hacks 1 and 2? The movsxd instructions in your last trace are kinda useless and different number modes might help.

lukego commented 8 years ago

@andywingo This is interesting...

NUMMODE did not make a difference but I was able to eliminate one of those instructions by declaring the array as long instead of int. Here is the code side-by-side with the tweaked version on the left and the original on the right:

NEW                            ORIGINAL
mov r15d, ebx                  mov r15d, ebx
and r15d, +0x01                and r15d, +0x01
movsxd r15, r15d               movsxd r15, r15d
                               movsxd r15, dword [rdx+r15*4+0x8]
add rbp, [rdx+r15*8+0x8]       add rbp, r15
mov [rcx+0x8], rbp             mov [rcx+0x8], rbp
add ebx, +0x01                 add ebx, +0x01
cmp ebx, eax                   cmp ebx, eax
jle 0x0bcafaa0        ->LOOP   jle 0x0bcafaa0        ->LOOP
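
For reference, the source-level change behind the NEW column is just the FFI declaration (a sketch against the script from the previous comment):

local t = ffi.new("long[2]", 1, 10)  -- 64-bit elements: the load needs no
                                     -- 32->64 bit sign extension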

The interesting thing is that despite the extra instruction in the original, these two versions seem to have exactly the same performance on a Haswell: they execute in the same number of cycles, they keep the execution units occupied in the same proportions, and they execute the same number of micro-ops.

Original version:

     9,109,469,170 instructions              #    4.51  insns per cycle
     2,019,864,136 cycles                         [45.14%]
     7,929,698,496 uops_executed_core             [45.11%]
     1,113,614,873 uops_executed_port_port_0_core [45.35%]
     1,121,795,934 uops_executed_port_port_1_core [47.01%]
       766,576,157 uops_executed_port_port_2_core [37.37%]
       784,090,078 uops_executed_port_port_3_core [37.17%]
     1,005,587,997 uops_executed_port_port_4_core [37.03%]
     1,227,980,657 uops_executed_port_port_5_core [36.81%]
     1,514,246,000 uops_executed_port_port_6_core [36.55%]
       440,057,517 uops_executed_port_port_7_core [35.98%]

New version:

     8,054,902,028 instructions              #    4.08  insns per cycle         [45.76%]
     1,972,087,109 cycles                         [45.76%]
     7,956,060,069 uops_executed_core             [45.68%]
     1,130,709,507 uops_executed_port_port_0_core [45.68%]
     1,158,165,282 uops_executed_port_port_1_core [45.68%]
       692,751,903 uops_executed_port_port_2_core [36.31%]
       784,975,608 uops_executed_port_port_3_core [37.24%]
       990,362,807 uops_executed_port_port_4_core [37.03%]
     1,206,127,212 uops_executed_port_port_5_core [36.81%]
     1,479,723,444 uops_executed_port_port_6_core [36.60%]
       517,943,434 uops_executed_port_port_7_core [36.39%]

Could the instruction decoder in the Haswell actually be eliminating the unnecessary work while translating into micro-ops, so that the execution units in the backend don't actually do any work for the extra instruction? (Is that what they call macro-fusion in the CPU front-end?)

Or it could just be that I am rounding the numbers from perf too fuzzily.. :-)

lukego commented 8 years ago

@andywingo Tricky... I had the impression from the perf stat output that the CPU execution units were mostly idle (none more than half occupied). However, that is not what it says at all. For example, Execution Unit 6 (ALU/Branch) executed an average of 1.5 micro-ops every 2 cycles.

My confusion comes from the [45.76%] numbers in the right-hand column. My brain automatically treated these as a ratio compared to CPU cycles (as with the instructions per cycle annotation). That is not what they mean at all, though. I think they actually say what fraction of the time each counter was being collected while perf multiplexed the available performance counter hardware, and perf extrapolates its reported values from those samples.
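
A quick sanity check of that reading (my arithmetic, not perf's):

-- a counter that collected 1000 events while scheduled 20% of the time
-- is extrapolated to 1000 / 0.20 = 5000 events for the whole run
local function extrapolate(raw_count, running_fraction)
   return raw_count / running_fraction
end
print(extrapolate(1000, 0.20))  --> 5000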

Generally I would really like to have more control over performance counter monitoring, for example to count events separately for different loops in the code (Snabb apps). If a loop executes for around a thousand instructions then it seems like it should be meaningful to track its performance counters separately without too much risk of measurement error.

Could look into PAPI or Linux/perf primitives for self-profiling programs. That could really make life easier. Unlike these toy example programs, most of the programs I want to analyze don't spend 99% of their execution time in one loop, and that makes them less compatible with a simple perf stat command line.

wingo commented 8 years ago

I asked JSC people about PAPI a long time ago and they said "why bother, we have a JIT compiler, we can just emit the instructions we need to get the information we care about". Sounds like this is a thing that could be added to LuaJIT.

lukego commented 8 years ago

I'm looking at jevents from pmu-tools now... Could clone in LuaJIT? Looks tiny

wingo commented 8 years ago

I actually don't get why luajit is doing all the movsxd's. mov r15d, ebx should set the upper 32 bits of r15 to zero anyway (section 3.4.1.1 of intel basic architecture manual entitled "General-Purpose Registers in 64-Bit Mode"). The and doesn't change this. Range inference should prove that the loop variable is between 1 and 1e9. Therefore the movsxd is useless.

lukego commented 8 years ago

Hmmm.. Is the kernel perf events business really useful or do we just need an RDPMC instruction? (I don't think we care much about context switches, interrupts, etc., so a very raw interface might be fine.)

lukego commented 8 years ago

Also should check out Agner on this: http://agner.org/optimize/#testp

lukego commented 8 years ago

@andywingo btw it's now really easy to mix asm with Lua since SnabbCo/snabbswitch#565.

lukego commented 8 years ago

Wrong issue :-) I mean https://github.com/SnabbCo/snabbswitch/pull/575

Githubbing via phone..

lukego commented 8 years ago

@andywingo I have been digging a little bit to find a suitable software interface to access performance monitoring counters. Currently it looks like Linux perf_event_open(2) is the right primitive and then a LuaJIT clone of pmu-tools jevents could be added on top to easily recognize all of the specific counters available on various CPUs.
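
To make that concrete, here is a minimal sketch of opening and reading one counter via perf_event_open(2) from the LuaJIT FFI. The struct is abbreviated to the leading fields of linux/perf_event.h, padded out to PERF_ATTR_SIZE_VER0 (64 bytes), and the constants are the x86_64 values; treat the details as assumptions to verify:

local ffi = require("ffi")

ffi.cdef[[
struct perf_event_attr {
   uint32_t type;
   uint32_t size;
   uint64_t config;
   uint64_t sample_period;
   uint64_t sample_type;
   uint64_t read_format;
   uint64_t flags;    /* disabled, inherit, ... as a packed bitfield */
   uint8_t  pad[16];  /* remaining fields left zeroed */
};
long syscall(long number, struct perf_event_attr *attr,
             int pid, int cpu, int group_fd, unsigned long flags);
long read(int fd, void *buf, unsigned long count);
]]

local attr = ffi.new("struct perf_event_attr")
attr.type   = 0   -- PERF_TYPE_HARDWARE
attr.size   = 64  -- PERF_ATTR_SIZE_VER0
attr.config = 0   -- PERF_COUNT_HW_CPU_CYCLES

-- 298 is __NR_perf_event_open on x86_64; pid=0 means this process and
-- cpu=-1 means any CPU. With all flag bits zero the counter starts
-- enabled. (May require lowering /proc/sys/kernel/perf_event_paranoid.)
local fd = tonumber(ffi.C.syscall(298, attr, 0, -1, -1, 0))
assert(fd >= 0, "perf_event_open failed")

-- ... run the code under measurement here ...

local count = ffi.new("uint64_t[1]")
ffi.C.read(fd, count, 8)
print("cycles:", tonumber(count[0]))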

I had hoped that we could simply use the RDPMC instruction to read arbitrary counters directly. However, the hardware is not that simple. The CPU can only track a small number of performance counters at any one time and these have to be initialized from ring 0. Agner seems to use a custom kernel module (?) for initialization and then limits himself to tracking a modest number of counters at one time (max 6).

I don't want to depend on a kernel module and I also want to be able to track lots of counters at the same time. Tracking performance issues with these counters sometimes seems like a "search for a needle in a haystack" problem. Low performance could be caused by a dcache miss, icache miss, dTLB miss, branch miss, cache coherency protocol conflicts, QPI latency, or a zillion other things. It would be nice to be able to check all of these in parallel in order to generate a hypothesis for what the problem could be. (perf can already do this today, but only for the whole process and not per-app.)

The kernel interface seems to improve on this quite nicely:

  1. Provide a generic way to specify which counters to track (no need for our own kernel module).
  2. Use sampling to multiplex many logical counters onto the available hardware resources. Different counters will be monitored at different times; the reports will say "Counter FOO collected 1000 events and was running 20% of the time" and so we can estimate that there would have been 5000 events had it been running continuously. (This is where those funny % numbers in perf output come from.)
  3. Provide a detailed event log in a memory region that we can mmap. We could scan this log to find interesting information like the addresses of cache misses which we could then translate into higher level terms (e.g. memory region of packets vs. VM vs. Lua, etc).

The main element lacking (?) from the kernel interface seems to be broad support for all the zillions of counters that exist in a given CPU, e.g. a Xeon Haswell E5-26xx. This is added by the jevents C library and its companion Python program, which detects what CPU you have and downloads the event specification as a JSON file. This sounds quite handy but it is not very Snabbish: I don't think we want to link a C JSON parser library into our process nor add Python as a build dependency. So this would perhaps turn into a 100-line LuaJIT program that would make it easy to track a chosen set of performance events for a specified region of code (e.g. separately for each app).
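
To make the idea concrete, here is a purely hypothetical sketch of what that 100-line program's interface might look like (every name below is invented for illustration):

local pmu = require("lib.pmu")  -- hypothetical module

-- the module would translate event names into raw counter
-- specifications for the detected CPU
local events = pmu.new({"cycles", "instructions", "dTLB-load-misses"})

for _, app in ipairs(apps) do
   events:start()
   app:pull()                    -- region of code to measure
   local counts = events:stop()  -- per-app snapshot of the counters
   print(app.name, counts.cycles, counts.instructions)
end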

The main alternative that may be a shortcut would be to integrate better with standard commands like perf stat if they could be taught to track different parts of the program separately (app by app). If that is easy then it could be a simpler option. On the other hand it seems like we have more potential to do cool stuff inside the LuaJIT process.

That is how it looks based on today's googling anyway.

bratao commented 8 years ago

@lukego, you would be a very good candidate to pick up LuaJIT development! How does that sound to you?

lukego commented 8 years ago

@bratao I will do my bit :).

mraleph commented 8 years ago

[I was just walking past]. Two quick observations:

  1. You misidentified part of the ALOAD as part of the ABC

    0033          p32 AREF   0015  0031
    0034 xmm7  >  num ALOAD  0033
    0bcafa80  cmp dword [rdx+r15*8+0x4], 0xfffeffff ;; this is a *num* type check associated with ALOAD.
    0bcafa89  jnb 0x0bca0018        ->2
    0bcafa8f  movsd xmm7, [rdx+r15*8]
  2. Instead of using a table you should really use bitwise math for this case: something like bit.bor(bit.band((i % 2) - 1, -9) + 10, 0); I think this would yield the best code you could hope to get from LuaJIT for this particular type of branchy code.
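
For reference, here is how that suggestion would slot into the script from the top of the thread (a sketch; checking the arithmetic by hand: i%2 == 0 gives band(-1, -9) + 10 = 1 and i%2 == 1 gives band(0, -9) + 10 = 10, matching the table):

local counter = require("core.counter")
local bit = require("bit")

local n = 1e9
local c = counter.open("test")
for i = 1,n do
   -- branch-free selection: 1 when i%2 == 0, 10 when i%2 == 1
   counter.add(c, bit.bor(bit.band((i % 2) - 1, -9) + 10, 0))
end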