lukego opened this issue 9 years ago
The jump from the root trace to the side trace should be here:

```
0bcafa10  test ebx, 0x1
0bcafa16  jnz 0x0bca0014    ->1
```

It should not be in the LOOP part. After the side trace is created, the root trace should be patched (changing "jnz 0x0bca0014" to "jnz 0x0bcaf922", the entry address of the side trace).
Let's look at a bad case for tracing JITs. The simple function we looked at in #6 worked really well but turns out to be quite fragile. Let us look at how to break it and see if we can learn something in the process.
I write this from my perspective as a humble application programmer who only needs to understand the fundamentals well enough to write efficient programs. I will not be offering any grand insights into compilers on this blog (though perhaps commenters will).
Background
Back in #6 I wrote a simple Lua module for Snabb Switch and found that the JIT compiled it surprisingly well and that the CPU executed the code surprisingly well too.
The code was a simple loop to bump a counter one billion times. The code is high-level Lua that even includes a hashtable lookup in the inner loop (looking up the `add` function from the `counter` table).

LuaJIT compiled this down to a trace: a linear block of machine code that contains at most one loop and otherwise no internal branches. The loop compiled down to five instructions:
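The original listings did not survive in this copy. As a stand-in, here is a minimal Lua sketch of the kind of loop being described; the `counter` table and its `add` function are hypothetical stand-ins for the Snabb Switch module from #6, not the post's actual code:

```lua
-- Hypothetical stand-in for the counter module: a table whose 'add'
-- function is found via a hashtable lookup on every iteration.
local counter = { count = 0 }
function counter.add(n)
   counter.count = counter.count + n
end

-- The hot loop: bump the counter one billion times.
for i = 1, 1e9 do
   counter.add(1)
end
print(counter.count)
```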
and I was pleasantly surprised when `perf stat` told me that my CPU executes all five of those instructions in one cycle.

That paints a very pretty picture. But it is easy to shake this up.
Just a little innocent change
Here is an innocent-looking new version of the program that contains an `if` to count 1 when the loop index is even and 10 otherwise:

How does this version run?
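The new listing is also missing here. A hedged sketch of the change as described (count 1 when the loop index is even, 10 otherwise), reusing the same hypothetical counter table:

```lua
local counter = { count = 0 }
function counter.add(n)
   counter.count = counter.count + n
end

for i = 1, 1e9 do
   if i % 2 == 0 then
      counter.add(1)   -- even loop index: count 1
   else
      counter.add(10)  -- odd loop index: count 10
   end
end
print(counter.count)
```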
Oops! Now each iteration takes 15 cycles and executes 36 instructions. That is a 15x slowdown.
What happened?
The high-level explanation is actually straightforward. The first version runs fast because the loop executes very few instructions: most of the work like table lookups has been "hoisted" to run before entering the loop. The second version runs slowly because it is frequently repeating this setup work.
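The effect of that hoisting can be pictured in source terms. This is a conceptual sketch of what the trace compiler achieves for the first version (my own illustration, not literally what LuaJIT emits, and `counter` is again a hypothetical stand-in):

```lua
local counter = { count = 0 }
function counter.add(n)
   counter.count = counter.count + n
end

-- The programmer wrote counter.add(1) inside the loop; conceptually the
-- JIT hoists the hashtable lookup of 'add' out of the loop, so the loop
-- body is reduced to the call and the addition.
local add = counter.add
for i = 1, 1e9 do
   add(1)
end
print(counter.count)
```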
To understand what happens we can read the JIT dumps for the first version and second version and draw pictures of the flow of control. In these diagrams each box is a trace i.e. a series of machine code instructions that will execute from top to bottom. Branches are drawn as arrows and there are two kinds: the loop back into an earlier part of the trace (at most one is allowed) or an exit to a different trace. The "hot" code that consumes the CPU is highlighted.
Here is a picture of the first version that runs fast:
and below is the full machine code. I don't really bother to read every instruction here. My observation is that the proportions match the diagram: quite a lot of instructions upon entry and then a small number of instructions in the loop.
So what changes in the second version that causes the inner loop to expand from 5 instructions up to 36? Here is the picture:
Now we have two traces: the original root trace and a new side trace. This is necessary because there is a branch (`if`) in our code and traces are not allowed to have internal branches. The root trace will internally handle the case when the loop index is even but it will exit to the side trace when the index is odd. The side trace then rejoins the root trace. The effect is that the loop alternates between staying inside the root trace and exiting to the side trace.

The picture also illustrates the two reasons why we execute so many instructions now. First, the side trace is bigger than the loop in the root trace (i.e. it contains more instructions). Second, when the side trace branches back to the root trace it re-enters at the top instead of taking a short-cut into the inner loop. This means that overall we execute more instructions.
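This alternating flow of control can be modelled in plain Lua. The toy model below (my own illustration, with a small iteration count) just counts how often each region of the diagram would run:

```lua
-- Toy model of the two-trace control flow: the root trace's inner loop
-- handles even indices; odd indices exit to the side trace, which then
-- re-enters the root trace at the top, re-running the hoisted setup.
local setup_runs, loop_runs, side_runs = 0, 0, 0
local i, N = 1, 10
while i <= N do
   setup_runs = setup_runs + 1      -- root-trace entry: hoisted setup work
   while i <= N do                  -- inner loop of the root trace
      if i % 2 ~= 0 then            -- guard fails on an odd index...
         side_runs = side_runs + 1  -- ...so the side trace runs
         i = i + 1
         break                      -- and re-enters the root trace at the top
      end
      loop_runs = loop_runs + 1     -- fast path: stay inside the inner loop
      i = i + 1
   end
end
print(setup_runs, loop_runs, side_runs)  -- prints 6, 5, 5 (tab-separated)
```

The setup work re-runs once after every side-trace exit, which is the pattern the diagram shows.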
Let us zoom in to a bit more detail: first to look at the inner loop in the root trace, then to look at the side trace, and finally to look at the complete root trace that is running every time the side trace branches back.
Here is the new loop in the root trace (with added comments):
The difference from the original trace is the two new instructions at the start. These test a guard for the trace (that the loop index must be even) and branch to an exit if this condition does not hold. So when the loop index happens to be even the execution will be very similar to the original version, but when the loop index is odd we will exit to the side trace.
Here is the code for the side trace.
I have not read this code in detail but here are a couple of observations: its length is consistent with the instruction counts reported by `perf`, and the branch at the end goes to `0bcaf9d4`, which turns out to be the beginning of the root trace (not the inner loop).

Here finally is the entire root trace, this time including the initial code before the loop that executes when the side trace branches back:
The trace executes an additional 38 instructions before entering the loop. This path will be taken on every second loop iteration, when the exit to the side trace is taken and it branches back to the top. That would seem to account for the rest of the instructions reported by `perf`.
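That accounting can be sanity-checked with some hedged arithmetic. The side trace's length is not quoted in this copy, so it is inferred here as a remainder, with the exit machinery lumped into it:

```lua
-- Back-of-the-envelope check of perf's 36 instructions/iteration figure.
local per_pair  = 36 * 2  -- instructions per even+odd pair of iterations
local even_iter = 5 + 2   -- original 5-instruction loop plus the 2-instruction guard
local odd_guard = 2       -- guard executed before the exit is taken
local reentry   = 38      -- setup re-executed when the side trace branches back
-- Whatever is left over is attributable to the side trace and exit handling:
print(per_pair - even_iter - odd_guard - reentry)  -- prints 25
```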
So...?
If I were a compiler expert then this is where I would explain why the code compiles in this way and provide interesting links to all the relevant research. But I am not. So all I can really state are my own personal observations.
Is a given operation, like the `counter.add(...)` module lookup, cheap or expensive? The answer seems very context dependent.

I find this all very interesting. The issues are subtle and complex, but most languages and platforms are subtle and complex to optimize when you first start taking them seriously. I am happy to keep geeking out on these ideas.
Thanks
Thanks to Alex Gall and Andy Wingo for mentioning the issue of side-traces re-entering their parent traces at the root, so that I could recognise it when I saw it too.