The unit of compilation in RaptorJIT

In #27 we said that "every call to a Lua function is inlined, always, without exception." But what are these calls inlined into? What is the unit of compilation?

A reasonable first approximation is to say that the JIT compiles each of the innermost loops in the program separately. Innermost loops are any loops that don't themselves contain loops, neither directly in their source code nor indirectly via a function they call. The unit of compilation is then from the start of each inner loop to its end, and all function calls made within the loop are inlined into the generated code.

This inlining means that each time you call a function in your source code the JIT will compile a separate copy of that function. Consider this inner loop that calls the same function multiple times:

function add(a, b) return a + b end

local x = 0
for i = 1, 100 do
    x = add(x, 42)
    x = add(x, "42")
end

Here we have two calls to the function add and parameter b is an integer in the first call but a string in the second call. (This is not an error because Lua automatically coerces strings to numbers.) Is the diversity of types for the b parameter going to cause problems with speculative optimization? No: the JIT will compile each of the calls separately and specialize the code for each call for separate argument types. This code compiles efficiently because at runtime each of the calls uses self-consistent types.

Let's be advanced and consider the merits of a couple of higher-order functions:

function apply(fn, x)
    return fn(x)
end

function fold(fn, val, array)
    for i = 1, #array do
       val = fn(val, array[i])
    end
    return val
end

Is this code appropriate for a tracing JIT? The potential problem is that each time we compile a call to these functions the JIT will speculatively inline the body of the function that we pass in as fn. This is extremely efficient if we are always passing the same value for fn but otherwise it is expensive. If our intention is to call these library functions many times with different function parameters then we have to consider how those calls will compile.

The answer is that apply is fine but fold is problematic. The reason that apply is fine is that each call will be compiled and optimized separately and so there will not be any interaction between the different fn values. The reason that fold is problematic is that it passes many different values of fn into a loop and calls it there. The loop in fold will be its own compilation unit, meaning that only one copy of this loop body will be compiled, and this compiled code will see all of the different values of fn at runtime. This will force the compiler to repeatedly mispredict the value that fn will have and to generate less-than-optimal code as a result.

The moral of this story is that if you want to optimize the same code for many different uses then you should be careful to make sure each use is compiled separately. Loosely speaking that means to avoid sharing the same inner loop in your source code for multiple uses.

How does that work in practice? In simple terms it means that instead of writing code like this:

local x = 0
x = fold(fn1, x, array)
x = fold(fn2, x, array)

You should simply write two separate loops (for...end) in your source code.

local x = 0
for i = 1, #array do x = fn1(x, array[i]) end
for i = 1, #array do x = fn2(x, array[i]) end

Here fn1 and fn2 are called from separate inner loops and so each will be compiled and optimized separately. Simple enough, eh? (This is assuming that these really are inner loops i.e. that neither fn1 nor fn2 contains a loop!)

If you are really determined to write higher-order functions that have attractive performance characteristics then take a look at advanced tricks like https://github.com/luafun/luafun/pull/33.

(Incidentally: These units of compilation are called traces.)

lukego / blog

The unit of compilation in RaptorJIT #28