Mechanical sympathy between modern CPUs and tracing JITs

Here is the bad news:

Learning to have mechanical sympathy for a modern CPU is hard.
Learning to have mechanical sympathy for a modern tracing JIT like RaptorJIT or LuaJIT is also hard.

Here is the good news:

Both are hard in many of the same ways. Pleasing one often implies pleasing the other too. You could say that the JIT and the CPU are sympathetic to each other.

Often what the JIT wants is an exaggerated version of what the CPU wants. You have to work hard to please the JIT but in doing so you also please the CPU.

Here are a few important cases where the CPU and the JIT are in sympathy.

Consistent control flow

CPUs and tracing JITs both crave consistent control flow. They both optimize by speculating (#25) on the outcome of every branch. The cost of the CPU mispredicting a branch is equivalent to executing dozens of instructions. The cost of the JIT mispredicting a branch can be to actually execute dozens of additional instructions. This means that making the control flow of your program consistent effectively optimizes for both the CPU and the JIT simultaneously.

Out-Of-Order execution

JIT code is based on speculative optimizations (#26). This means that the instructions doing "real work" are interleaved with extra instructions performing "guards" to ensure that the optimized code is applicable. Each guard performs some computation, for example to compare whether a given function definition is still the same as it was at compile time, and finishes with a conditional branch so that execution only continues if the guard succeeds.

On older "in order" CPUs these guard instructions would be a major drag on performance. The guards would always be "in the way" of the real work. The more guards you check, the less real work you do, the slower your application runs. Getting dynamic language code to run efficiently on in-order processors has historically required extreme measures, like the MIT Lisp Machine that microcoded special instructions to both do work (e.g. arithmetic) and check guards (e.g. detect non-number arguments) on the same cycle.

On modern "out of order" CPUs life is much better. Today's CPUs operate on a window of around 100 instructions at the same time and are able to execute many instructions in the same cycle and out of sequential order. Performance is often limited not by the number of instructions that have to be executed (throughput) but by the scheduling delays caused by inter-instruction dependencies (latency.)

Guards are amongst the cheapest instructions for the CPU to handle. Critically, they do not add any latency to the main computation with data dependencies, but only a control dependency that is essentially free when correctly predicted by the CPU. Guard instructions can often be executed by "soaking up" spare CPU resources that would otherwise be wasted. The CPU can execute guard instructions "in the background" when the CPU is underutilized e.g. when instructions for the main computation are waiting for data to load from cache before they make progress. The exact performance depends on context but in practice guard instructions are often very cheap to execute.

Tight loops

The JIT loves branchless inner loops above all else. These are optimized very effectively because the JIT inlines the entire body of the loop and performs extensive inter-procedural optimization including making sure that loop-invariant instructions are only executed once. If your code is spending its time running branchless inner loops then your performance is usually great.

The CPU loves branchless inner loops too. The small code footprint helps the CPU frontend to "feed the beast" with instructions by making effificent use of hardware resources like instruction cache and branch prediction cache. The consistent control flow also allows the CPU to fill its out-of-order instruction window and use the results of its speculative execution.

lukego / blog