Here is a very rough thought experiment following the discussion of JIT-CPU mechanical sympathy in #30. Let's look at a tiny loop of assembler code:
loop:
cmp rax, 0 ; Guard for invalid null value
je abort ; Branch if guard fails
mov rax, [rax] ; Load next value
cmp rax, rbx ; Check if the loaded value matches rbx
jnz loop ; Not a match? Keep looping
abort:
This loop follows a chain of pointers in rax trying to find a match for the value in rbx. A guard is used to detect the exceptional case where the pointer value is zero. Such guard code is typical of what the JIT would generate to "speculate" that a value is non-null.
It's roughly equivalent to this C code:
do {
    if (x == NULL) goto abort;  /* guard for invalid null value */
    x = (intptr_t*)*x;          /* load the next value */
} while (x != y);               /* no match? keep looping */
abort: ;
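If you want to actually run it, here is a self-contained version; the function name follow_chain and the tiny three-link test chain are my own scaffolding, not part of the original:
#include <stdint.h>
#include <stdio.h>

/* Follow the chain starting at x until a link equals y.
   Returns y on a match, or NULL if the null guard fires first. */
static intptr_t *follow_chain(intptr_t *x, intptr_t *y) {
    do {
        if (x == NULL) goto abort;  /* guard for invalid null value */
        x = (intptr_t *)*x;         /* load the next value */
    } while (x != y);
    return x;
abort:
    return NULL;
}

int main(void) {
    intptr_t c = 0;            /* chain terminates with a null link */
    intptr_t b = (intptr_t)&c; /* b -> c */
    intptr_t a = (intptr_t)&b; /* a -> b */
    printf("hit:  %d\n", follow_chain(&a, &c) == &c);               /* prints 1 */
    printf("miss: %d\n", follow_chain(&a, (intptr_t *)-1) == NULL); /* prints 1: the guard fires */
    return 0;
}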
What would the CPU do with this function?
First the frontend would fetch and decode the first couple of instructions:
cmp rax, 0
je abort
and then it would continue fetching instructions based on the prediction that the je branch is not taken. This would give us:
cmp rax, 0
je abort
mov rax, [rax]
cmp rax, rbx
jnz loop
and then the CPU would continue to fill up its window of ~100 concurrently executing instructions by predicting that the loop will keep going around and around:
cmp rax, 0
je abort
mov rax, [rax]
cmp rax, rbx
jnz loop
cmp rax, 0
je abort
mov rax, [rax]
cmp rax, rbx
jnz loop
cmp rax, 0
je abort
mov rax, [rax]
cmp rax, rbx
jnz loop
... until the ~100-entry instruction re-order buffer is full ...
So the CPU frontend will keep streaming copies of the loop into the backend for execution.
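To make that stream concrete, here is a toy model in C (my own simplification; a real frontend fetches in blocks, renames registers, and can rewind on a misprediction) of the frontend emitting predicted copies of the loop until a 100-entry window is full:
#include <stdio.h>

int main(void) {
    /* The five instructions of one loop iteration, as the frontend sees them. */
    const char *loop_body[] = {
        "cmp rax, 0", "je abort", "mov rax, [rax]", "cmp rax, rbx", "jnz loop"
    };
    enum { WINDOW = 100 };  /* roughly the size of a modern re-order buffer */

    /* Predict "je abort" not taken and "jnz loop" taken, forever:
       the predicted stream is just the loop body repeated. */
    for (int slot = 0; slot < WINDOW; slot++)
        printf("%3d: %s\n", slot, loop_body[slot % 5]);
    return 0;
}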
The backend will then infer some details of the data dependencies between these instructions:
- The conditional branch instructions depend on the flags from their preceding comparison instruction.
- The guard and the pointer-chasing of each iteration are independent of each other: neither uses the results of the other's instructions.
- The guard and the pointer-chasing both depend on the pointer-chasing done in the previous iteration, i.e. on the value loaded into rax.
This means that on each iteration the CPU can execute two independent chains of instructions with instruction-level parallelism (ILP): the guard chain (cmp rax, 0 feeding je abort) and the pointer-chasing chain (mov rax, [rax] feeding cmp rax, rbx feeding jnz loop).
This is good. The pointer-chasing instructions are not delayed waiting for the guard to complete. Everything can run in parallel. So does that mean that the guards are for free?
Yep!
But to really know that, we have to consider whether the overall performance of this code is limited by throughput (how quickly new instructions can be issued) or by latency (how long it takes for old instructions to complete). If the limit is throughput, then we have to pay for the guards, because they compete with the pointer-chasing for scarce execution resources. If the limit is the latency of the pointer-chasing code, then the guards are actually free, because they only use CPU capacity that would otherwise be wasted.
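One way to convince yourself is to measure. Here is a minimal C microbenchmark sketch (entirely my own construction, not from #30: the ring-of-slots setup, the names slots and now_sec, and the hop count are all illustrative assumptions) that chases pointers around a ring with and without a null guard on every hop. If the chase really is latency-bound, the two variants should report roughly the same time per hop:
/* Build with e.g. `cc -O2 chase.c`. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

enum { NODES = 1 << 20 };          /* 1M pointer-sized slots */
static const long HOPS = 1L << 27; /* ~134M hops per variant */

static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    /* Slot i holds a pointer to slot i+1; the last slot closes the ring.
       Every load therefore depends on the previous load's result. */
    void **slots = malloc(NODES * sizeof *slots);
    if (!slots) return 1;
    for (long i = 0; i < NODES; i++)
        slots[i] = &slots[(i + 1) % NODES];

    /* Variant 1: plain pointer chase, no guard. */
    void *p = slots[0];
    double t0 = now_sec();
    for (long i = 0; i < HOPS; i++)
        p = *(void **)p;
    double t1 = now_sec();

    /* Variant 2: the same chase with a null guard on every hop,
       like the JIT-generated code in the example. */
    void *q = slots[0];
    double t2 = now_sec();
    for (long i = 0; i < HOPS; i++) {
        if (q == NULL) abort();  /* the guard */
        q = *(void **)q;
    }
    double t3 = now_sec();

    /* Print (and thereby use) the results so the loops can't be optimized away. */
    printf("no guard:   %.2f ns/hop\n", (t1 - t0) / HOPS * 1e9);
    printf("with guard: %.2f ns/hop\n", (t3 - t2) / HOPS * 1e9);
    printf("(%p %p)\n", p, q);
    free(slots);
    return 0;
}
Because the ring is laid out sequentially, the hardware prefetcher should keep each hop near L1 latency, which is exactly the regime described above; shuffling the slots into a random permutation would push each hop out toward cache-miss latency and make the guard look cheaper still.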
For this code, performance will specifically be limited by the latency of the memory loads that are chained between iterations of the loop:
mov rax, [rax] ; iteration 1
mov rax, [rax] ; iteration 2
mov rax, [rax] ; iteration 3
...
and everything else is irrelevant. The reason is that each load will take at least four cycles to complete (the L1 cache hit latency) and the next load always has to wait for the previous one to finish (to get the right address).
So the CPU will have at least four cycles to execute the instructions for each iteration of the loop. That's much more than enough time to execute the four cheap instructions that accompany each load. The CPU backend will be underutilized whether the guard instructions are there or not.
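To put rough numbers on it (assuming a typical 4-wide x86 core; exact issue widths vary by microarchitecture): the load chain forces at least 4 cycles per iteration, which gives the backend roughly 4 x 4 = 16 issue slots per iteration. The loop needs 5 of them with the guard and 3 without it, so more than two thirds of the machine's capacity is idle either way, waiting on the next load.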
Piece of cake, right? :grimace:
See also this old thread about microbenchmarking different kinds of guards: https://github.com/LuaJIT/LuaJIT/issues/248.