lukego / blog

Luke Gorrie's blog

Thought experiment on guard instructions and CPU micro architecture #31

Open lukego opened 6 years ago

lukego commented 6 years ago

Here is a very rough thought experiment following the discussion of JIT-CPU mechanical sympathy in #30. Let's look at a tiny loop of assembler code:

loop:
      cmp rax, 0      ; Guard for invalid null value
      je abort        ; Branch if guard fails
      mov rax, [rax]  ; Load next value
      cmp rax, rbx    ; Check if the loaded value matches rbx
      jnz loop        ; No? Continue
abort:

This loop follows a chain of pointers in rax trying to find a match for the value in rbx. A guard is used to detect the exceptional case where the pointer value is zero. Such guard code is typical of what the JIT would generate to "speculate" that a value is non-null.

It's roughly equivalent to this C code:

  do {
    if (x == NULL) goto abort;
    x = (intptr_t*)*x;
  } while (x != y);
 abort:

What would the CPU do with this function?

First the frontend would fetch and decode the first couple of instructions:

cmp rax, 0
je abort

and then it would continue fetching instructions based on the prediction that the je branch is not taken. This would give us:

cmp rax, 0
je abort
mov rax, [rax]
cmp rax, rbx
jnz loop

and then the CPU would continue to fill up its window of ~100 concurrently executing instructions by predicting that the loop will keep going around and around:

cmp rax, 0
je abort
mov rax, [rax]
cmp rax, rbx
jnz loop
cmp rax, 0
je abort
mov rax, [rax]
cmp rax, rbx
jnz loop
cmp rax, 0
je abort
mov rax, [rax]
cmp rax, rbx
jnz loop
... until the ~100-entry instruction re-order buffer is full ...

So the CPU frontend will keep streaming copies of the loop into the backend for execution.

The backend will then infer the data dependencies between these instructions: each iteration's guard (cmp/je) reads the value loaded by the previous iteration, but nothing else depends on the guard, while each load depends on the previous load.

This means that on each iteration the CPU can execute two independent chains of instructions with instruction-level parallelism (ILP):

Guard         Pointer-chase
------------  --------------
cmp rax, 0    mov rax, [rax]
je abort      cmp rax, rbx
              jnz loop

This is good. The pointer-chasing instructions are not delayed waiting for the guard to complete. Everything can run in parallel. So does that mean that the guards are for free?

Yep!

But to really know that, we have to consider whether the overall performance of this code is limited by throughput (how quickly new instructions can be issued) or by latency (how long old instructions take to complete). If the limit is throughput then we do have to pay for the guards, because they are competing with the pointer-chasing for scarce execution resources. If the limit is the latency of the pointer-chasing code then the guards really are free, because they only use CPU capacity that would otherwise be wasted.

For this code, performance will be limited specifically by the latency of the memory loads that are chained from one iteration of the loop to the next:

mov rax, [rax]
mov rax, [rax]
mov rax, [rax]
mov rax, [rax]
mov rax, [rax]
...

and everything else is irrelevant. The reason is that each load takes at least four cycles to complete (L1 cache latency) and the next load always has to wait for the previous one to finish (to get the right address).

So the CPU will have at least four cycles to execute the instructions for each iteration of the loop. That's much more than enough time to execute the four cheap instructions that accompany each load. The CPU backend will be underutilized whether the guard instructions are there or not.

Piece of cake, right? :grimacing:

See also this old thread about microbenchmarking different kinds of guards: https://github.com/LuaJIT/LuaJIT/issues/248.