This snippet of code in your set_context subroutine:
pushq %r8
xorl %eax, %eax
ret
should be changed to:
xorl %eax, %eax
jmp *%r8
And likewise with swap_context.
Modern Intel and AMD CPU microarchitectures have a return stack buffer (RSB) that tracks call and ret invocations so they can speculatively execute past a ret instruction. A mispredicted ret will cause a guaranteed pipeline stall, which will seriously hurt your performance. By contrast, jmp *%r8 is speculated using the indirect branch predictor, which is likely to have a non-zero hit rate.
I can confirm that in my tests on an i5 650 (just swapping between two functions on one pinned thread and counting iterations), jmp makes the entire function about 50% faster.
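For context, here is a minimal sketch of what a full swap_context with the jmp tail might look like. The context layout (offsets for rsp, the callee-saved registers, and the resume rip) is my own assumption for illustration, not the layout from the library under discussion:

# Hypothetical swap_context(old, new): %rdi = old context, %rsi = new context.
# Assumed layout: 0:rsp 8:rbp 16:rbx 24:r12 32:r13 40:r14 48:r15 56:rip
swap_context:
    movq  (%rsp), %r8        # caller's return address = where to resume later
    leaq  8(%rsp), %rax      # rsp as it would be after a normal ret
    movq  %rax, 0(%rdi)
    movq  %rbp, 8(%rdi)
    movq  %rbx, 16(%rdi)
    movq  %r12, 24(%rdi)
    movq  %r13, 32(%rdi)
    movq  %r14, 40(%rdi)
    movq  %r15, 48(%rdi)
    movq  %r8,  56(%rdi)

    movq  0(%rsi),  %rsp     # switch to the new coroutine's stack
    movq  8(%rsi),  %rbp
    movq  16(%rsi), %rbx
    movq  24(%rsi), %r12
    movq  32(%rsi), %r13
    movq  40(%rsi), %r14
    movq  48(%rsi), %r15
    movq  56(%rsi), %r8      # resume address of the new coroutine
    xorl  %eax, %eax
    jmp   *%r8               # tail-jump instead of push+ret: avoids the
                             # guaranteed RSB misprediction described above

The key point is only the last two instructions; the rest is just a conventional save/restore of the System V callee-saved registers.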