Special-case `rep ret` to return flags to caller routine

tkchia commented 1 year ago

This allows UPX's (https://upx.github.io/) x86-64 unpacker routines to work properly. UPX relies on a callee being able to return a carry flag to its caller, and also just happens to use rep ret.
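To make the idea concrete, here is a minimal sketch of how an emulator might tell the two return forms apart. It is not Blink's actual code: the names `ret_kind`, `classify_ret`, and `must_preserve_flags` are hypothetical, and the policy that only `rep ret` keeps flags live across the return is an assumption drawn from the description above. The byte encodings (`C3` for a near `ret`, `F3` for the `rep` prefix) are the standard x86 ones.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical classification of a return instruction; not taken from Blink. */
enum ret_kind {
  RET_NONE,  /* bytes do not start with a near return */
  RET_PLAIN, /* C3: plain near `ret` */
  RET_REP    /* F3 C3: `rep ret` */
};

/* Classify the instruction starting at `code`, which holds `len` bytes. */
enum ret_kind classify_ret(const unsigned char *code, size_t len) {
  assert(code != NULL);
  if (len >= 2 && code[0] == 0xF3 && code[1] == 0xC3) return RET_REP;
  if (len >= 1 && code[0] == 0xC3) return RET_PLAIN;
  return RET_NONE;
}

/* Assumed policy: only `rep ret` obliges the emulator to hand the
   callee's flags (e.g. carry) back to the caller. */
int must_preserve_flags(enum ret_kind kind) {
  return kind == RET_REP;
}
```

Under this assumed policy, a plain `ret` keeps whatever fast path the emulator already has, and only the rarer two-byte form pays for materializing flags.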

ghaerr commented 1 year ago

Hello @tkchia,

Agreed, this is a pretty brilliant way of handling UPX's requirement without adding overhead.

I am curious though: what exactly does rep ret do on real hardware, and why does UPX just happen to use it?

Thank you!

tkchia commented 1 year ago

Hello @ghaerr,

> what exactly does rep ret do on real hardware, and why does UPX just happen to use it?

`rep ret` does essentially the same thing as a plain `ret`; in practice the `rep` prefix is ignored on a return. It was reportedly useful for working around a branch-prediction performance issue on some older AMD processors. GCC will actually emit `rep ret` under certain conditions:

$ cat test5.c
#include <inttypes.h>

int y;

uint8_t
f (void)
{
  int x = y;
  return x == 2 ? 1 : x * x;
}
$ gcc -S -O3 test5.c -fno-pic -static -fomit-frame-pointer -mtune=k8
$ cat test5.s
    .file   "test5.c"
    .text
    .p2align 4
    .globl  f
    .type   f, @function
f:
.LFB0:
    .cfi_startproc
    endbr64
    movl    y(%rip), %edx
    movl    $1, %eax
    cmpl    $2, %edx
    je  .L1
    movl    %edx, %eax
    imull   %edx, %eax
.L1:
    rep ret
    .cfi_endproc
...

The UPX decompression routines are partly hand-coded in assembly, but presumably the authors also had this AMD performance issue in mind.

Thank you!

ghaerr commented 1 year ago

Hello @tkchia,

Thank you for your explanation, quite interesting. It's kind of amazing what CPU designers do for branch prediction and pipeline optimizations long after the CISC instruction set was defined. However, I am coming closer to the realization that nothing should surprise me when it comes to the (over)complexity of AMD and Intel x86 processors!

jart / blink

Special-case `rep ret` to return flags to caller routine #138