microsoft / llvm-mctoll

llvm-mctoll
Other
816 stars 125 forks source link

Fix stack promotion of moved registers #122

Closed martin-fink closed 2 years ago

martin-fink commented 3 years ago

This PR consists of two commits:

  1. [X86-64] Do not look up values for xor operations that set a register…
  2. [X86-64] Only replace dominating uses of promoted register with load …

I had to include commit 1., as without it, commit 2. would expose a bug that previously existed, but did not fail mctoll's unit tests.

  1. Do not look up register operand values for xor eax, eax instructions that just zero out a register. I've encountered issues where getRegOperandValue would generate a load instruction from a promoted stack slot. The loaded value was not used, but no issues would arise during tests.
    
    main:
    movabs rdi, offset .L.str
    call strlen # rax defined here
    cmp eax, 0
    je .eq

.neq: mov esi, eax cmp eax, 0 jne .print

.eq: mov esi, 0 jmp .print

.print: movabs rdi, offset .L.str mov al, 0 xor esi, esi call printf

xor eax, eax
ret

The above, compiled and raised, would generate the following output (just relevant parts shown here):

```ll
define dso_local i32 @main() {
entry:
  %ESI-SKT-LOC = alloca i32, align 4
  %EAX = call i32 @strlen(i8* getelementptr inbounds ([30 x i8], [30 x i8]* @rodata_11, i32 0, i32 4))
  %ld-stk-prom8 = load i32, i32* %ESI-SKT-LOC, align 4
  %0 = sub i32 %ld-stk-prom8, 0 ; <-- (A) loading from stack slot, but stack slot has not been written yet!
  %ZF = icmp eq i32 %0, 0
  %CmpZF_JE = icmp eq i1 %ZF, true
  br i1 %CmpZF_JE, label %bb.2, label %bb.1

bb.1:                                             ; preds = %entry
  %6 = sub i32 %EAX, 0
  %7 = call { i32, i1 } @llvm.usub.with.overflow.i32(i32 %EAX, i32 0)
  %CF1 = extractvalue { i32, i1 } %7, 1
  %ZF2 = icmp eq i32 %6, 0
  store i32 %EAX, i32* %ESI-SKT-LOC, align 1
  %CmpZF_JNE = icmp eq i1 %ZF2, false
  br i1 %CmpZF_JNE, label %bb.3, label %bb.2

bb.2:                                             ; preds = %bb.1, %entry
  store i32 0, i32* %ESI-SKT-LOC, align 1
  br label %bb.3

bb.3:                                             ; preds = %bb.2, %bb.1
  %ESI = load i32, i32* %ESI-SKT-LOC, align 1 ; <-- (B) not needed!
  %EAX9 = call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([30 x i8], [30 x i8]* @rodata_11, i32 0, i32 4), i32 0)
  ret i32 0
}

After commit 1, (B) won't be generated in the code anymore.

In the above code, you can also see the issue that commit 2 fixes. Since we are doing mov esi, eax in bb.1, mctoll replaces all uses of %EAX with a load from the stack slot. This is seen at (A) in the above code. ESI-SKT-LOC has not been written yet, but we're loading from the stack slot. This commit now just replaces uses if they dominate the store to the stack slot.

test/asm_test/X86/mov-reaching-def.s is raised to the following after commit 2.:

define dso_local i32 @main() {
entry:
  %RSI-SKT-LOC = alloca i32, align 4
  %EAX = call i32 @strlen(i8* getelementptr inbounds ([30 x i8], [30 x i8]* @rodata_11, i32 0, i32 4))
  %0 = sub i32 %EAX, 0 ; <-- Using %EAX here!
  %ZF = icmp eq i32 %0, 0
  %CmpZF_JE = icmp eq i1 %ZF, true
  br i1 %CmpZF_JE, label %bb.2, label %bb.1

bb.1:                                             ; preds = %entry
  %6 = sub i32 %EAX, 0
  %ZF2 = icmp eq i32 %6, 0
  store i32 %EAX, i32* %RSI-SKT-LOC, align 1 ; <-- store happening here, after this all uses will be loaded from stack slot
  %CmpZF_JNE = icmp eq i1 %ZF2, false
  br i1 %CmpZF_JNE, label %bb.3, label %bb.2

bb.2:                                             ; preds = %bb.1, %entry
  store i32 0, i32* %RSI-SKT-LOC, align 1
  br label %bb.3

bb.3:                                             ; preds = %bb.2, %bb.1
  %12 = load i32, i32* %RSI-SKT-LOC, align 1 ; <-- load from stack slot
  %RSI = zext i32 %12 to i64
  %EAX7 = call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([30 x i8], [30 x i8]* @rodata_11, i32 0, i32 4), i64 %RSI)
  ret i32 0
}
martin-fink commented 3 years ago

I've just noticed that branch instructions are not yet raised when promotePhysregToStackSlot is called, which means DominatorTree.dominates only works as expected when the two instructions passed to it are in the same basic block. I've changed the to use MachineDominatorTree, which is not as elegant as using DominatorTree would be. I have to look up the MBB for each instruction I'm checking, which results in some overhead while raising. Do you have any ideas on how this could be done more elegantly? I've also updated the test so it now also checks the behaviour I've mentioned in this comment.