pcc commented 10 years ago


Bugzilla Link	20376
Resolution	FIXED
Resolved on	Oct 23, 2014 18:36
Version	trunk
OS	All
CC	@compnerd,@jfbastien,@TNorthover

Extended Description

Given the following IR:

declare void @g() declare void @h() declare void @i()

define void @cx(i64 %foo, i64 %bar, i64 %baz) { %cx = cmpxchg i64 %foo, i64 %bar, i64 %baz seq_cst seq_cst %p = extractvalue { i64, i1 } %cx, 1 call void @g() br i1 %p, label %t, label %f

t: call void @h() ret void

f: call void @i() ret void }

we currently generate the following x86 code for the entry block of @cx:

cx: # @cx .cfi_startproc

BB#0:

    pushq   %rax

.Ltmp0: .cfi_def_cfa_offset 16 movq %rsi, %rax lock cmpxchgq %rdx, (%rdi) callq g jne .LBB0_2

This is clearly wrong, as the external call may clobber EFLAGS.

pcc commented 10 years ago

Thanks Tim. This appears to solve the problem in an unreduced test case I have.

TNorthover commented 10 years ago

OK, the change had no effect on the test-suite, and seems to pass all the spot checks I've performed (eliding the COPY where possible, emitting it where necessary), so I've committed it as r220529.

I think even if we want to improve cmpxchg CodeGen more, or take a completely different approach to the problem we'd want ScheduleDAG to represent that EFLAGS dependency properly.

TNorthover commented 10 years ago

Decided to try fixing ScheduleDAG itself, instead of the incoming DAG. Again to make sure the EFLAGS dependency between the SUnits is recognised (I think the DAG we produce is fairly explicit about the matter, actually rather better than the case that is handled of the implicit alignment between an instruction's implicit Uses/Defs and the DAG node's outputs).

This involved a couple of adaptations to recognise that CopyFromReg was special. The initial result was a pusfl/pop COPY, but with the advantage that it only happened when needed rather than appearing absolutely everywhere.

It also leaves open the potential future use of unfoldMemoryOperands to remove the COPY entirely.

On the whole, this is looking like the most promising approach so far. I'll see if I can break it now.

TNorthover commented 10 years ago

I'm now not even sure we can spot a DAG requiring duplication with purely local checks:

define i32 @&#8203;foo(i32* %addr, i32 %desired, i32 %new) {
  %res = cmpxchg i32* %addr, i32 %desired, i32 %new seq_cst seq_cst
  %success = extractvalue { i32, i1 } %res, 1

  %rhs = call i32 @&#8203;bar()

  %ret = select i1 %success, i32 %new, i32 %rhs
  ret i32 %ret
}

declare i32 @&#8203;bar()

Here the cmpxchg has to happen before the call, but the cmov has to happen afterwards so we want to avoid removing the SETCC. The only way you could discover this is by tracking all the dependencies of the CMOV in the DAG, which could get arbitrarily slow.

The difference from previous cases is that CMOV has no chain (unlike BRCOND), so there's not a unique trackable path from LCMPXCHG_DAG to it.

Next stop, Peter's peepholer idea (assuming I can disable the folding completely for CMPXCHG), which might cover the majority of interesting cases if we're lucky.

TNorthover commented 10 years ago

This looks like it's going wrong everywhere, and I've not yet found a neat solution.

Trying to create a similar failure with a SUB32rm as a proxy for the CMPXCHG, I noticed the ScheduleDAG split up the SUB32rm into a SUB32rr and a load using unfoldMemoryOperands. A similar transformation could be done to CMPXCHG, except that ScheduleDAG seems unable to deal with duplicating any SUnits involving CopyFromReg/CopyToReg (on multiple levels).

I can get it to recognise something needs to be done with some rather hairy modifications to the DAG (essentially representing EFLAGS as an i32 result of LCMPXCHG_DAG), but that just brings back the pushfl/pop instructions (though probably less pervasively).

The alternative is to only produce manageable DAGs in the first place. I think that could be made to work (check for exactly one use, adjacent in LCMPXCHG's chain), but I'm still unsure whether it can be done neatly.

jfbastien commented 10 years ago

From a chat with Tim on IRC, it looks like the problem is in InstrEmitter.cpp:

// If all uses are reading from the src physical register and copying the // register is either impossible or very expensive, then don't create a copy. if (MatchReg && SrcRC->getCopyCost() < 0) { VRBase = SrcReg; } else { // Create the reg, emit the copy. VRBase = MRI->createVirtualRegister(DstRC); BuildMI(*MBB, InsertPos, Node->getDebugLoc(), TII->get(TargetOpcode::COPY), VRBase).addReg(SrcReg); }

The first part is wrong. Removing it now generates this very bad code: movb $29, 11(%esp) movb $41, %cl movb $29, %al lock cmpxchgb %cl, 11(%esp) pushfl popl %esi

kill: EFLAGS ESI

sete    %al
movzbl  %al, %eax
movl    %eax, 4(%esp)
movl    $.L.str, (%esp)
calll   printf
pushl   %esi
popfl
jne .LBB0_2

This then causes 154 new test failures. They don't look like correctness failures (all extra instructions in CHECK-NEXT), but they're definitely performance failures which are probably unacceptable.

jfbastien commented 10 years ago

Proposed testcase Attaching a full FileCheck testcase.

pcc commented 10 years ago

More discussion/analysis and an (incorrect) patch here: http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20140721/226732.html

jfbastien commented 10 years ago

I just ran into this issues while validating LLVM 3.5.

The issues seems to have been caused by Tim's "r210903 - IR: add "cmpxchg weak" variant to support permitted failure" on June 13th, reviewed at: http://reviews.llvm.org/D4134

The issues crops up after Expand ISel Pseudo-instructions: the IR is correct, but the generated machine code isn't. It knows that EFLAGS is implicitly defined by cmpxchg, but it fails to use the materialized version of it, letting the call clobber it.

For this code: int main() {
uint8_t loc = 29, oldval = 29, newval = 41;
uint8_t res = __sync_bool_compare_and_swap(&loc, oldval, newval);
printf("%u", res);
if (res != 1)
exit(1);
return 0;
}

We get this IR (which is correct): define i32 @main() #0 {
entry:
%loc = alloca i8, align 1
store i8 29, i8 %loc, align 1, !tbaa !1
%0 = cmpxchg i8 %loc, i8 29, i8 41 seq_cst seq_cst
%1 = extractvalue { i8, i1 } %0, 1
%conv1 = zext i1 %1 to i32
%call = call i32 (i8, ...) @printf(i8 getelementptr inbounds ([3 x i8] @.str, i32 0, i32 0), i32 %conv1) #3
br i1 %1, label %if.end, label %if.then

if.then: ; preds = %entry
call void @exit(i32 1) #4
unreachable

if.end: ; preds = %entry
ret i32 0
}

And the following machine code gets generated (which is incorrect): MOV8mi <fi#0>, 1, %noreg, 0, %noreg, 29; mem:ST1%loc
%vreg0 = MOV8ri 29; GR8:%vreg0
%vreg1 = MOV8ri 41; GR8:%vreg1
%AL = COPY %vreg0; GR8:%vreg0
LCMPXCHG8 <fi#0>, 1, %noreg, 0, %noreg, %vreg1, %AL, %EFLAGS, %AL; mem:Volatile LDST1[%loc] GR8:%v$ %vreg2 = COPY %AL; GR8:%vreg2
%vreg3 = SETEr %EFLAGS; GR8:%vreg3
%vreg4 = MOVZX32rr8 %vreg3; GR32:%vreg4 GR8:%vreg3
ADJCALLSTACKDOWN32 8, %ESP<imp-def,dead>, %EFLAGS<imp-def,dead>, %ESP
%vreg5 = COPY %ESP; GR32:%vreg5
MOV32mr %vreg5, 1, %noreg, 4, %noreg, %vreg4; mem:ST4[Stack+4] GR32:%vreg5,%vreg4
MOV32mi %vreg5, 1, %noreg, 0, %noreg, ga:@.str; mem:ST4[Stack] GR32:%vreg5
CALLpcrel32 ga:@printf, , %ESP, %ESP, %EAX
ADJCALLSTACKUP32 8, 0, %ESP<imp-def,dead>, %EFLAGS<imp-def,dead>, %ESP
%vreg6 = COPY %EAX; GR32:%vreg6
JE_4 <BB#2>, %EFLAGS
JMP_4 <BB#1>

LLVM 3.4 used to generate: movb $29, 11(%esp) movb $41, %cl movb $29, %al lock cmpxchgb %cl, 11(%esp) movzbl %al, %esi cmpl $29, %esi sete %al movzbl %al, %eax movl %eax, 4(%esp) movl $.L.str, (%esp) calll printf cmpl $29, %esi jne .LBB0_2 That's pretty suboptimal since it ignore cmpxchg's flag setting behavior.

It now generates: movb $29, 11(%esp) movb $41, %cl movb $29, %al lock cmpxchgb %cl, 11(%esp) sete %al movzbl %al, %eax movl %eax, 4(%esp) movl $.L.str, (%esp) calll printf jne .LBB0_2 This is good because it takes into account the flag setting behavior, but ignores the fact that the ABI doesn't preserve flags.

FWIW GCC decides to use test %al, %al instead of cmp.

I looked at the ISel code a bit, but I'm pretty unfamiliar with it and the changes I tried haven't done what I thought they would so far. Help appreciated :)

compnerd commented 10 years ago

An alternative reproduction (also requiring cmpxchng) without a call that reproduces this clobbering:

define i32 @test(i32* %p, i32 %i, i32 %j) { entry: %cmp = icmp sgt i32 %i, %j br i1 %cmp, label %while.condthread-pre-split.i.preheader, label %cond.end

while.condthread-pre-split.i.preheader: ; preds = %entry br label %while.condthread-pre-split.i

while.condthread-pre-split.i: ; preds = %while.condthread-pre-split.i.preheader, %while.body.i %.pr.i = load i32* %p, align 4 br label %while.cond.i

while.cond.i: ; preds = %while.cond.i, %while.condthread-pre-split.i %0 = phi i32 [ %.pr.i, %while.condthread-pre-split.i ], [ 0, %while.cond.i ] %tobool.i = icmp eq i32 %0, 0 br i1 %tobool.i, label %while.cond.i, label %while.body.i

while.body.i: ; preds = %while.cond.i %.lcssa = phi i32 [ %0, %while.cond.i ] %1 = cmpxchg i32* %p, i32 %.lcssa, i32 %.lcssa seq_cst seq_cst %2 = extractvalue { i32, i1 } %1, 1 br i1 %2, label %cond.end.loopexit, label %while.condthread-pre-split.i

cond.end.loopexit: ; preds = %while.body.i br label %cond.end

cond.end: ; preds = %cond.end.loopexit, %entry %cond = phi i32 [ %i, %entry ], [ 0, %cond.end.loopexit ] ret i32 %cond }

pcc commented 10 years ago

assigned to @TNorthover

llvm / llvm-project

may fail to recognize potential EFLAGS clobbers for compare-exchange operations #20750

Extended Description

BB#0:

kill: EFLAGS ESI