DimitryAndric commented 10 years ago


Bugzilla Link	21273
Version	trunk
OS	All
CC	@ahatanak,@echristo,@emaste,@grimreaper

Extended Description

Depending on the optimization level, clang trunk r219624 seems to sometimes handle output operands incorrectly. This was reported to me by FreeBSD kernel developers, who attempted to compile the following:

int ivy_rng_store(long *buf) { long tmp; int retry;

retry = 10; asm volatile( "1:\n\t" "rdrand %2\n\t" / read randomness into tmp / "jb 2f\n\t" / CF is set on success, exit retry loop / "dec %0\n\t" / otherwise, retry-- / "jne 1b\n\t" / and loop if retries are not exhausted / "jmp 3f\n" / failure, retry is 0, used as return value / "2:\n\t" "mov %2,%1\n\t" / buf = tmp / "3:" : "+q" (retry), "=m" (buf), "=q" (tmp) : : "cc");

return (retry); }

E.g., the intent is that 'tmp' is just used for output, but the actual value is not used outside the inline asm. It is stored to *buf instead.

However, clang -O0 seems to have trouble keeping the two apart, as the resulting assembly is:

ivy_rng_store: # @ivy_rng_store .cfi_startproc

BB#0: # %entry

pushq   %rbp

.Ltmp0: .cfi_def_cfa_offset 16 .Ltmp1: .cfi_offset %rbp, -16 movq %rsp, %rbp .Ltmp2: .cfi_def_cfa_register %rbp movq %rdi, -8(%rbp) movl $10, -20(%rbp) movl -20(%rbp), %eax movq -8(%rbp), %rdi

APP

.Ltmp3: rdrandq %rdi jb .Ltmp4 decl %eax jne .Ltmp3 jmp .Ltmp5 .Ltmp4: movq %rdi, (%rdi) .Ltmp5:

NO_APP

movl    %eax, -20(%rbp)
movq    %rdi, -16(%rbp)
movl    -20(%rbp), %eax
popq    %rbp
retq

Clearly, the movq %rdi, (%rdi) is incorrect. This seems to be magically solved by enabling optimization, e.g. at -O1 or higher:

ivy_rng_store: # @ivy_rng_store .cfi_startproc

BB#0: # %entry

pushq   %rbp

.Ltmp0: .cfi_def_cfa_offset 16 .Ltmp1: .cfi_offset %rbp, -16 movq %rsp, %rbp .Ltmp2: .cfi_def_cfa_register %rbp movl $10, %eax

APP

.Ltmp3: rdrandq %rcx jb .Ltmp4 decl %eax jne .Ltmp3 jmp .Ltmp5 .Ltmp4: movq %rcx, (%rdi) .Ltmp5:

NO_APP

popq    %rbp
retq

Something similar happens when targeting i386 at -O0:

ivy_rng_store: # @ivy_rng_store

BB#0: # %entry

pushl   %ebp
movl    %esp, %ebp
subl    $12, %esp
movl    8(%ebp), %eax
movl    %eax, -4(%ebp)
movl    $10, -12(%ebp)
movl    -12(%ebp), %eax
movl    -4(%ebp), %ecx
#APP

.Ltmp0: rdrandl %ecx jb .Ltmp1 decl %eax jne .Ltmp0 jmp .Ltmp2 .Ltmp1: movl %ecx, (%ecx) .Ltmp2:

NO_APP

movl    %eax, -12(%ebp)
movl    %ecx, -8(%ebp)
movl    -12(%ebp), %eax
addl    $12, %esp
popl    %ebp
retl

However, on i386 optimization does not fix it, e.g. at -O1 or higher:

ivy_rng_store: # @ivy_rng_store

BB#0: # %entry

pushl   %ebp
movl    %esp, %ebp
movl    8(%ebp), %ecx
movl    $10, %eax
#APP

.Ltmp0: rdrandl %ecx jb .Ltmp1 decl %eax jne .Ltmp0 jmp .Ltmp2 .Ltmp1: movl %ecx, (%ecx) .Ltmp2:

NO_APP

popl    %ebp
retl

Changing the output constraint on 'tmp' to "+q" seems to help on amd64, but on i386 it still produces incorrect output at -O1 or higher optimization.

I tested the above code with different versions of gcc (4.7 through 5.0), but the resulting assembly was always as expected, at any optimization level.

DimitryAndric commented 10 years ago

As noted on the freebsd-hackers@ mailing list[1], the & modifier may actually be necessary for both 'retry' and 'tmp'.

Quoting from recent gcc documentation[2] (which is really the only reference... :), particularly:

======================================== The same problem can occur if one output parameter (a) allows a register constraint and another output parameter (b) allows a memory constraint. The code generated by GCC to access the memory address in b can contain registers which might be shared by a, and GCC considers those registers to be inputs to the asm. As above, GCC assumes that such input registers are consumed before any outputs are written. This assumption may result in incorrect behavior if the asm writes to a before using b. Combining the `&' constraint with the register constraint ensures that modifying a will not affect what address is referenced by b. Omitting the`&' constraint means that the location of b will be undefined if a is modified before using b.

[1] https://lists.freebsd.org/pipermail/freebsd-hackers/2014-October/046282.html [2] https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html

ahatanak commented 10 years ago

If I look at the debug dump during register allocation, nothing in the machine IR tells the register allocator to assign different registers for %1 and %2.

This is the debug dump I see when I compile the IR with "llc -O0 -debug-only=regalloc".

Allocating BB#0: derived from LLVM BB %entry Live Ins: %RDI %vreg0 = COPY %RDI; GR64:%vreg0 %vreg1 = COPY %vreg0; GR64:%vreg1,%vreg0 ... 3:> [sideeffect] [maystore] [attdialect], $0:[regdef:GR32], %vreg5<def,tied13>, $1:[mem], %vreg1, 1, %noreg, 0, %noreg, $2:[regdef:GR64], %vreg6, $3:[reguse tiedto:$0], %vreg5, $4:[clobber], %EFLAGS<earlyclobber,imp-def>, <>; GR32:%vreg5 GR64:%vreg1,%vreg6 ...

%vreg1 = COPY %vreg0; GR64:%vreg1,%vreg0 Regs: RDI=%vreg0* Killing last use: %vreg0 Assigning %vreg1 to %RDI

INLINEASM <es:1: rdrand $2 jb 2f dec $0 jne 1b jmp 3f 2: mov $2,$1 3:> [sideeffect] [maystore] [attdialect], $0:[regdef:GR32], %vreg5<def,tied13>, $1:[mem], %vreg1, 1, %noreg, 0, %noreg, $2:[regdef:GR64], %vreg6, $3:[reguse tiedto:$0], %vreg5, $4:[clobber], %EFLAGS<earlyclobber,imp-def>, <>; GR32:%vreg5 GR64:%vreg1,%verge ... Assigning %vreg6 to %RDI << INLINEASM <es:1: ...

%vreg1 and %vreg6 are the virtual registers that we should pay attention to in this example.

$1:[mem], %vreg1, 1, %noreg, 0, %noreg $2:[regdef:GR64], %vreg6

ahatanak commented 10 years ago

Using the early-clobber modifier "=&q"(tmp) seems to fix the problem, although I'm not sure it's a requirement to use early-clobber in this case.

llvm / llvm-project

inline asm incorrectly handles output operands sometimes #21647

Extended Description

BB#0: # %entry

APP

NO_APP

BB#0: # %entry

APP

NO_APP

BB#0: # %entry

NO_APP

BB#0: # %entry

NO_APP