x86 backend does not generate optimal code for shl

llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.

http://llvm.org

Other

28.93k stars 11.94k forks source link

x86 backend does not generate optimal code for shl #60009

Closed lucic71 closed 1 year ago

lucic71 commented 1 year ago

Hi,

Given the following IR:

define i32 @f(i32 %a, i32 %b) {
  %shift = shl i32 %a, %b
  %cmp = icmp uge i32 %b, 32
  %r = select i1 %cmp, i32 0, i32 %shift
  ret i32 %r
}

the generated code by the x86 backend is:

f:                                      # @f
        mov     ecx, esi
        shl     edi, cl
        xor     eax, eax
        cmp     esi, 32
        cmovb   eax, edi
        ret

as per https://gcc.godbolt.org/z/xsY3jqjMn. The respective IR does not trigger Undefined Behavior as demonstrated by https://alive2.llvm.org/ce/ so the optimized version of the x86 code should be:

f:
        mov     ecx, esi
        shl     edi, cl
        mov     eax, edi
        ret

Lucian

cc @nunoplopes

topperc commented 1 year ago

The shl instruction does not produce 0 for out of bounds shifts. It masks the shift amount with 0x1f. A shift by 32 is treated as a shift by 0.

nunoplopes commented 1 year ago

You're right. Sorry I induced him in error; I swapped x86's semantics with ARM's. Here's the full matrix: https://gcc.godbolt.org/z/r494q8h3M Interesting that the arm version has an 'and' but arm64 doesn't. I don't know if that's a bug or not.

Anyway, closing this. Sorry for the noise.

topperc commented 1 year ago

I suppose this could work for 8-bit and 16-bit shifts on x86. The 0x1f mask is used for those too. Historical artifact on 80286 transistor budget I assume. 8086 treated it as an 8-bit value and shifted by 1 position each cycle using the 8-bit value as a loop counter. The mask was added in 80286 to limit that.

llvmbot commented 1 year ago

@llvm/issue-subscribers-backend-x86