inefficient code generation of C bitfields

llvmbot commented 15 years ago


Bugzilla Link	5039
Version	trunk
OS	Linux
Attachments	test case
Reporter	LLVM Bugzilla Contributor
CC	@asl,@d0k,@lattner,@efriedma-quic

Extended Description

the attached test case produces on X86 this asm:

setidt: .Leh_func_begin1: pushq %rbp .Llabel1: movq %rsp, %rbp .Llabel2: shlq $40, %rdx movabsq $34084860461056, %rax andq %rdx, %rax shlq $45, %rcx orq %rax, %rcx shlq $32, %r8 movabsq $30064771072, %rax andq %r8, %rax movl %esi, %edx orl $2097152, %edx andq $2162687, %rdx orq %rax, %rdx orq %rcx, %rdx movabsq $140737488355328, %rax orq %rdx, %rax movabsq $280405532016639, %rcx andq %rax, %rcx movq %rsi, %rax shlq $32, %rax movabsq $-281474976710656, %rdx andq %rax, %rdx orq %rcx, %rdx movq %rdx, idt(%rip) shrq $32, %rsi movabsq $-4294967296, %rax andq idt+8(%rip), %rax orq %rsi, %rax movq %rax, idt+8(%rip) popq %rbp ret

while gcc generates much shorter:

setidt: .LFB2: andl $3, %ecx andl $31, %edx movw %si, idt(%rip) sall $5, %ecx shrq $16, %rsi andl $7, %r8d orl %ecx, %edx movw %si, idt+6(%rip) shrq $16, %rsi orl $-128, %edx movw $32, idt+2(%rip) movb %r8b, idt+4(%rip) movb %dl, idt+5(%rip) movl %esi, idt+8(%rip) ret

both is with -O2.

d0k commented 13 years ago

With r129990 we're down to

setidt: movq %rsi, %rax andq $-65536, %rax shlq $32, %rax movzwl %si, %edi orq %rax, %rdi andq $31, %rdx shlq $40, %rdx orq %rdi, %rdx andq $3, %rcx shlq $45, %rcx orq %rdx, %rcx andq $7, %r8 shlq $32, %r8 orq %rcx, %r8 movabsq $140737490452480, %rax orq %r8, %rax shrq $32, %rsi movq %rax, idt(%rip) movl %esi, idt+8(%rip) ret

Only 4 instructions longer than gcc's output (and probably slightly faster)

lattner commented 13 years ago

I coulda sworn that we already did this. You're right though, we should handle this late. I'd handle this at X86 ISel time by adding a pattern that matches (and (shl cst), cst2)) and produces two instructions. This should only trigger when the shl has one use.

It is possible to implement this in the .td file with lots of pattern fragments, but it would probably be cleanest to implement it in pure c++ code in X86IselDAGToDAG.cpp

-Chris

efriedma-quic commented 13 years ago

Current status:

The clang-generated code is 24 instructions; the gcc-generated code is 16 instructions.

It is possible to save 4 instructions from the clang-generated code simply by avoiding movabsq for constructs like the following:

long long a(int x) { return ((long long)x&0x1F)<<34; }

This ought to be simple, but I'm not sure how to implement this, given that we only want to push the "and" back inside the shift very late in isel.

The other possible savings is that when a bitfield is constructed in registers and stored into memory, and it is possible to split the construction along a byte boundary, it is possible to trade a shift+or for a store; the gcc-generated code has 6 stores (as opposed to 2 for the clang-generated code), so that completely accounts for the savings of gcc vs. clang.

llvmbot commented 13 years ago

the testcase now generates:

.Leh_func_begin0: pushq %rbp .Ltmp0: movq %rsp, %rbp .Ltmp1: movq %rsi, %rax shlq $32, %rax movabsq $-281474976710656, %rdi andq %rax, %rdi shlq $45, %rcx movabsq $105553116266496, %rax andq %rcx, %rax orq %rdi, %rax shlq $40, %rdx movabsq $34084860461056, %rcx andq %rdx, %rcx shlq $32, %r8 movabsq $30064771072, %rdx andq %r8, %rdx orq %rcx, %rdx movzwl %si, %ecx orq %rdx, %rcx orq %rax, %rcx movabsq $140737490452480, %rax orq %rcx, %rax movq %rax, idt(%rip) shrq $32, %rsi movl %esi, idt+8(%rip) popq %rbp ret

which is 5 instructions shorter (yay!) but still very inefficient

llvmbot commented 14 years ago

The test case from Eli now codegens to 1 instruction, probably thanks to Chris's narrowing work. The original test case still comes out pretty ugly with Clang.

llvmbot commented 14 years ago

This is related to bug 5591, or at least the first issue is. We will see what is left once that is resolved.

efriedma-quic commented 15 years ago

Missed optimization number 1: struct {long long a : 32, b : 32; } x; void a() { x.a = 1; }

gcc codegens this to 1 instruction, clang generates some really messy code. More generally, the clang frontend doesn't try to narrow the bitfield stores, and the backend doesn't know how.

Missed optimization number 2: In the original testcase, gcc figures out that one of the loads is dead; clang doesn't (I assume because it hits the recursion limit in instcombine).

Missed optimization number 3: Some x86-64 specific optimizations are missed, like transforming shlq+movabsq+andq to andq+shlq.

Missed optimization number 4: It looks like there's two or instructions with a constant which could be merged by reassociation.

lattner commented 15 years ago

I'll look at this at some point, crazy busy until after the devmtg.

llvmbot commented 15 years ago

it seems similar to the problems we have in bug 4216 http://llvm.org/bugs/show_bug.cgi?id=4216

bwendling commented 2 years ago

We now produce very good code (https://godbolt.org/z/q7Ex9srjn). It's 1 cycle more expensive than the GCC version.

  .text
  .file "example.c"
  .globl setidt # -- Begin function setidt
  .p2align 4, 0x90
  .type setidt,@function
setidt: # @setidt
  .cfi_startproc
# %bb.0:
  # kill: def $r8d killed $r8d def $r8
  # kill: def $ecx killed $ecx def $rcx
  # kill: def $edx killed $edx def $rdx
  movl idt+12(%rip), %eax
  movzwl %si, %edi
  andl $7, %r8d
  shlq $32, %r8
  orq %rdi, %r8
  shldq $32, %rsi, %rax
  andl $31, %edx
  shlq $40, %rdx
  orq %r8, %rdx
  andl $-65536, %esi # imm = 0xFFFF0000
  shlq $32, %rsi
  andl $3, %ecx
  shlq $45, %rcx
  orq %rdx, %rcx
  orq %rsi, %rcx
  movabsq $140737490452480, %rdx # imm = 0x800000200000
  orq %rcx, %rdx
  movq %rax, idt+8(%rip)
  movq %rdx, idt(%rip)
  retq
.Lfunc_end0:
  .size setidt, .Lfunc_end0-setidt
  .cfi_endproc
  # -- End function
  .type idt,@object # @idt
  .bss
  .globl idt
  .p2align 3
idt:
  .zero 16
  .size idt, 16

  .ident "clang version 15.0.0 (https://github.com/llvm/llvm-project.git 3be9e0ba972cc3486971f745a606e2c44472b655)"
  .section ".note.GNU-stack","",@progbits
  .addrsig

llvm / llvm-project

inefficient code generation of C bitfields #5411

Extended Description