llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org
Other
29.37k stars 12.14k forks source link

Loosing optimization for X % C == 0 #49678

Open cassioneri opened 3 years ago

cassioneri commented 3 years ago
Bugzilla Link 50334
Version 12.0
OS Linux
CC @LebedevRI,@RKSimon,@rotateright

Extended Description

Since version 9, the code generated for say, y % 400 == 0, uses the modular inverse optimisation which performs just one multiplication (as opposed to two as it used to do in (y/400)*400 == y).

However, in a reasonable implementation of is_leap_year the code generated looses the optimisation and falls back to the old implementation. Here are C++ and corresponding assembly. (Notice that for y % 100 it has no issue in applying the optimisation.)

auto is_leap_year(unsigned y) { return y % 100 != 0 ? y % 4 == 0 : y % 400 == 0; }

is_leap_year(unsigned int): # @​is_leap_year(unsigned int) imull $-1030792151, %edi, %eax # imm = 0xC28F5C29 rorl $2, %eax cmpl $42949673, %eax # imm = 0x28F5C29 jb .LBB2_2 andl $3, %edi testl %edi, %edi sete %al retq .LBB2_2: movl %edi, %eax imulq $1374389535, %rax, %rax # imm = 0x51EB851F shrq $39, %rax imull $400, %eax, %eax # imm = 0x190 # <--- Multiplication by 400 subl %eax, %edi testl %edi, %edi sete %al retq

Worse, when y has type type it doesn't even apply the old Granlund and Montgomery optimisation and resorts to an idiv instruction. It seems it prefers avoiding branching (using cmov). I doubt this is a good idea.

auto is_leap_year(int y) { return y % 100 != 0 ? y % 4 == 0 : y % 400 == 0; }

is_leap_year(int): # @​is_leap_year(int) movl %edi, %eax imull $-1030792151, %edi, %ecx # imm = 0xC28F5C29 addl $85899344, %ecx # imm = 0x51EB850 rorl $2, %ecx cmpl $42949673, %ecx # imm = 0x28F5C29 movl $400, %ecx # imm = 0x190 movl $4, %esi cmovbl %ecx, %esi cltd idivl %esi # <-- Division by %esi which contains either 4 or 400 testl %edx, %edx sete %al retq

See the above and other relevant code here: https://godbolt.org/z/zdGPTT83q

You might also be interested in these (non scientific) benchmarks: https://quick-bench.com/q/zQp1vXKKpWFvH0Et7nxG6MHnNfI

LebedevRI commented 3 years ago

There is a number of problems here:

  1. we can't sink "and %x, C1"+"urem %x, C2", even thought we could
  2. We then have a icmp eq PHI, 0, so naturally, we can't deal with it as icmp eq (urem %x, c), 0. I guess we need to hoist said icmp into those blocks
  3. We could also end up with icmp eq (rem %x, (select %cond, C1, C2)), 0`. Again, we can't deal with that.