128 bit division generates __udivti3 and __umodti3 instead of calling __udivmodti4 once

llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.

http://llvm.org

128 bit division generates __udivti3 and __umodti3 instead of calling __udivmodti4 once #46350

Open danlark1 opened 4 years ago

danlark1 commented 4 years ago


Bugzilla Link	47006
Version	trunk
OS	Linux
CC	@topperc,@RKSimon,@nikic,@rotateright

Extended Description

128 bit division generates __udivti3 and __umodti3 instead of calling __udivmodti4 once

This happens because of the DivRemPairs pass and the lack of support for the combined div/rem libcall in the backend.

    ; Unsigned 128-bit division
    define i128 @udiv128(i128 %a, i128 %b) {
      %quot = udiv i128 %a, %b
      %rem = urem i128 %a, %b
      %sum = add i128 %quot, %rem
      ret i128 %sum
    }

https://gcc.godbolt.org/z/PorhMz

This calls __udivti3 (and __umodti3) on LP64, but libgcc and compiler-rt have __udivmodti4, which computes the quotient and the remainder at the same time. This particularly hurts x86, where the divq instruction is available. Other backends can benefit from this too.
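For readers who prefer source over IR, here is a minimal C++ sketch of a function that should produce IR of the same shape as @udiv128 above. It assumes a compiler that provides the `unsigned __int128` extension (GCC or Clang on a 64-bit target); the `u128` alias is just a name chosen for this illustration, and the libcall names in the comments describe the behavior reported in this issue, not a guarantee for every target.

```cpp
// unsigned __int128 is a GCC/Clang extension, not standard C++.
using u128 = unsigned __int128;

// Same computation as the @udiv128 IR function: quotient plus remainder.
u128 udiv128(u128 a, u128 b) {
  u128 quot = a / b;  // reported to lower to a __udivti3 call
  u128 rem = a % b;   // reported to lower to a separate __umodti3 call
  return quot + rem;
}
```

For example, `udiv128(100, 7)` evaluates to 14 + 2 = 16; the point of the report is that both operators walk through a full 128-bit division routine instead of sharing one.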

danlark1 commented 4 years ago

Also, div/rem pairs are combined in DAGCombiner::useDivRem(SDNode *Node), but the check

    if (!TLI.isTypeLegal(VT) && !TLI.isOperationCustom(DivRemOpc, VT))
      return SDValue();

bails out for 128 bit integers, because the type is not legal there.

danlark1 commented 4 years ago

I believe we currently don't recognize __udivmodti4 anywhere; in RuntimeLibcalls.def the DIVREM libcalls have no names assigned at all:

    HANDLE_LIBCALL(SDIVREM_I8, nullptr)
    HANDLE_LIBCALL(SDIVREM_I16, nullptr)
    HANDLE_LIBCALL(SDIVREM_I32, nullptr)
    HANDLE_LIBCALL(SDIVREM_I64, nullptr)
    HANDLE_LIBCALL(SDIVREM_I128, nullptr)
    HANDLE_LIBCALL(UDIVREM_I8, nullptr)
    HANDLE_LIBCALL(UDIVREM_I16, nullptr)
    HANDLE_LIBCALL(UDIVREM_I32, nullptr)
    HANDLE_LIBCALL(UDIVREM_I64, nullptr)
    HANDLE_LIBCALL(UDIVREM_I128, nullptr)

__udivmodti4 should be present on every LP64 platform, I believe.
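To make the single-call alternative concrete, here is a hedged sketch that calls __udivmodti4 directly from C++. The declaration follows the signature used by compiler-rt and libgcc (quotient returned, remainder stored through the pointer). The `u128` alias, the `DivMod128` struct and the `udivmod128` wrapper are hypothetical names for this example, and calling a runtime-library helper by hand like this is only an illustration that assumes the symbol is linkable on the target, not a supported interface.

```cpp
// unsigned __int128 is a GCC/Clang extension, not standard C++.
using u128 = unsigned __int128;

// Runtime helper from libgcc / compiler-rt: returns a / b and stores
// a % b through rem. Assumed to be available at link time.
extern "C" u128 __udivmodti4(u128 a, u128 b, u128 *rem);

// Hypothetical result type for this sketch.
struct DivMod128 {
  u128 quot;
  u128 rem;
};

// One library call yields both results.
DivMod128 udivmod128(u128 a, u128 b) {
  DivMod128 r;
  r.quot = __udivmodti4(a, b, &r.rem);
  return r;
}
```

This is the call the backend would ideally emit when a udiv and a urem share the same 128-bit operands.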

rotateright commented 4 years ago

The DivRemPairs pass turns the IR into this:

    define i128 @udiv128(i128 %a, i128 %b) {
      %a.frozen = freeze i128 %a
      %b.frozen = freeze i128 %b
      %quot = udiv i128 %a.frozen, %b.frozen
      %1 = mul i128 %quot, %b.frozen
      %rem.decomposed = sub i128 %a.frozen, %1
      %sum = add i128 %rem.decomposed, %quot
      ret i128 %sum
    }

That's based on the TTI call:

    bool X86TTIImpl::hasDivRemOp(Type *DataType, bool IsSigned)

...returning false for the 128-bit type.

But even if I hack that to return 'true', I see calls:

    callq __divti3
    callq __modti3

Where in optimization do we recognize that the target supports "__udivmodti4" and convert to that call?
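As a rough C++ rendering of the DivRemPairs output quoted earlier in this comment, the remainder is recomputed from the quotient with a multiply and a subtract, so only one division remains. This is a sketch, assuming the `unsigned __int128` extension; `u128` and `udiv128_decomposed` are names made up for the illustration, and the `freeze` instructions have no C++ counterpart and are omitted.

```cpp
// unsigned __int128 is a GCC/Clang extension, not standard C++.
using u128 = unsigned __int128;

// Mirrors the decomposed IR: rem = a - (a / b) * b.
u128 udiv128_decomposed(u128 a, u128 b) {
  u128 quot = a / b;         // the only division left
  u128 prod = quot * b;      // corresponds to %1 = mul
  u128 rem = a - prod;       // corresponds to %rem.decomposed = sub
  return rem + quot;
}
```

The result matches quotient plus remainder (for instance 16 for inputs 100 and 7), but the remainder now costs a 128-bit multiply instead of coming for free from a combined div/rem call.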
