myroslavrubanets opened this issue 4 months ago
Thank you very much for the detailed report.
I tried to compile the code with ClangCL using the -march=znver4 option, and it seems it actually optimizes things just fine. For instance, the following is the codegen for alg64::lemire_branchless:
00007FF6DF092030 push rsi
00007FF6DF092031 push rdi
00007FF6DF092032 mov r8,2AF31DC4611874h
00007FF6DF09203C mulx rax,r9,r8
00007FF6DF092041 test ax,ax
00007FF6DF092044 sete r10b
00007FF6DF092048 cmp r9,r8
00007FF6DF09204B mov r8,68DB8BAC710CB2Ah
00007FF6DF092055 setb r9b
00007FF6DF092059 shr rax,10h
00007FF6DF09205D and r9b,r10b
00007FF6DF092060 test r9b,r9b
00007FF6DF092063 cmove rax,rdx
00007FF6DF092067 mov rdx,rax
00007FF6DF09206A mulx rdx,r10,r8
00007FF6DF09206F test dl,dl
00007FF6DF092071 sete r11b
00007FF6DF092075 cmp r10,r8
00007FF6DF092078 setb r10b
00007FF6DF09207C shr rdx,8
00007FF6DF092080 and r10b,r11b
00007FF6DF092083 mov r11,28F5C28F5C28F5Dh
00007FF6DF09208D test r10b,r10b
00007FF6DF092090 cmove rdx,rax
00007FF6DF092094 mov rax,rcx
00007FF6DF092097 xor ecx,ecx
00007FF6DF092099 mulx r8,rsi,r11
00007FF6DF09209E cmp rsi,r11
00007FF6DF0920A1 mov r11,199999999999999Ah
00007FF6DF0920AB cmovae r8,rdx
00007FF6DF0920AF setb cl
00007FF6DF0920B2 xor edi,edi
00007FF6DF0920B4 mov rdx,r8
00007FF6DF0920B7 mulx rsi,rdx,r11
00007FF6DF0920BC cmp rdx,r11
00007FF6DF0920BF movzx edx,r9b
00007FF6DF0920C3 cmovae rsi,r8
00007FF6DF0920C7 movzx r8d,r10b
00007FF6DF0920CB setb dil
00007FF6DF0920CF shl r8,2
00007FF6DF0920D3 mov qword ptr [rax],rsi
00007FF6DF0920D6 lea rdx,[r8+rdx*8]
00007FF6DF0920DA lea rcx,[rdx+rcx*2]
00007FF6DF0920DE or rdi,rcx
00007FF6DF0920E1 mov qword ptr [rax+8],rdi
00007FF6DF0920E5 pop rdi
00007FF6DF0920E6 pop rsi
00007FF6DF0920E7 ret
So... I guess something other than the platform CPU may be causing this problem?
The default ClangCL release config produces the following bench results for 64 bits:
It is clear that all Lemire methods suffer from some kind of codegen issue.
Pausing on a breakpoint in the Lemire method gives the following disassembly for
auto const r = wuint::umul128(n, UINT64_C(1844674407370955162));
which shows that wuint::umul128 is not inlined. Adding
[[clang::always_inline]]
to
uint128 umul128(std::uint64_t x, std::uint64_t y) noexcept {
produces the following results. The same code now produces this disassembly:
Which is fine by itself, but probably not what was desired if the goal was to get the MULX instruction used.
By adding
target_compile_options(rtz_benchmark_exe PRIVATE "/arch:AVX2")
in CMake we can get MULX, with the following disassembly:
All results were obtained on an AMD Ryzen 9 7900 running Win10 Pro and VS2022 Community with ClangCL v17.0.3.
I expect that chasing such compiler differences is not really useful, and this feedback can be resolved as "not a defect" right away. I just found it interesting enough to share.