jk-jeon / rtz_benchmark

Benchmark several different trailing zero removal algorithms
Boost Software License 1.0

Codegen with ClangCL 17 on Windows for Lemire 64 bit benchmarks is suboptimal #1

Open myroslavrubanets opened 4 months ago

myroslavrubanets commented 4 months ago

The default ClangCL release config produces the following benchmark results for the 64-bit case:

                               Null (baseline): 0.931991ns
                                         Naive: 9.23445ns
                           Granlund-Montgomery: 7.74372ns
                                        Lemire: 10.0473ns
               Generalized Granlund-Montgomery: 7.79839ns
                                     Naive 2-1: 7.36115ns
                       Granlund-Montgomery 2-1: 6.35962ns
                                    Lemire 2-1: 11.4426ns
           Generalized Granlund-Montgomery 2-1: 6.19803ns
                                   Naive 8-2-1: 7.93523ns
                     Granlund-Montgomery 8-2-1: 7.12279ns
                                  Lemire 8-2-1: 11.8961ns
         Generalized Granlund-Montgomery 8-2-1: 7.25772ns
                              Naive branchless: 3.14772ns
                Granlund-Montgomery branchless: 2.25322ns
                             Lemire branchless: 13.4392ns
    Generalized Granlund-Montgomery branchless: 2.35564ns

It is clear that all Lemire methods suffer from some kind of codegen issue.
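For context, the Lemire variants hinge on a 64x64 -> 128-bit multiplication: the constant below is ceil(2^64 / 10), so the low half of the product acts as a divisibility-by-10 test and the high half gives the quotient. The following is only a hedged illustration of that idea, not the repository's actual code; the helper name and the stated input range are my own, and the argument only holds for n below roughly 2^62 (values of up to 18 decimal digits).

    #include <cstdint>

    // Hedged illustration of a Lemire-style step (not the repository's exact code).
    // magic = ceil(2^64 / 10); when the low 64 bits of n * magic are below magic,
    // n is divisible by 10 and the high 64 bits are exactly n / 10. This reasoning
    // is valid for n below roughly 2^62, i.e. values of up to 18 decimal digits.
    inline bool remove_one_trailing_zero(std::uint64_t& n) noexcept {
        constexpr std::uint64_t magic = UINT64_C(1844674407370955162);
        auto const p = static_cast<unsigned __int128>(n) * magic; // 64x64 -> 128-bit
        if (static_cast<std::uint64_t>(p) < magic) {              // n % 10 == 0
            n = static_cast<std::uint64_t>(p >> 64);              // n / 10
            return true;
        }
        return false;
    }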

Pausing at a breakpoint in the Lemire benchmark and inspecting the disassembly for auto const r = wuint::umul128(n, UINT64_C(1844674407370955162)); shows that wuint::umul128 is not inlined.
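For reference, the change described next amounts to placing the attribute on the helper's definition. A hedged sketch with a stand-in uint128 type follows; the repository's actual definitions may differ:

    #include <cstdint>

    struct uint128 {      // stand-in for the repository's 128-bit result type
        std::uint64_t high;
        std::uint64_t low;
    };

    // Forcing the 64x64 -> 128-bit multiply to be inlined. Clang (including
    // ClangCL) accepts the [[clang::always_inline]] spelling on a definition.
    [[clang::always_inline]] inline uint128 umul128(std::uint64_t x, std::uint64_t y) noexcept {
        auto const p = static_cast<unsigned __int128>(x) * y;  // Clang's 128-bit extension
        return { static_cast<std::uint64_t>(p >> 64), static_cast<std::uint64_t>(p) };
    }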

Adding [[clang::always_inline]] to uint128 umul128(std::uint64_t x, std::uint64_t y) noexcept { produces the following results:

                               Null (baseline): 1.10466ns
                                         Naive: 9.364ns
                           Granlund-Montgomery: 7.90428ns
                                        Lemire: 7.72202ns
               Generalized Granlund-Montgomery: 7.76073ns
                                     Naive 2-1: 7.25145ns
                       Granlund-Montgomery 2-1: 6.30838ns
                                    Lemire 2-1: 6.18285ns
           Generalized Granlund-Montgomery 2-1: 6.03679ns
                                   Naive 8-2-1: 7.73168ns
                     Granlund-Montgomery 8-2-1: 7.1402ns
                                  Lemire 8-2-1: 8.00677ns
         Generalized Granlund-Montgomery 8-2-1: 6.97643ns
                              Naive branchless: 3.00643ns
                Granlund-Montgomery branchless: 2.14394ns
                             Lemire branchless: 2.7367ns
    Generalized Granlund-Montgomery branchless: 2.22958ns

The same code now produces the following disassembly:

    mov         r8,rdx
    mov         r9,199999999999999Ah
    mov         rax,rdx
    mul         rax,r9
    xor         r10d,r10d

This is fine by itself, but it was probably not what was desired if the goal was to get the MULX instruction used.

By adding target_compile_options(rtz_benchmark_exe PRIVATE "/arch:AVX2") in CMake we can get MULX:

                           Null (baseline): 0.92179ns
                                     Naive: 9.17495ns
                       Granlund-Montgomery: 7.73457ns
                                    Lemire: 7.67657ns
           Generalized Granlund-Montgomery: 7.58593ns
                                 Naive 2-1: 7.17895ns
                   Granlund-Montgomery 2-1: 6.16283ns
                                Lemire 2-1: 6.23226ns
       Generalized Granlund-Montgomery 2-1: 5.84713ns
                               Naive 8-2-1: 7.62849ns
                 Granlund-Montgomery 8-2-1: 7.10212ns
                              Lemire 8-2-1: 7.74391ns
     Generalized Granlund-Montgomery 8-2-1: 6.6372ns
                          Naive branchless: 2.97132ns
            Granlund-Montgomery branchless: 2.13573ns
                         Lemire branchless: 2.66765ns
Generalized Granlund-Montgomery branchless: 2.22146ns

with disassembly:

    mov         rax,rcx  
    mov         r8,199999999999999Ah  
    mulx        r9,rcx,r8  

All results were obtained on an AMD Ryzen 9 7900 (Zen 4) running Windows 10 Pro and VS2022 Community with ClangCL 17.0.3.

I expect that chasing such compiler differences is not really worthwhile, and this feedback can be closed as not a defect right away; I just found it interesting enough to share.

jk-jeon commented 4 months ago

Thank you very much for the detailed report.

I tried compiling the code with my ClangCL using the -march=znver4 option, and it seems it actually optimizes things just fine. For instance, the following is the codegen for alg64::lemire_branchless:

00007FF6DF092030  push        rsi  
00007FF6DF092031  push        rdi  
00007FF6DF092032  mov         r8,2AF31DC4611874h  
00007FF6DF09203C  mulx        rax,r9,r8  
00007FF6DF092041  test        ax,ax  
00007FF6DF092044  sete        r10b  
00007FF6DF092048  cmp         r9,r8  
00007FF6DF09204B  mov         r8,68DB8BAC710CB2Ah  
00007FF6DF092055  setb        r9b  
00007FF6DF092059  shr         rax,10h  
00007FF6DF09205D  and         r9b,r10b  
00007FF6DF092060  test        r9b,r9b  
00007FF6DF092063  cmove       rax,rdx  
00007FF6DF092067  mov         rdx,rax  
00007FF6DF09206A  mulx        rdx,r10,r8  
00007FF6DF09206F  test        dl,dl  
00007FF6DF092071  sete        r11b  
00007FF6DF092075  cmp         r10,r8  
00007FF6DF092078  setb        r10b  
00007FF6DF09207C  shr         rdx,8  
00007FF6DF092080  and         r10b,r11b  
00007FF6DF092083  mov         r11,28F5C28F5C28F5Dh  
00007FF6DF09208D  test        r10b,r10b  
00007FF6DF092090  cmove       rdx,rax  
00007FF6DF092094  mov         rax,rcx  
00007FF6DF092097  xor         ecx,ecx  
00007FF6DF092099  mulx        r8,rsi,r11  
00007FF6DF09209E  cmp         rsi,r11  
00007FF6DF0920A1  mov         r11,199999999999999Ah  
00007FF6DF0920AB  cmovae      r8,rdx  
00007FF6DF0920AF  setb        cl  
00007FF6DF0920B2  xor         edi,edi  
00007FF6DF0920B4  mov         rdx,r8  
00007FF6DF0920B7  mulx        rsi,rdx,r11  
00007FF6DF0920BC  cmp         rdx,r11  
00007FF6DF0920BF  movzx       edx,r9b  
00007FF6DF0920C3  cmovae      rsi,r8  
00007FF6DF0920C7  movzx       r8d,r10b  
00007FF6DF0920CB  setb        dil  
00007FF6DF0920CF  shl         r8,2  
00007FF6DF0920D3  mov         qword ptr [rax],rsi  
00007FF6DF0920D6  lea         rdx,[r8+rdx*8]  
00007FF6DF0920DA  lea         rcx,[rdx+rcx*2]  
00007FF6DF0920DE  or          rdi,rcx  
00007FF6DF0920E1  mov         qword ptr [rax+8],rdi  
00007FF6DF0920E5  pop         rdi  
00007FF6DF0920E6  pop         rsi  
00007FF6DF0920E7  ret  

So... I guess maybe something other than the platform CPU is causing this problem?