Missed devirtualization of hot function call makes Binary-Trees C++ benchmark 60% slower than on GCC (Benchmarks Game)

yurai007 commented 2 years ago

Problem description

Consider following C++ snippet: https://godbolt.org/z/7fxz1rhbo. The most important part of example is recursive function make doing Node allocation through function allocate before calling itself:

void* mem = store.allocate(sizeof(Node), alignof(Node));

In allocate function there is call to do_allocate which is virtual:

void* allocate(size_t bytes, size_t alignment = alignof(max_align_t)) {
        return ::operator new(bytes, do_allocate(bytes, alignment));
}

In OK case scenario - binary produced by GCC, do_allocate is devirtualized and then inlined together with allocate. Finally make function contains only direct calls to overriden function - do_allocate_impl without recursion:

make(int, monotonic_buffer_resource&) [clone .constprop.1]:
        push    r12
        push    rbp
        mov     rbp, rdi
        sub     rsp, 8
        mov     rsi, QWORD PTR [rdi+8]
        call    monotonic_buffer_resource::do_allocate_impl(void*)
        mov     rsi, QWORD PTR [rbp+8]
    ...

Unfortunately assembly produced by Clang is much worse. In make output do_allocate is not devirtualized to do_allocate_impl, indirect call through vtable can be seen:

make(int, monotonic_buffer_resource&):  # @make(int, monotonic_buffer_resource&)
        push    rbp
        push    r14
        push    rbx
        mov     rbx, rsi
        mov     ebp, edi
        mov     rax, qword ptr [rsi]
        mov     esi, 16
        mov     edx, 8
        mov     rdi, rbx
        call    qword ptr [rax + 16]

I'm not 100% sure but I believe that missing devirtualization opportunity leads to preserving recursion. In OK case we could see that GCC was able to get rid of make recursion. However it's not the case for Clang, make still calls itself:

        mov     rsi, rbx
        call    make(int, monotonic_buffer_resource&)
        mov     qword ptr [r14], rax
        mov     edi, ebp
        mov     rsi, rbx
        call    make(int, monotonic_buffer_resource&)

Impact on Benchmarks Game Binary-Trees benchmark

Allocate function call is hotspot in one of C++ Benchmarks Game programs - Binary-Trees (currently top one): https://benchmarksgame-team.pages.debian.net/benchmarksgame/program/binarytrees-gpp-7.html You can easily spot out relevant difference between compilers output (make function) in benchmark assembly: https://godbolt.org/z/813nMn7Pd

After building (using exact command from: https://benchmarksgame-team.pages.debian.net/benchmarksgame/program/binarytrees-gpp-7.html) and running binarytrees-gpp-7 benchmark (in my case it's x86_64 Skylake box), it's clear that Clang binary is ~60% slower than GCC binary:

[yurai@archlinux release]$ time ./binarytrees-clang 21
stretch tree of depth 22     check: 8388607
...
long lived tree of depth 21  check: 4194303

real    0m4.646s

[yurai@archlinux release]$ time ./binarytrees-g++ 21
stretch tree of depth 22     check: 8388607
...
long lived tree of depth 21  check: 4194303

real    0m2.915s

Potential root cause

As far as I can tell missed devirtualization is connected to lack of overriden function emission in CodeGen, just after parsing AST in frontend. It can be narrowed down to CodeGen::CodeGenModule::EmitTopLevelDecl. When ran for virtual do_allocate it seems that its callee - CodeGenModule::EmitGlobalDefinition doesn't emit any thunks and later ScalarExprEmitter::VisitCallExpr doesn't visit overriden do_allocate_impl.

Workaround attempts

I couldn't find any easy way in persuading Clang to better code generation (in particular in forcing do_allocate devirtualization) for both original example and Binary-Trees benchmark. Using final specifier doesn't change anything. Enabling more optimizations via -Ofast/-flto/-flto=thin doesn't help as well which make sense given that issue probably has nothing to do with middle-end. Maybe the only way is more extensive code change, but it's something that I wanted to avoid.

yurai007 commented 2 years ago

Small update: After playing a bit with original snippet and further reduction to https://godbolt.org/z/4eeEGeYWG found out that lack of ~monotonic_buffer_resource definition is crucial here. After making it default Clang is able to devirtualize do_allocate_impl call. Interesting.

davidbolvansky commented 1 year ago

Fails here https://github.com/llvm/llvm-project/blob/main/clang/lib/AST/DeclCXX.cpp#L2294

  // If that method is pure virtual, we can't devirtualize. If this code is
  // reached, the result would be UB, not a direct call to the derived class
  // function, and we can't assume the derived class function is defined.

CXXMethodDecl 0x555b40e7f1c0 <example.cpp:9:5, col:35> col:19 referenced do_allocate 'void *()' virtual pure

After making it default Clang is able to devirtualize do_allocate_impl call. Interesting.

Clang seems not, LLVM does it.

llvm / llvm-project

Missed devirtualization of hot function call makes Binary-Trees C++ benchmark 60% slower than on GCC (Benchmarks Game) #56547