Open yurai007 opened 2 years ago
Small update: after playing a bit with the original snippet and reducing it further to https://godbolt.org/z/4eeEGeYWG, I found that the lack of a ~monotonic_buffer_resource
definition is crucial here. After making it defaulted, Clang is able to devirtualize the do_allocate_impl
call. Interesting.
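To make the two variants concrete, here is a minimal sketch of what such a reduction might look like. Only the names `monotonic_buffer_resource`, `do_allocate` and `do_allocate_impl` come from this thread; the base class `memory_resource`, the helper `allocate_one`, the zero-argument signature and the one-byte `slot_` storage are assumptions made for illustration, not the contents of the Compiler Explorer link.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical base class: a non-virtual entry point forwarding to a pure virtual hook.
struct memory_resource {
    virtual ~memory_resource() = default;
    void* allocate() { return do_allocate(); }
private:
    virtual void* do_allocate() = 0;
};

struct monotonic_buffer_resource : memory_resource {
    // Variant A (reported above as blocking devirtualization in Clang):
    // the destructor is only declared, with no definition visible here.
    //   ~monotonic_buffer_resource() override;

    // Variant B (reported above as allowing devirtualization): defaulted.
    ~monotonic_buffer_resource() override = default;

    // Non-virtual implementation that the override forwards to (assumed layout).
    void* do_allocate_impl() { return &slot_; }
private:
    void* do_allocate() override { return do_allocate_impl(); }
    char slot_ = 0;  // placeholder storage so the sketch returns a valid pointer
};

// The call site of interest: ideally this becomes a direct call to do_allocate_impl.
void* allocate_one(monotonic_buffer_resource& r) { return r.allocate(); }
```

Variant A is left as a comment because a program that destroys such an object would not link without a definition somewhere; comparing the generated assembly of both variants is what matters here, not running them.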
It fails here: https://github.com/llvm/llvm-project/blob/main/clang/lib/AST/DeclCXX.cpp#L2294
// If that method is pure virtual, we can't devirtualize. If this code is
// reached, the result would be UB, not a direct call to the derived class
// function, and we can't assume the derived class function is defined.
CXXMethodDecl 0x555b40e7f1c0 <example.cpp:9:5, col:35> col:19 referenced do_allocate 'void *()' virtual pure
> After making it default Clang is able to devirtualize do_allocate_impl call. Interesting.
Clang itself does not seem to; LLVM does it.
Problem description
Consider the following C++ snippet: https://godbolt.org/z/7fxz1rhbo. The most important part of the example is the recursive function `make`, which allocates a Node through the function `allocate` before calling itself.
In the `allocate` function there is a call to `do_allocate`, which is virtual.
In the OK scenario (the binary produced by GCC), `do_allocate` is devirtualized and then inlined together with `allocate`. In the end, the `make` function contains only direct calls to the overridden function `do_allocate_impl`, without recursion.
Unfortunately, the assembly produced by Clang is much worse. In the `make` output, `do_allocate` is not devirtualized to `do_allocate_impl`; an indirect call through the `vtable` can be seen.
I'm not 100% sure, but I believe that the missed devirtualization opportunity leads to the recursion being preserved. In the OK case we could see that GCC was able to get rid of the `make` recursion. However, that is not the case for Clang: `make` still calls itself.
Impact on Benchmarks Game Binary-Trees benchmark
The `allocate` function call is a hotspot in one of the C++ Benchmarks Game programs, Binary-Trees (currently the top one): https://benchmarksgame-team.pages.debian.net/benchmarksgame/program/binarytrees-gpp-7.html You can easily spot the relevant difference between the compilers' output (the `make` function) in the benchmark assembly: https://godbolt.org/z/813nMn7Pd
After building (using the exact command from: https://benchmarksgame-team.pages.debian.net/benchmarksgame/program/binarytrees-gpp-7.html) and running the binarytrees-gpp-7 benchmark (in my case on an x86_64 Skylake box), it is clear that the Clang binary is ~60% slower than the GCC binary.
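For readers who cannot open the Compiler Explorer links, the call chain described under "Problem description" (`make` calling `allocate`, which calls the virtual `do_allocate`, which ends up in `do_allocate_impl`) can be sketched roughly as follows. This is not the original snippet: the `memory_resource` base class, the bump-pointer buffer, the `Node` layout and the `count` helper are assumptions chosen to keep the example small and self-contained.

```cpp
#include <cassert>
#include <cstddef>
#include <new>

// Hypothetical base class with a non-virtual allocate() forwarding to a virtual hook.
class memory_resource {
public:
    virtual ~memory_resource() = default;
    void* allocate(std::size_t bytes) { return do_allocate(bytes); }
private:
    virtual void* do_allocate(std::size_t bytes) = 0;
};

// Bump allocator over a caller-supplied buffer (assumed layout, no alignment handling).
class monotonic_buffer_resource : public memory_resource {
public:
    monotonic_buffer_resource(void* buffer, std::size_t size)
        : cur_(static_cast<unsigned char*>(buffer)), end_(cur_ + size) {}

    // Non-virtual implementation; returns nullptr when the buffer is exhausted.
    void* do_allocate_impl(std::size_t bytes) {
        if (static_cast<std::size_t>(end_ - cur_) < bytes) return nullptr;
        void* p = cur_;
        cur_ += bytes;
        return p;
    }
private:
    void* do_allocate(std::size_t bytes) override { return do_allocate_impl(bytes); }
    unsigned char* cur_;
    unsigned char* end_;
};

struct Node {
    Node* left;
    Node* right;
};

// Recursive builder: allocates one Node through allocate(), then recurses.
Node* make(int depth, monotonic_buffer_resource& store) {
    void* raw = store.allocate(sizeof(Node));  // virtual call hidden behind allocate()
    if (raw == nullptr) return nullptr;
    Node* node = new (raw) Node{nullptr, nullptr};
    if (depth > 0) {
        node->left = make(depth - 1, store);
        node->right = make(depth - 1, store);
    }
    return node;
}

// Helper used only to check the tree shape.
int count(const Node* n) { return n ? 1 + count(n->left) + count(n->right) : 0; }
```

The interesting question for a compiler is whether the call inside `allocate` can be turned into a direct call to `do_allocate_impl` once `allocate` is inlined into `make`; the sketch only reproduces the shape of the code, so the assembly it produces may differ from the linked examples.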
Potential root cause
As far as I can tell, the missed devirtualization is connected to the overridden function not being emitted in `CodeGen`, just after the AST is parsed in the frontend. It can be narrowed down to `CodeGen::CodeGenModule::EmitTopLevelDecl`. When run for the virtual `do_allocate`, it seems that its callee, `CodeGenModule::EmitGlobalDefinition`, doesn't emit any thunks, and later `ScalarExprEmitter::VisitCallExpr` doesn't visit the overridden `do_allocate_impl`.
Workaround attempts
I couldn't find any easy way to persuade Clang to generate better code (in particular, to force `do_allocate` devirtualization) for either the original example or the Binary-Trees benchmark. Using the `final` specifier doesn't change anything. Enabling more optimizations via `-Ofast`/`-flto`/`-flto=thin` doesn't help either, which makes sense given that the issue probably has nothing to do with the middle-end. Maybe the only way is a more extensive code change, but that is something I wanted to avoid.
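For reference, the `final` attempt could look something like the sketch below. The class layout, the fixed 256-byte internal buffer and the helper `allocate_from` are hypothetical; only the identifier names and the idea of adding `final` come from this report, and according to the report neither placement of `final` changed Clang's output.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical base class with a pure virtual allocation hook.
class memory_resource {
public:
    virtual ~memory_resource() = default;
    void* allocate(std::size_t bytes) { return do_allocate(bytes); }
private:
    virtual void* do_allocate(std::size_t bytes) = 0;
};

// Attempt 1: mark the whole class final, so no further overrider can exist.
class monotonic_buffer_resource final : public memory_resource {
public:
    // Non-virtual bump allocation; returns nullptr when the buffer is exhausted.
    void* do_allocate_impl(std::size_t bytes) {
        if (bytes > sizeof(buffer_) - used_) return nullptr;
        void* p = buffer_ + used_;
        used_ += bytes;
        return p;
    }
private:
    // Attempt 2: mark the overrider itself final (redundant with a final class,
    // shown only to illustrate both placements).
    void* do_allocate(std::size_t bytes) final { return do_allocate_impl(bytes); }

    alignas(std::max_align_t) unsigned char buffer_[256];
    std::size_t used_ = 0;
};

// Call site where a direct call to do_allocate_impl would be the desired outcome.
void* allocate_from(monotonic_buffer_resource& store, std::size_t bytes) {
    return store.allocate(bytes);
}
```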