llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org
Other
27.92k stars 11.52k forks source link

Missed optimization: Reverted modification of a global/thread-local that need not be visible to any external calls not optimized out #44021

Open c8cd4046-bf2f-4470-bb42-add0a3666ec0 opened 4 years ago

c8cd4046-bf2f-4470-bb42-add0a3666ec0 commented 4 years ago
Bugzilla Link 44676
Version unspecified
OS Linux
CC @efriedma-quic,@fhahn,@jdoerfert,@duck-37,@LebedevRI,@Ralender

Extended Description

Gcc can very nicely optimize out modifications to global/thread objects if the new value is only used locally (no opaque calls made) and the modification is then reverted.

For example, for:

//_Thread_local //thread local make's clangs code better, but still not optimal
_Bool do_log;

void errlog(char const*);

static inline _Bool sadd(int A, int B, int *R)
{
    if(__builtin_add_overflow(A,B,R)){
        if (do_log) errlog("overflow");
        return 1;
    }
    return 0;
}

_Bool sadd_nolog(int A,int B, int *R)
{
    _Bool r;
    _Bool old_log_settings=do_log; do_log=0;
    r = sadd(A,B,R);
    do_log=old_log_settings;
    return r;
}

https://gcc.godbolt.org/z/UL79D3

gcc -Os generates an 8-byte function on x86-64 while clang -Os generates one that's 64 bytes large.

I don't know how difficult it is to implement such an optimization, but it would be convenient if it could be an optimization that could be counted upon.

efriedma-quic commented 4 years ago

Apparently "opt -dse" partially optimizes that testcase, but "opt -dse -dse" eliminates the load. I guess DSE has pass-ordering issues relative to itself.

9ffb4415-2ba4-4316-b043-121cc6098a61 commented 4 years ago

Here's another interesting testcase:

///////////////////////// NOT OPTIMIZED @​do_log = dso_local local_unnamed_addr global i8 0, align 1 define dso_local zeroext i1 @​sadd_nolog(i32 %0, i32 %1, i32 nocapture %2) local_unnamed_addr { %4 = load i8, i8 @​do_log, align 1 store i8 0, i8 @​do_log, align 1 store i32 0, i32 %2, align 4 store i8 %4, i8* @​do_log, align 1 ret i1 1 } ////////////////////////

///////////////////////// OPTIMIZED @​do_log = dso_local local_unnamed_addr global i8 0, align 1 define dso_local zeroext i1 @​sadd_nolog(i32 %0, i32 %1, i32 nocapture %2) local_unnamed_addr { %4 = load i8, i8 @​do_log, align 1 store i8 0, i8 @​do_log, align 1 store i8 %4, i8 @​do_log, align 1 ret i1 1 } ////////////////////////

Seems as though the store in the middle prevents the optimizations from doing their best, but again I don't exactly know why.

jdoerfert commented 4 years ago

I'd guess dead store elimination runs to late and we do not run load to store optimizations (of some form in some pass) after.

9ffb4415-2ba4-4316-b043-121cc6098a61 commented 4 years ago

I have a reduced testcase:

///////////////////////// NOT OPTIMIZED @​do_log = dso_local local_unnamed_addr global i8 0, align 1 define dso_local zeroext i1 @​sadd_nolog(i32 %0, i32 %1, i32 nocapture %2) local_unnamed_addr { %4 = load i8, i8 @​do_log, align 1 store i8 0, i8 @​do_log, align 1 store i32 0, i32 %2, align 4 store i8 %4, i8* @​do_log, align 1 ret i1 1 } ////////////////////////

//////////////////////// OPTIMIZED @​do_log = dso_local local_unnamed_addr global i8 0, align 1 define dso_local zeroext i1 @​sadd_nolog(i32 %0, i32 %1, i32 nocapture %2) local_unnamed_addr #​0 { %4 = load i8, i8 @​do_log, align 1 store i32 0, i32 %2, align 4 store i8 %4, i8 @​do_log, align 1 ret i1 true } /////////////////////////

I think this has something to do with how DSE is done, but I don't know.

jdoerfert commented 4 years ago

One needs to figure out what pass does eventually remove the load store pair: https://gcc.godbolt.org/z/9qBvuP it's gone in the lower part after we run -Os again. Then one can probably figure out why it didn't trigger the first time around.

c8cd4046-bf2f-4470-bb42-add0a3666ec0 commented 4 years ago

Interesting. Providing an initializer (_Bool do_log; => _Bool do_log=0;) stops clang from acting as if it didn't know what the last stored value to do_log was.

Adding _Thread_local has the same effect (even without an explicit initializer).

https://gcc.godbolt.org/z/3NYF2H

But there's still the load/store pair from/to the global, which effectively does nothing:

sadd_nolog: mov cl, byte ptr [rip + do_log] ///<<<<<<<<<unnecessary add edi, esi seto al mov dword ptr [rdx], edi mov byte ptr [rip + do_log], cl ///<<<<<<<<<unnecessary ret

9ffb4415-2ba4-4316-b043-121cc6098a61 commented 4 years ago

It seems like the root cause of this is in ObjectSizeOffsetVisitor::visitGlobalVariable (MemoryBuiltins.cpp)

///////////////////////////////////// if (!GV.hasDefinitiveInitializer()) return unknown(); /////////////////////////////////////

I'm kind of confused as to why this is here. Do we need to know anything about the global's initializer to compute size? Commenting this out fixes the issue, but I don't know what the potential side effects are.

jdoerfert commented 4 years ago

Interestingly enough, this does not happen in C++:

The difference is the linkage type of do_log. It is removed in the middle end as well. See right windows here: https://godbolt.org/z/E-ab6I

9ffb4415-2ba4-4316-b043-121cc6098a61 commented 4 years ago

Interestingly enough, this does not happen in C++: https://godbolt.org/z/L7PgmE

/////////////////////////////

void errlog(char const*);

bool do_log; bool sadd(int A, int B, int *R) { if(__builtin_add_overflow(A,B,R)){ if (do_log) errlog("overflow"); return 1; } return 0; }

bool sadd_nolog(int A,int B, int *R) { bool r; bool old_log_settings = do_log; do_log=0; r = sadd(A,B,R); do_log=old_log_settings; return r; }

/////////////////////////////

Though there is still a dead load/store pair for some reason.

LebedevRI commented 4 years ago

This looks like load-store propagation + dead/paired store elimination to me, we have:

define dso_local zeroext i1 @​sadd_nolog(i32 %A, i32 %B, i32 nocapture %R) local_unnamed_addr #​0 { entry: %0 = load i8, i8 @​do_log, align 1, !tbaa !​2, !range !​6 store i8 0, i8 @​do_log, align 1, !tbaa !​2 %1 = tail call { i32, i1 } @​llvm.sadd.with.overflow.i32(i32 %A, i32 %B) #​3 %2 = extractvalue { i32, i1 } %1, 1 %3 = extractvalue { i32, i1 } %1, 0 store i32 %3, i32 %R, align 4 br i1 %2, label %if.then.i, label %sadd.exit

if.then.i: ; preds = %entry %4 = load i8, i8* @​do_log, align 1, !tbaa !​2, !range !​6 %tobool.i = icmp eq i8 %4, 0 br i1 %tobool.i, label %sadd.exit, label %if.then1.i

if.then1.i: ; preds = %if.then.i tail call void @​errlog(i8 getelementptr inbounds ([9 x i8], [9 x i8] @.str, i64 0, i64 0)) #​4 br label %sadd.exit

sadd.exit: ; preds = %entry, %if.then.i, %if.then1.i %retval.0.i = phi i1 [ true, %if.then.i ], [ true, %if.then1.i ], [ false, %entry ] store i8 %0, i8* @​do_log, align 1, !tbaa !​2 ret i1 %retval.0.i }

basic block %if.then.i is only entered from %entry basic block, so the %4 we loaded we know is always 0 because that is the value we just stored, and that store couldn't have been overriden by anything.

If we do propagate the load from store, we still have redundant load-store-load pairs,

https://gcc.godbolt.org/z/KLaeiH