Apparently "opt -dse" partially optimizes that testcase, but "opt -dse -dse" eliminates the load. I guess DSE has pass-ordering issues relative to itself.
Here's another interesting testcase:
///////////////////////// NOT OPTIMIZED
@do_log = dso_local local_unnamed_addr global i8 0, align 1

define dso_local zeroext i1 @sadd_nolog(i32 %0, i32 %1, i32* nocapture %2) local_unnamed_addr {
  %4 = load i8, i8* @do_log, align 1
  store i8 0, i8* @do_log, align 1
  store i32 0, i32* %2, align 4
  store i8 %4, i8* @do_log, align 1
  ret i1 1
}
////////////////////////
///////////////////////// OPTIMIZED
@do_log = dso_local local_unnamed_addr global i8 0, align 1

define dso_local zeroext i1 @sadd_nolog(i32 %0, i32 %1, i32* nocapture %2) local_unnamed_addr {
  %4 = load i8, i8* @do_log, align 1
  store i8 0, i8* @do_log, align 1
  store i8 %4, i8* @do_log, align 1
  ret i1 1
}
////////////////////////
Seems as though the store in the middle prevents the optimizations from doing their best, but again I don't exactly know why.
I'd guess dead store elimination runs too late and we do not run load-to-store optimizations (of some form, in some pass) afterwards.
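To illustrate the missing cleanup on a minimal pattern (a sketch of the intended transform, not the output of any particular pass; %p is a hypothetical pointer, marked noalias to make the "cannot write @do_log" assumption explicit):
/////////////////////////
@do_log = dso_local global i8 0, align 1

; before: the value loaded from @do_log is stored straight back unchanged, so the
; final store is a no-op, and once it is gone the load has no users and is dead too
define void @before(i32* noalias %p) {
  %v = load i8, i8* @do_log, align 1
  store i32 0, i32* %p, align 4
  store i8 %v, i8* @do_log, align 1
  ret void
}

; after the load-to-store cleanup: only the externally visible store remains
define void @after(i32* noalias %p) {
  store i32 0, i32* %p, align 4
  ret void
}
/////////////////////////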
I have a reduced testcase:
///////////////////////// NOT OPTIMIZED
@do_log = dso_local local_unnamed_addr global i8 0, align 1

define dso_local zeroext i1 @sadd_nolog(i32 %0, i32 %1, i32* nocapture %2) local_unnamed_addr {
  %4 = load i8, i8* @do_log, align 1
  store i8 0, i8* @do_log, align 1
  store i32 0, i32* %2, align 4
  store i8 %4, i8* @do_log, align 1
  ret i1 1
}
////////////////////////
//////////////////////// OPTIMIZED
@do_log = dso_local local_unnamed_addr global i8 0, align 1

define dso_local zeroext i1 @sadd_nolog(i32 %0, i32 %1, i32* nocapture %2) local_unnamed_addr #0 {
  %4 = load i8, i8* @do_log, align 1
  store i32 0, i32* %2, align 4
  store i8 %4, i8* @do_log, align 1
  ret i1 true
}
/////////////////////////
I think this has something to do with how DSE is done, but I don't know.
One needs to figure out what pass eventually removes the load/store pair: https://gcc.godbolt.org/z/9qBvuP; it's gone in the lower pane after we run -Os again. Then one can probably figure out why it didn't trigger the first time around.
Interesting. Providing an initializer (_Bool do_log; => _Bool do_log = 0;) stops clang from acting as if it didn't know what the last value stored to do_log was.
Adding _Thread_local has the same effect (even without an explicit initializer).
https://gcc.godbolt.org/z/3NYF2H
But there's still the load/store pair from/to the global, which effectively does nothing:
sadd_nolog:
        mov     cl, byte ptr [rip + do_log]     ///<<<<<<<<< unnecessary
        add     edi, esi
        seto    al
        mov     dword ptr [rdx], edi
        mov     byte ptr [rip + do_log], cl     ///<<<<<<<<< unnecessary
        ret
It seems like the root cause of this is in ObjectSizeOffsetVisitor::visitGlobalVariable (MemoryBuiltins.cpp)
/////////////////////////////////////
  if (!GV.hasDefinitiveInitializer())
    return unknown();
/////////////////////////////////////
I'm kind of confused as to why this is here. Do we need to know anything about the global's initializer to compute size? Commenting this out fixes the issue, but I don't know what the potential side effects are.
Interestingly enough, this does not happen in C++:
The difference is the linkage type of do_log.
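If that is the case, the difference at the IR level should be roughly the following (a sketch assuming clang's usual lowering of C tentative definitions with -fcommon; the two globals are alternatives, not one module):
/////////////////////////
; C "_Bool do_log;" (tentative definition) -> common linkage, for which
; GlobalVariable::hasDefinitiveInitializer() returns false, since another TU
; may still provide the real definition of the symbol
@do_log = common dso_local global i8 0, align 1

; C "_Bool do_log = 0;" or the C++ version -> plain definition whose
; zero-initializer is definitive
@do_log = dso_local global i8 0, align 1
/////////////////////////
That would also fit the _Thread_local observation above, assuming thread-local tentative definitions are not emitted with common linkage.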
It is removed in the middle end as well. See the right-hand windows here: https://godbolt.org/z/E-ab6I
Interestingly enough, this does not happen in C++: https://godbolt.org/z/L7PgmE
/////////////////////////////
void errlog(char const*);
bool do_log;

bool sadd(int A, int B, int *R) {
  if (__builtin_add_overflow(A, B, R)) {
    if (do_log)
      errlog("overflow");
    return 1;
  }
  return 0;
}
bool sadd_nolog(int A, int B, int *R) {
  bool r;
  bool old_log_settings = do_log;
  do_log = 0;
  r = sadd(A, B, R);
  do_log = old_log_settings;
  return r;
}
/////////////////////////////
Though there is still a dead load/store pair for some reason.
This looks like load-store propagation plus dead/paired store elimination to me; we have:
define dso_local zeroext i1 @sadd_nolog(i32 %A, i32 %B, i32* nocapture %R) local_unnamed_addr #0 {
entry:
  %0 = load i8, i8* @do_log, align 1, !tbaa !2, !range !6
  store i8 0, i8* @do_log, align 1, !tbaa !2
  %1 = tail call { i32, i1 } @llvm.sadd.with.overflow.i32(i32 %A, i32 %B) #3
  %2 = extractvalue { i32, i1 } %1, 1
  %3 = extractvalue { i32, i1 } %1, 0
  store i32 %3, i32* %R, align 4
  br i1 %2, label %if.then.i, label %sadd.exit

if.then.i:                                        ; preds = %entry
  %4 = load i8, i8* @do_log, align 1, !tbaa !2, !range !6
  %tobool.i = icmp eq i8 %4, 0
  br i1 %tobool.i, label %sadd.exit, label %if.then1.i

if.then1.i:                                       ; preds = %if.then.i
  tail call void @errlog(i8* getelementptr inbounds ([9 x i8], [9 x i8]* @.str, i64 0, i64 0)) #4
  br label %sadd.exit

sadd.exit:                                        ; preds = %entry, %if.then.i, %if.then1.i
  %retval.0.i = phi i1 [ true, %if.then.i ], [ true, %if.then1.i ], [ false, %entry ]
  store i8 %0, i8* @do_log, align 1, !tbaa !2
  ret i1 %retval.0.i
}
Basic block %if.then.i is only entered from the %entry block, so we know the %4 loaded there is always 0, because that is the value we just stored, and nothing could have overwritten that store in the meantime.
If we do propagate the stored value to that load, we are still left with redundant load/store pairs.
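For concreteness, here is a hand-simplified sketch (not actual pass output) of what the function becomes once the stored 0 is forwarded to %4: the %if.then.i / %if.then1.i branch folds away, and what is left is exactly that redundant pair, a dead store of 0 plus a store that writes back the value loaded at entry:
/////////////////////////
@do_log = dso_local global i8 0, align 1

declare { i32, i1 } @llvm.sadd.with.overflow.i32(i32, i32)

define dso_local zeroext i1 @sadd_nolog(i32 %A, i32 %B, i32* nocapture %R) local_unnamed_addr {
entry:
  %0 = load i8, i8* @do_log, align 1
  store i8 0, i8* @do_log, align 1        ; dead: overwritten by the store below
  %1 = tail call { i32, i1 } @llvm.sadd.with.overflow.i32(i32 %A, i32 %B)
  %2 = extractvalue { i32, i1 } %1, 1
  %3 = extractvalue { i32, i1 } %1, 0
  store i32 %3, i32* %R, align 4
  store i8 %0, i8* @do_log, align 1       ; no-op: stores back the value loaded above
  ret i1 %2
}
/////////////////////////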
Extended Description
GCC can very nicely optimize out modifications to global/thread-local objects if the new value is only used locally (no opaque calls are made) and the modification is then reverted.
For example, for:
https://gcc.godbolt.org/z/UL79D3
gcc -Os generates an 8-byte function on x86-64, while clang -Os generates one that is 64 bytes large. I don't know how difficult it is to implement such an optimization, but it would be convenient if it were one that could be counted upon.
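For comparison, this is roughly the IR that would correspond to GCC's small output once the do_log save/clear/restore is recognized as a no-op and the logging branch folds away (my own sketch of the desired end state, not compiler output):
/////////////////////////
declare { i32, i1 } @llvm.sadd.with.overflow.i32(i32, i32)

define dso_local zeroext i1 @sadd_nolog(i32 %A, i32 %B, i32* nocapture %R) local_unnamed_addr {
  %t = tail call { i32, i1 } @llvm.sadd.with.overflow.i32(i32 %A, i32 %B)
  %sum = extractvalue { i32, i1 } %t, 0
  %ovf = extractvalue { i32, i1 } %t, 1
  store i32 %sum, i32* %R, align 4
  ret i1 %ovf
}
/////////////////////////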