At first glance, it isn't obvious to me that this transform is valid. pt2 could have subclasses that store data in the tail padding, which ICC's 8-byte store overwrites.
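The concern can be sketched in C++ as follows. This is only an illustration: `pt2_derived`, `z_offset`, and `z_after_base_reset` are hypothetical names that do not appear in the report, and whether `z` is actually placed in the tail padding of `pt2` depends on the ABI (the Itanium C++ ABI may reuse it for a class like this, other ABIs may not).

```cpp
#include <cstddef>

class pt2 { int x; char y; };            // same class as in the report

// Hypothetical derived class (not from the report) that adds one byte.
struct pt2_derived : pt2 { char z; };

// Byte offset of z within a pt2_derived object. If the ABI reuses the
// tail padding of pt2, this offset is smaller than sizeof(pt2).
std::ptrdiff_t z_offset() {
    pt2_derived d{};
    return reinterpret_cast<const char*>(&d.z) -
           reinterpret_cast<const char*>(&d);
}

// Resets only the pt2 base subobject, as bar() does, and returns z.
// The language requires z to be unchanged, so a store that clears all
// sizeof(pt2) bytes is only safe if z does not live in that range.
char z_after_base_reset() {
    pt2_derived d{};
    d.z = 'A';
    pt2* base = &d;
    *base = {};
    return d.z;
}
```

Comparing `z_offset()` with `sizeof(pt2)` on a given target shows whether a full-width store through a `pt2*` could clobber `z` there.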
Extended Description
class pt { int x; int y; };
class pt2 { int x; char y; };
void foo(pt *s) { *s = {}; }
void bar(pt2 *s) { *s = {}; }
For the foo case, codegen is fine. For bar, the padding in pt2 is the problem.
Clang:
bar(pt2*):
        mov     DWORD PTR [rdi], 0
        mov     BYTE PTR [rdi+4], 0
        ret
ICC:
bar(pt2*):
        xor     eax, eax                  #18.4
        mov     QWORD PTR [rdi], rax      #18.8
        ret
So ideally we should have: mov QWORD PTR [rdi], 0
define dso_local void @_Z3fooP2pt(%class.pt* nocapture %0) local_unnamed_addr #0 {
  %2 = bitcast %class.pt* %0 to i64*
  store i64 0, i64* %2, align 4
  ret void
}
define dso_local void @_Z3barP3pt2(%class.pt2* nocapture %0) local_unnamed_addr #1 {
  %2 = bitcast %class.pt2* %0 to i40*
  store i40 0, i40* %2, align 4, !tbaa.struct !2
  ret void
}
Looking at the pass dumps, is SROA to blame?
With class pt3 { int x; int y; char z; }; and an analogous baz(pt3*):
Unoptimized:
define dso_local void @_Z3bazP3pt3(%class.pt3* %0) #0 {
  %2 = alloca %class.pt3*, align 8
  %3 = alloca %class.pt3, align 4
  store %class.pt3* %0, %class.pt3** %2, align 8
  %4 = bitcast %class.pt3* %3 to i8*
  call void @llvm.memset.p0i8.i64(i8* align 4 %4, i8 0, i64 12, i1 false)
  %5 = load %class.pt3*, %class.pt3** %2, align 8
  %6 = bitcast %class.pt3* %5 to i8*
  %7 = bitcast %class.pt3* %3 to i8*
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* align 4 %6, i8* align 4 %7, i64 9, i1 false)
  ret void
}
Optimized:
define dso_local void @_Z3bazP3pt3(%class.pt3* nocapture %0) local_unnamed_addr #2 {
  %2 = bitcast %class.pt3* %0 to i8*
  call void @llvm.memset.p0i8.i64(i8* nonnull align 4 dereferenceable(9) %2, i8 0, i64 9, i1 false)
  ret void
}
In this case, we are doing a bad job of combining the memset and memcpy in MemCpyOptimizer. It should be:
call void @llvm.memset.p0i8.i64(i8* nonnull align 4 dereferenceable(12) %2, i8 0, i64 12, i1 false)
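To make the difference between the 9-byte and the 12-byte memset concrete, here is a minimal byte-level sketch of what the unoptimized IR does. It is a model only: the 12-byte buffers stand in for a pt3 object, and `untouched_tail_bytes` is a hypothetical helper, not something from the report or from LLVM.

```cpp
#include <array>
#include <cstddef>
#include <cstring>

// Models the unoptimized sequence: zero a 12-byte temporary, then copy
// only its first 9 bytes into the destination. Returns how many bytes
// of the destination keep their original 0xFF marker.
std::size_t untouched_tail_bytes() {
    std::array<unsigned char, 12> dest;
    dest.fill(0xFF);                                // marker pattern

    std::array<unsigned char, 12> tmp;
    std::memset(tmp.data(), 0, tmp.size());         // memset(tmp, 0, 12)
    std::memcpy(dest.data(), tmp.data(), 9);        // memcpy(dest, tmp, 9)

    std::size_t untouched = 0;
    for (unsigned char b : dest) {
        if (b == 0xFF) ++untouched;
    }
    return untouched;
}
```

In this model the last three bytes of the destination are never written, which matches the 9-byte memset in the optimized IR. The proposed 12-byte memset would also write those bytes, so it is only equivalent if nothing else may live in the tail padding.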