Open vzakhari opened 1 year ago
@nikic, can you please take a look?
The diff looks beneficial at the IR level -- we basically end up realizing that a certain loop only has one iteration and can constant fold various instructions in the loop. Presumably the regression is due to something later in the pipeline, but I can't tell at a glance what it would be -- the test case and IR diffs are too large.
My only guess purely from looking at the IR is this pattern:
%2014 = getelementptr [3 x [3 x [3 x [3 x double]]]], ptr %68, i64 0, <4 x i64> <i64 0, i64 1, i64 2, i64 poison>, i64 2, i64 %2009, i64 2, !dbg !331
%wide.masked.gather1873 = call <4 x double> @llvm.masked.gather.v4f64.v4p0(<4 x ptr> %2014, i32 8, <4 x i1> <i1 true, i1 true, i1 true, i1 false>, <4 x double> poison), !dbg !331, !tbaa !60
Which repeats quite a few times. Previously the GEP index was a variable rather than a vector constant.
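To put a number on "repeats quite a few times", a rough text-based count over the two IR dumps may help. The sketch below is only a heuristic: it scans the textual IR with regular expressions rather than using any LLVM API, the function name count_const_index_gathers is made up for illustration, and it assumes each GEP and gather call sits on a single line, as in the snippet above.

```python
import re

# A GEP definition whose index list contains a constant vector, e.g.
#   %2014 = getelementptr ..., <4 x i64> <i64 0, i64 1, i64 2, i64 poison>, ...
GEP_CONST_VEC = re.compile(
    r"^\s*(%[\w.]+)\s*=\s*getelementptr\b.*<\d+ x i\d+>\s*<i\d+ "
)

# A masked gather call; group 1 captures its pointer-vector operand.
GATHER = re.compile(
    r"call\s+<[^>]+>\s+@llvm\.masked\.gather\.[\w.]+\(\s*<\d+ x ptr>\s+(%[\w.]+)"
)


def count_const_index_gathers(ir_text):
    """Count masked gathers fed by a GEP that has a constant vector index."""
    const_vec_geps = set()
    hits = 0
    for line in ir_text.splitlines():
        # SSA names are local to a function, so reset at each definition.
        if line.startswith("define "):
            const_vec_geps.clear()
            continue
        gep = GEP_CONST_VEC.match(line)
        if gep:
            const_vec_geps.add(gep.group(1))
            continue
        gather = GATHER.search(line)
        if gather and gather.group(1) in const_vec_geps:
            hits += 1
    return hits
```

Running this on the decompressed before_icp.llvm and after_icp.llvm text and comparing the two counts would show how often the constant-vector form actually appears after InstCombine.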
The regression happened after https://reviews.llvm.org/rGd01aec4c769d50fb92e86decd41d077c94105841 with Flang and -Ofast -march=native on Icelake server. Reverting the patch brings performance back.
The whole slowdown is in the loopnest at e_c3d.f:675. There are many reported 4k aliasing events, which I think are caused by extra register spills inside the loop. llvm-mca also suggests that the code generated for the loopnest after the patch is much slower than before the patch.
Files with more information and a reproducer:
mca_36.gz - the loopnest assembly and llvm-mca report before the patch (rev: c0a36a1)
mca_37.gz - the loopnest assembly and llvm-mca report after the patch (rev: d01aec4)
before_icp.llvm.gz - LLVM IR for the e_c3d routine containing the loopnest
after_icp.llvm.gz - LLVM IR after opt before_icp.llvm -S -o - --passes=instcombine
InstCombine reorders instructions in the loop body quite a bit, and it looks like this ends up increasing the register pressure somehow. It does not look like the reordering is the intended behavior of the patch, so can we avoid it?
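One way to check the register-pressure theory is to count the spill and reload annotations in the two assembly listings. The sketch below assumes the assembly still carries the comments LLVM's asm printer attaches to spill code (for example "# 32-byte Spill" or "# 32-byte Folded Reload"). Whether the attached listings retain those comments is not confirmed here, and count_spill_traffic is a made-up helper name.

```python
import re

# Comments LLVM's asm printer attaches to spill code, e.g.
#   vmovups %ymm0, 32(%rsp)   # 32-byte Spill
#   vaddpd 64(%rsp), %ymm1, %ymm1   # 32-byte Folded Reload
SPILL_COMMENT = re.compile(r"#\s*\d+-byte\s+(?:Folded\s+)?(Spill|Reload)\b")


def count_spill_traffic(asm_text):
    """Return how many spill and reload annotations appear in an asm listing."""
    counts = {"spill": 0, "reload": 0}
    for line in asm_text.splitlines():
        match = SPILL_COMMENT.search(line)
        if match:
            counts[match.group(1).lower()] += 1
    return counts
```

If the "after" listing shows clearly more spills and reloads inside the loopnest than the "before" listing, that would support the idea that the reordering raises register pressure.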