llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org

[InstCombine] CPU2006/454.calculix 38% slow-down after `Set dead phi inputs to poison in more cases` #64686

Open vzakhari opened 1 year ago

vzakhari commented 1 year ago

The regression happened after https://reviews.llvm.org/rGd01aec4c769d50fb92e86decd41d077c94105841 with Flang and -Ofast -march=native on an Icelake server. Reverting the patch brings performance back.

The whole slow-down is in the loopnest at e_c3d.f:675. There are many reported 4K aliasing events, which I think are caused by extra register spills inside the loop. llvm-mca also suggests that the code generated for the loopnest after the patch is much slower than before the patch.
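For readers unfamiliar with the term, here is a minimal sketch of what a 4K aliasing event means, assuming the usual description for Intel cores: a load that follows a store to a different address is treated as possibly dependent on it when the two addresses agree in their low 12 bits. The helper name `may_4k_alias` and the addresses are made up for illustration, and access sizes are ignored.

```python
# Illustrative only: models the address condition, not the actual hardware logic.
PAGE_OFFSET_MASK = 0xFFF  # low 12 bits, i.e. the offset within a 4 KiB page


def may_4k_alias(store_addr: int, load_addr: int) -> bool:
    """Return True if a load could be falsely matched against an earlier store.

    The two accesses go to different locations, but their addresses share the
    same low 12 bits, so a partial address comparison cannot tell them apart.
    """
    same_page_offset = (store_addr & PAGE_OFFSET_MASK) == (load_addr & PAGE_OFFSET_MASK)
    return store_addr != load_addr and same_page_offset


if __name__ == "__main__":
    spill_slot = 0x7FFC1040     # hypothetical stack spill slot
    array_element = 0x50002040  # hypothetical array element, same page offset
    print(may_4k_alias(spill_slot, array_element))      # True
    print(may_4k_alias(spill_slot, array_element + 8))  # False
```

Extra spill stores inside the loop add more stack addresses that can collide this way with the array loads, which would be consistent with the guess above.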

Files with more information and a reproducer:

mca_36.gz - the loopnest assembly and llvm-mca report before the patch (rev: c0a36a1)

mca_37.gz - the loopnest assembly and llvm-mca report after the patch (rev: d01aec4)

before_icp.llvm.gz - LLVM IR for e_c3d routine containing the loopnest

after_icp.llvm.gz - LLVM IR after opt before_icp.llvm -S -o - --passes=instcombine

InstCombine reorders instructions in the loop body quite a bit, and it looks like this ends up increasing the register pressure somehow. It does not look like the reordering is the intended behavior of the patch, so can we avoid it?

vzakhari commented 1 year ago

@nikic, can you please take a look?

nikic commented 1 year ago

The diff looks beneficial at the IR level -- we basically end up realizing that a certain loop only has one iteration and can constant fold various instructions in the loop. Presumably the regression is due to something later in the pipeline, but I can't tell at a glance what it would be -- the test case and IR diffs are too large.

My only guess purely from looking at the IR is this pattern:

  %2014 = getelementptr [3 x [3 x [3 x [3 x double]]]], ptr %68, i64 0, <4 x i64> <i64 0, i64 1, i64 2, i64 poison>, i64 2, i64 %2009, i64 2, !dbg !331
  %wide.masked.gather1873 = call <4 x double> @llvm.masked.gather.v4f64.v4p0(<4 x ptr> %2014, i32 8, <4 x i1> <i1 true, i1 true, i1 true, i1 false>, <4 x double> poison), !dbg !331, !tbaa !60

Which repeats quite a few times. Previously the GEP index was a variable rather than a vector constant.
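To make the pattern concrete, here is a rough sketch of which byte offsets from %68 the three active gather lanes would touch, worked out from the GEP type and indices above. The function names are made up for illustration, and the value of %2009 is not known from the snippet, so `i` below is a placeholder.

```python
# Illustrative only: recomputes GEP offsets for
#   getelementptr [3 x [3 x [3 x [3 x double]]]], ptr %68,
#       i64 0, <0, 1, 2, poison>, i64 2, i64 %2009, i64 2
ELEM_SIZE = 8        # sizeof(double) in bytes
DIMS = (3, 3, 3, 3)  # array extents, outermost first


def strides_bytes(dims=DIMS, elem_size=ELEM_SIZE):
    """Byte stride of each array index, outermost first."""
    strides = []
    stride = elem_size
    for extent in reversed(dims):
        strides.append(stride)
        stride *= extent
    return list(reversed(strides))


def gather_lane_offsets(i, lanes=(0, 1, 2, None), mask=(True, True, True, False)):
    """Byte offset from the base pointer for each gather lane.

    `i` stands in for %2009. Masked-off or poison lanes yield None.
    The leading `i64 0` index contributes nothing to the offset.
    """
    s = strides_bytes()
    offsets = []
    for lane, active in zip(lanes, mask):
        if not active or lane is None:
            offsets.append(None)
            continue
        offsets.append(lane * s[0] + 2 * s[1] + i * s[2] + 2 * s[3])
    return offsets


if __name__ == "__main__":
    print(strides_bytes())         # [216, 72, 24, 8]
    print(gather_lane_offsets(0))  # [160, 376, 592, None]
```

So each gather reads three doubles that are 216 bytes apart, with the fourth lane masked off.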