llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org
Other
29.26k stars 12.09k forks source link

[AMDGPU]Add+GEP->GEP commit performance degradation on AMDGPU #94092

Open lfmeadow opened 5 months ago

lfmeadow commented 5 months ago

AMD have traced some significant (8-20%) slowdowns on several of AMD's rocFFT kernels to the commit mentioned in https://github.com/llvm/llvm-project/issues/78214 . The llvm commit is https://github.com/llvm/llvm-project/commit/e13bed4c5f3544c076ce57e36d9a11eefa5a7815. Briefly, we have identified two problems:

  1. The commit causes simplifycfg to bail out for fear of creating too many PHI nodes (even though they don't really get created). This has cascade effects on the backend register allocation pipeline, causing the kernel to use many more Vector GPRs (VGPRS) on AMD CDNA architecture GPUs. This in turn drops occupancy with a corresponding decrease in performance. See simplifycfg-repro
  2. A more subtle effect occurs when the transformation results in more CSE opportunities. This ends up increasing the VGPR usage as well, similarly dropping the occupancy. The performance loss is less severe but is still in the 8% range. See regalloc-repro There are a lot of details on github as noted above. I will briefly summarize. For issue 1, simplifycfg bails with SINK: stopping here, too many PHIs would be created!. I hacked a tunable for simplifycfg as shown in the patch thus allowing 4 PHI nodes, and the resulting code regains the performance. For issue 2, there is no obvious smoking gun; as far as I can tell it just puts more pressure on the register allocation path to put partial address expressions in registers, with no obvious benefit.

Ideally the GEP rewrite commit would be withdrawn. I don't see any way to fix this otherwise. I do not understand why simplifycfg is failing to restructure this code.

Thanks for your consideration

llvmbot commented 5 months ago

@llvm/issue-subscribers-backend-amdgpu

Author: Larry Meadows (lfmeadow)

AMD have traced some significant (8-20%) slowdowns on several of AMD's rocFFT kernels to the commit mentioned in #https://github.com/llvm/llvm-project/issues/78214 . The specific commit is
dtcxzyw commented 5 months ago

cc @nikic

nikic commented 5 months ago

For SimplifyCFG the problem is basically that, when considering sinking for each instruction individually, it would create many phi nodes. However, if we sink multiple instructions, it turns out that most of the phi nodes are actually not needed (because the phi operands are also sunk instructions). It should be possible to teach SimplifyCFG about that.

nikic commented 5 months ago

Hm actually it looks like there's already a check for this here: https://github.com/llvm/llvm-project/blob/86bb5c8427346aafaafa42fbf96e405ae4ca07bf/llvm/lib/Transforms/Utils/SimplifyCFG.cpp#L2342 So the issue must be something else...

nikic commented 5 months ago

I think the actual problem may be the one-use check in: https://github.com/llvm/llvm-project/blob/86bb5c8427346aafaafa42fbf96e405ae4ca07bf/llvm/lib/Transforms/Utils/SimplifyCFG.cpp#L1957-L1961 Some of the instructions also have an extra use in a phi node in the sink block, so they'll not be considered sinking candidates. Not immediately obvious to me why we need the one-use restriction.

nikic commented 5 months ago

Reduced test case to illustrate the problem: https://llvm.godbolt.org/z/q6Eqdj8K7

define ptr @test(i1 %c, ptr %p, i64 %a, i64 %b) {
    br i1 %c, label %if, label %else
if:
    call void @dummy()
    %gep1.a = getelementptr i8, ptr %p, i64 %a
    %gep2.a = getelementptr i8, ptr %gep1.a, i64 %b
    br label %join
else:
    %gep1.b = getelementptr i8, ptr %p, i64 %a
    %gep2.b = getelementptr i8, ptr %gep1.b, i64 %b
    br label %join
join:
    %phi1 = phi ptr [ %gep1.a, %if ], [ %gep1.b, %else ]
    %phi2 = phi ptr [ %gep2.a, %if ], [ %gep2.b, %else ]
    call void @use(ptr %phi1)
    ret ptr %phi2
}

declare void @dummy()
declare void @use(ptr)

Produces:


define ptr @test(i1 %c, ptr %p, i64 %a, i64 %b) {
  br i1 %c, label %if, label %else

if:                                               ; preds = %0
  call void @dummy()
  %gep1.a = getelementptr i8, ptr %p, i64 %a
  br label %join

else:                                             ; preds = %0
  %gep1.b = getelementptr i8, ptr %p, i64 %a
  br label %join

join:                                             ; preds = %else, %if
  %gep1.b.sink = phi ptr [ %gep1.b, %else ], [ %gep1.a, %if ]
  %phi1 = phi ptr [ %gep1.a, %if ], [ %gep1.b, %else ]
  %gep2.b = getelementptr i8, ptr %gep1.b.sink, i64 %b
  call void @use(ptr %phi1)
  ret ptr %gep2.b
}

This one is even extra bad because we end up with two identical phis that block the sinking that would remove those phis...

nikic commented 5 months ago

Preparatory PR: https://github.com/llvm/llvm-project/pull/94462 Once that lands, I have another patch to enable sinking of instructions with multiple uses.

lfmeadow commented 5 months ago

FYI The preparatory PR did not fix the problem in the real code; it also still complains about 'too many PHI' . I'd be glad to test your second patch when I can get it.

nikic commented 5 months ago

@lfmeadow The second patch is https://github.com/nikic/llvm-project/commit/baeab1f0db6b0c18dfa87a7abf87d8a675b308be. I haven't tried it on your test case though.

lfmeadow commented 5 months ago

@lfmeadow The second patch is nikic@baeab1f. I haven't tried it on your test case though.

Thank you. The two patches do fix the performance problem in two of our three fwd/bwd FFT pairs. The third pair still performs 8% worse; that problem is not related to simplifycfg.

nikic commented 5 months ago

The SimplifyCFG issue should be fixed by https://github.com/llvm/llvm-project/pull/95521. The other problem looks like something for AMDGPU backend developers to look at. From the description it sounds like rematerialization heuristics may need to be improved, but I know little about that.

llvmbot commented 5 months ago

@llvm/issue-subscribers-backend-amdgpu

Author: Larry Meadows (lfmeadow)

AMD have traced some significant (8-20%) slowdowns on several of AMD's rocFFT kernels to the commit mentioned in https://github.com/llvm/llvm-project/issues/78214 . The llvm commit is https://github.com/llvm/llvm-project/commit/e13bed4c5f3544c076ce57e36d9a11eefa5a7815. Briefly, we have identified two problems: 1. The commit causes simplifycfg to bail out for fear of creating too many PHI nodes (even though they don't really get created). This has cascade effects on the backend register allocation pipeline, causing the kernel to use many more Vector GPRs (VGPRS) on AMD CDNA architecture GPUs. This in turn drops occupancy with a corresponding decrease in performance. See [simplifycfg-repro](https://github.com/lfmeadow/llvm-issues/tree/acf2bea04bbacb821fd81bb4cdd958c1dda6f5ad/simplifycfg-repro) 2. A more subtle effect occurs when the transformation results in more CSE opportunities. This ends up increasing the VGPR usage as well, similarly dropping the occupancy. The performance loss is less severe but is still in the 8% range. See [regalloc-repro](https://github.com/lfmeadow/llvm-issues/tree/acf2bea04bbacb821fd81bb4cdd958c1dda6f5ad/regalloc-repro) There are a lot of details on github as noted above. I will briefly summarize. For issue 1, simplifycfg bails with [SINK: stopping here, too many PHIs would be created!](https://github.com/lfmeadow/llvm-issues/blob/acf2bea04bbacb821fd81bb4cdd958c1dda6f5ad/simplifycfg-repro/opt-extract-true.log#L279-L281). I hacked a tunable for simplifycfg as shown in the [patch](https://github.com/lfmeadow/llvm-issues/blob/acf2bea04bbacb821fd81bb4cdd958c1dda6f5ad/simplifycfg-repro/gep.patch#:~:text=gep.-,patch,-kernel%2Dgep%2Dphi4) thus allowing 4 PHI nodes, and the resulting code regains the performance. For issue 2, there is no obvious smoking gun; as far as I can tell it just puts more pressure on the register allocation path to put partial address expressions in registers, with no obvious benefit. Ideally the GEP rewrite commit would be withdrawn. I don't see any way to fix this otherwise. I do not understand why simplifycfg is failing to restructure this code. Thanks for your consideration
appujee commented 5 months ago

I wonder if using gvnsink would have addressed this?