Closed anton-v-gorshkov closed 1 year ago
This issue is stale because it has been open 180 days with no activity. Remove stale label or comment or this will be automatically closed in 30 days.
This issue was fixed for Intel GPUs but still relevant for NVidia - so we have to fix it I guess.
The ticket description doesn't mention that this is an issue for NVidia. Could you check whether that's the case and update the description, please?
Hi Alexey! Thanks for noticing that, updated.
Hi @anton-v-gorshkov, I've looked into the issue a bit. The problem is not the compiler failing to unroll the loop; it is being correctly unrolled, see a debug snippet from the loop unroll pass:
Loop Unroll: F[_Z25CholeskyDecompositionImplPfiN2cl4sycl7nd_itemILi1EEE] Loop %for.body109
Loop Size = 21
Exiting block %if.end126: TripCount=2, TripMultiple=0, BreakoutTrip=0
COMPLETELY UNROLLING loop %for.body109 with trip count 2!
I've changed the floating point constants, so it's easier to chase the values:
#ifdef LOOP
  #pragma unroll
  for (int k = 0; k < N; ++k) {
    const int colIdx = x + 32 * k;
    if (colIdx == col) {
      currentLine[k] =
          ljj > 0 ? (currentLine[k] - sum) / (ljj + 5.0f) : 6.0f;
    }
  }
#elif defined MANUAL_UNROLLING
  if (x == col) {
    currentLine[0] =
        ljj > 0 ? (currentLine[0] - sum) / (ljj + 5.0f) : 6.0f;
  }
  if (x + 32 == col) {
    currentLine[1] =
        ljj > 0 ? (currentLine[1] - sum) / (ljj + 5.0f) : 6.0f;
  }
#endif
The real problem here is that the manually unrolled code has significantly more complicated control flow. The snippet in question ends up being split into 7 basic blocks (manual unroll) vs 5 (default). I'm going to use IR here, but exactly the same principles can be observed in the generated PTX.
Manually unrolled version (MANUAL_UNROLLING, 7 basic blocks):
_Z13ShuffleReduceIf8TCudaAddIfEET_iS2_iN2cl4sycl7nd_itemILi1EEET0_.exit392: ; preds = %for.body.i391, %if.end91
%val.addr.0.lcssa.i385 = phi float [ %tmp.1, %if.end91 ], [ %add.i.i389, %for.body.i391 ]
%25 = bitcast float %val.addr.0.lcssa.i385 to i32
%26 = tail call i32 @llvm.nvvm.shfl.sync.idx.i32(i32 -1, i32 %25, i32 %rem411, i32 31)
%27 = bitcast i32 %26 to float
%28 = load volatile float, float* %arrayidx, align 4, !tbaa !16
%cmp106 = icmp eq i32 %conv7, %col.0412
br i1 %cmp106, label %if.then107, label %if.end117
if.then107: ; preds = %_Z13ShuffleReduceIf8TCudaAddIfEET_iS2_iN2cl4sycl7nd_itemILi1EEET0_.exit392
%cmp108 = fcmp ogt float %28, 0.000000e+00
br i1 %cmp108, label %cond.true109, label %if.end117
cond.true109: ; preds = %if.then107
%sub = fsub float %currentLine.sroa.0.0414, %27
%add111 = fadd float %28, 5.000000e+00
%div112 = fdiv float %sub, %add111
br label %if.end117
if.end117: ; preds = %cond.true109, %if.then107, %_Z13ShuffleReduceIf8TCudaAddIfEET_iS2_iN2cl4sycl7nd_itemILi1EEET0_.exit392
%currentLine.sroa.0.1 = phi float [ %currentLine.sroa.0.0414, %_Z13ShuffleReduceIf8TCudaAddIfEET_iS2_iN2cl4sycl7nd_itemILi1EEET0_.exit392 ], [ %div112, %cond.true109 ], [ 6.000000e+00, %if.then107 ]
%cmp119 = icmp eq i32 %add25, %col.0412
br i1 %cmp119, label %if.then120, label %if.end131
if.then120: ; if ljj > 0 ; preds = %if.end117
%cmp121 = fcmp ogt float %28, 0.000000e+00
br i1 %cmp121, label %cond.true122, label %if.end131
cond.true122: ; preds = %if.then120
%sub124 = fsub float %currentLine.sroa.11.0415, %27
%add125 = fadd float %28, 5.000000e+00
%div126 = fdiv float %sub124, %add125
br label %if.end131
if.end131: ; preds = %cond.true122, %if.then120, %if.end117
%currentLine.sroa.11.1 = phi float [ %currentLine.sroa.11.0415, %if.end117 ], [ %div126, %cond.true122 ], [ 6.000000e+00, %if.then120 ]
tail call void @llvm.nvvm.bar.warp.sync(i32 -1) #6
%inc = add nuw nsw i32 %col.0412, 1
%exitcond.not = icmp eq i32 %inc, %row.0416
br i1 %exitcond.not, label %for.cond.cleanup43, label %for.body44, !llvm.loop !23
Default #pragma unroll version (LOOP, 5 basic blocks):
_Z13ShuffleReduceIf8TCudaAddIfEET_iS2_iN2cl4sycl7nd_itemILi1EEET0_.exit378: ; preds = %for.body.i377, %if.end91
%val.addr.0.lcssa.i371 = phi float [ %tmp.1, %if.end91 ], [ %add.i.i375, %for.body.i377 ]
%25 = bitcast float %val.addr.0.lcssa.i371 to i32
%26 = tail call i32 @llvm.nvvm.shfl.sync.idx.i32(i32 -1, i32 %25, i32 %rem397, i32 31)
%27 = bitcast i32 %26 to float
%28 = load volatile float, float* %arrayidx, align 4, !tbaa !16
%cmp115 = fcmp ogt float %28, 0.000000e+00
%add119 = fadd float %28, 5.000000e+00
%cmp113 = icmp ne i32 %conv7, %col.0399
%cmp115.not = xor i1 %cmp115, true
%brmerge = select i1 %cmp113, i1 true, i1 %cmp115.not
%currentLine.sroa.0.0.mux = select i1 %cmp113, float %currentLine.sroa.0.0, float 6.000000e+00
br i1 %brmerge, label %if.end126, label %cond.true116
cond.true116: ; preds = %_Z13ShuffleReduceIf8TCudaAddIfEET_iS2_iN2cl4sycl7nd_itemILi1EEET0_.exit378
%sub = fsub float %currentLine.sroa.0.0, %27
%div120 = fdiv float %sub, %add119
br label %if.end126
if.end126: ; preds = %_Z13ShuffleReduceIf8TCudaAddIfEET_iS2_iN2cl4sycl7nd_itemILi1EEET0_.exit378, %cond.true116
%currentLine.sroa.0.1 = phi float [ %currentLine.sroa.0.0.mux, %_Z13ShuffleReduceIf8TCudaAddIfEET_iS2_iN2cl4sycl7nd_itemILi1EEET0_.exit378 ], [ %div120, %cond.true116 ]
%cmp113.1 = icmp ne i32 %add25, %col.0399
%cmp115.not413 = xor i1 %cmp115, true
%brmerge414 = select i1 %cmp113.1, i1 true, i1 %cmp115.not413
%currentLine.sroa.9.0.mux = select i1 %cmp113.1, float %currentLine.sroa.9.0, float 6.000000e+00
br i1 %brmerge414, label %if.end126.1, label %cond.true116.1
cond.true116.1: ; preds = %if.end126
%sub.1 = fsub float %currentLine.sroa.9.0, %27
%div120.1 = fdiv float %sub.1, %add119
br label %if.end126.1
if.end126.1: ; preds = %if.end126, %cond.true116.1
%currentLine.sroa.9.1 = phi float [ %currentLine.sroa.9.0.mux, %if.end126 ], [ %div120.1, %cond.true116.1 ]
tail call void @llvm.nvvm.bar.warp.sync(i32 -1) #6
%inc130 = add nuw nsw i32 %col.0399, 1
%exitcond.not = icmp eq i32 %inc130, %row.0401
br i1 %exitcond.not, label %for.cond.cleanup43, label %for.body44, !llvm.loop !23
In the default loop unrolled case, notice how the updates to currentLine[0] and currentLine[1] (%currentLine.sroa.0.0.mux and %currentLine.sroa.9.0.mux respectively) are always performed, as both _Z13ShuffleReduceIf8TCudaAddIfEET_iS2_iN2cl4sycl7nd_itemILi1EEET0_.exit378 and if.end126 are executed unconditionally; versus the manually unrolled snippet, where the corresponding values are updated conditionally and a single phi gathers the final value (%currentLine.sroa.0.1 and %currentLine.sroa.11.1).
There are a couple of points worth mentioning here. Given the context of unrolling (among other things, the all-important saving of the jump back to the loop body), I'm not sure the compiler is in the wrong for not generating the more complicated control flow.
This optimisation only really works because the trip count is equal to the unroll factor and the innermost if can be broken down into exactly two separate conditions (multiply 32 by 0 or not; again, the number of cases equals the trip count). I'm not sure how easy it would be to generalize that; it's almost as if we asked the compiler to recognise that a loop over 2 indices is equivalent to an if/else block, and there seem to be better ways of expressing conditionals :).
Finally, this is not strictly oneAPI specific, and any work that could be done here would end up upstream, so perhaps it would be better to create a synthetic example and submit it to the upstream bug tracker?
Would you be able to share how this was tackled for Intel GPUs, maybe there is something that could be generalized?
Thanks @jchlanda! I'm not a compiler guy, so we need to talk to Intel compiler folks here. @bader and @AlexeySachkov may help. Also note that due to Intel operations suspension in Russia, local folks (including myself) may be unavailable.
I found that the problem with compilation for Intel GPUs was wrong handling of unroll metadata in SPIR-V translator, which we fixed with https://github.com/KhronosGroup/SPIRV-LLVM-Translator/pull/1081/. NOTE: DPC++ leaves loop optimizations for IGC, so to debug the optimization process for Intel GPUs you should debug IGC compiler. https://github.com/intel/intel-graphics-compiler
I've revisited this ticket and decided to close it. The problem is not LLVM failing to unroll, but the quality of the source code and the developers exploiting a peculiarity of this sample.
I don't see any other valid transformation of the loop that would improve the code. It's not a simple invariant code motion plus loop versioning issue, because the predicate of the condition depends on the iteration. While I agree that there might be a transformation of value in this specific case, it would not improve code size, is unlikely to improve dynamic instruction count in the general case, and would ultimately work around inefficient control flow in the original code sample.
Bug Description
For a specific loop that could be unrolled, the compiler generates an inefficient native binary.
The loop looks like this (kernlels.cc:350 in reproducer):
Here N is a constexpr equal to 2, so the compiler should be able to unroll this loop. In reality one can observe twice the dynamic instruction count for the loop, which leads to visible performance overhead. I'm not sure, though, whether it's connected with unrolling directly. A workaround may look like this (kernlels.cc:358 in reproducer):
This code works as expected, without the extra dynamic instruction count.
To Reproduce
Reproducer is in attachment: cholesky.zip
Environment (please complete the following information):
THIS ISSUE WAS FIXED FOR INTEL GRAPHICS BUT STILL RELEVANT FOR NVIDIA GPUS
To Reproduce
Reproducer is in attachment: cholesky_nv.zip