intel/llvm

Intel staging area for llvm.org contribution. Home for Intel LLVM-based projects.

[SYCL] Inefficient Loop Unrolling #4134

Closed anton-v-gorshkov closed 1 year ago

anton-v-gorshkov commented 3 years ago

Bug Description

For a specific loop that could be unrolled, the compiler generates an inefficient native binary.

The loop looks like this (kernlels.cc:350 in the reproducer):

#pragma unroll
for (int k = 0; k < N; ++k) {
  const int colIdx = x + SUB_GROUP_SIZE * k;
  if (colIdx == col) {
    currentLine[k] = ljj > 0 ? (currentLine[k] - sum) / (ljj + 1e-7f) : 0.0f;
  }
}

Here N is a constexpr equal to 2, so the compiler should be able to fully unroll this loop. In reality, the loop shows roughly twice the dynamic instruction count, which leads to visible performance overhead. I'm not sure, though, whether this is directly connected to unrolling.

A workaround may look like this (kernlels.cc:358 in the reproducer):

if (x == col) {
  currentLine[0] = ljj > 0 ? (currentLine[0] - sum) / (ljj + 1e-7f) : 0.0f;
}
if (x + SUB_GROUP_SIZE == col) {
  currentLine[1] = ljj > 0 ? (currentLine[1] - sum) / (ljj + 1e-7f) : 0.0f;
}

This code works as expected, without the excess dynamic instruction count.

To Reproduce

The reproducer is attached: cholesky.zip

$ cd cholesky
$ export SYCL_PROGRAM_COMPILE_OPTIONS=" -cl-fp32-correctly-rounded-divide-sqrt "
$ ./build_loop.sh # Inefficient version
$ ./cholesky_loop
Target device: Intel(R) HD Graphics 530 [0x1912]
Rows: 64 Matricies:  1176 Min:  41.9308 Avg:  42.2453 Max:  42.7523 [ms]
$ ./build_unroll.sh # Version with workaround
$ ./cholesky_unroll
Target device: Intel(R) HD Graphics 530 [0x1912]
Rows: 64 Matricies:  1176 Min:  36.5261 Avg:  36.7250 Max:  36.9940 [ms]

Environment (please complete the following information):

Update: this issue was fixed for Intel graphics but is still relevant for NVIDIA GPUs.

To Reproduce

The reproducer is attached: cholesky_nv.zip

$ unzip cholesky_nv.zip
$ cd cholesky_nv
$ export SYCL_DEVICE_FILTER=cuda
$ ./build_loop.sh # Inefficient version
$ ./cholesky_loop
Target device: NVIDIA GeForce RTX 2070
Rows: 64 Matricies:  1176 Min:   2.7296 Avg:   3.3601 Max:   3.3983 [ms]
$ ./build_unroll.sh # Version with workaround
$ ./cholesky_unroll
Target device: NVIDIA GeForce RTX 2070
Rows: 64 Matricies:  1176 Min:   2.5495 Avg:   2.7219 Max:   2.7673 [ms]
github-actions[bot] commented 2 years ago

This issue is stale because it has been open 180 days with no activity. Remove stale label or comment or this will be automatically closed in 30 days.

anton-v-gorshkov commented 2 years ago

This issue was fixed for Intel GPUs but still relevant for NVidia - so we have to fix it I guess.

bader commented 2 years ago

This issue was fixed for Intel GPUs but still relevant for NVidia - so we have to fix it I guess.

The ticket description doesn't mention that this is an issue for NVidia. Could you check whether that's the case and update the description, please?

anton-v-gorshkov commented 2 years ago

Hi Alexey! Thanks for noticing that, updated.

jchlanda commented 2 years ago

Hi @anton-v-gorshkov, I've looked into the issue a bit. The problem is not the compiler failing to unroll the loop; it is being correctly unrolled, as this debug snippet from the loop unroll pass shows:

Loop Unroll: F[_Z25CholeskyDecompositionImplPfiN2cl4sycl7nd_itemILi1EEE] Loop %for.body109
  Loop Size = 21
  Exiting block %if.end126: TripCount=2, TripMultiple=0, BreakoutTrip=0
COMPLETELY UNROLLING loop %for.body109 with trip count 2!

I've changed the floating-point constants so it's easier to chase the values:

#ifdef LOOP
#pragma unroll
      for (int k = 0; k < N; ++k) {
        const int colIdx = x + 32 * k;
        if (colIdx == col) {
          currentLine[k] =
              ljj > 0 ? (currentLine[k] - sum) / (ljj + 5.0f) : 6.0f;
        }
      }
#elif defined MANUAL_UNROLLING
      if (x == col) {
        currentLine[0] =
            ljj > 0 ? (currentLine[0] - sum) / (ljj + 5.0f) : 6.0f;
      }
      if (x + 32 == col) {
        currentLine[1] =
            ljj > 0 ? (currentLine[1] - sum) / (ljj + 5.0f) : 6.0f;
      }
#else

The real problem here is that the manually unrolled code has significantly more complicated control flow. The snippet in question ends up being split into 7 basic blocks (manual unroll) vs 5 (default). I'm going to use IR here, but exactly the same principles can be observed in the generated PTX. The manually unrolled version comes first, followed by the default (compiler-unrolled) one.

_Z13ShuffleReduceIf8TCudaAddIfEET_iS2_iN2cl4sycl7nd_itemILi1EEET0_.exit392: ; preds = %for.body.i391, %if.end91
  %val.addr.0.lcssa.i385 = phi float [ %tmp.1, %if.end91 ], [ %add.i.i389, %for.body.i391 ]
  %25 = bitcast float %val.addr.0.lcssa.i385 to i32
  %26 = tail call i32 @llvm.nvvm.shfl.sync.idx.i32(i32 -1, i32 %25, i32 %rem411, i32 31)
  %27 = bitcast i32 %26 to float
  %28 = load volatile float, float* %arrayidx, align 4, !tbaa !16
  %cmp106 = icmp eq i32 %conv7, %col.0412
  br i1 %cmp106, label %if.then107, label %if.end117

if.then107:                                                 ; preds = %_Z13ShuffleReduceIf8TCudaAddIfEET_iS2_iN2cl4sycl7nd_itemILi1EEET0_.exit392 
  %cmp108 = fcmp ogt float %28, 0.000000e+00
  br i1 %cmp108, label %cond.true109, label %if.end117

cond.true109:                                             ; preds = %if.then107
  %sub = fsub float %currentLine.sroa.0.0414, %27
  %add111 = fadd float %28, 5.000000e+00
  %div112 = fdiv float %sub, %add111
  br label %if.end117

if.end117:                                        ; preds = %cond.true109, %if.then107, %_Z13ShuffleReduceIf8TCudaAddIfEET_iS2_iN2cl4sycl7nd_itemILi1EEET0_.exit392
  %currentLine.sroa.0.1 = phi float [ %currentLine.sroa.0.0414, %_Z13ShuffleReduceIf8TCudaAddIfEET_iS2_iN2cl4sycl7nd_itemILi1EEET0_.exit392 ], [ %div112, %cond.true109 ], [ 6.000000e+00, %if.then107 ]
  %cmp119 = icmp eq i32 %add25, %col.0412
  br i1 %cmp119, label %if.then120, label %if.end131

if.then120:                                       ; preds = %if.end117 ; if ljj > 0
  %cmp121 = fcmp ogt float %28, 0.000000e+00
  br i1 %cmp121, label %cond.true122, label %if.end131

cond.true122:                                     ; preds = %if.then120
  %sub124 = fsub float %currentLine.sroa.11.0415, %27
  %add125 = fadd float %28, 5.000000e+00
  %div126 = fdiv float %sub124, %add125
  br label %if.end131

if.end131:                                        ; preds = %cond.true122, %if.then120, %if.end117
  %currentLine.sroa.11.1 = phi float [ %currentLine.sroa.11.0415, %if.end117 ], [ %div126, %cond.true122 ], [ 6.000000e+00, %if.then120 ]
  tail call void @llvm.nvvm.bar.warp.sync(i32 -1) #6
  %inc = add nuw nsw i32 %col.0412, 1
  %exitcond.not = icmp eq i32 %inc, %row.0416
  br i1 %exitcond.not, label %for.cond.cleanup43, label %for.body44, !llvm.loop !23

The default (compiler-unrolled) version, for comparison:

_Z13ShuffleReduceIf8TCudaAddIfEET_iS2_iN2cl4sycl7nd_itemILi1EEET0_.exit378: ; preds = %for.body.i377, %if.end91
  %val.addr.0.lcssa.i371 = phi float [ %tmp.1, %if.end91 ], [ %add.i.i375, %for.body.i377 ]
  %25 = bitcast float %val.addr.0.lcssa.i371 to i32
  %26 = tail call i32 @llvm.nvvm.shfl.sync.idx.i32(i32 -1, i32 %25, i32 %rem397, i32 31)
  %27 = bitcast i32 %26 to float
  %28 = load volatile float, float* %arrayidx, align 4, !tbaa !16
  %cmp115 = fcmp ogt float %28, 0.000000e+00
  %add119 = fadd float %28, 5.000000e+00
  %cmp113 = icmp ne i32 %conv7, %col.0399
  %cmp115.not = xor i1 %cmp115, true
  %brmerge = select i1 %cmp113, i1 true, i1 %cmp115.not
  %currentLine.sroa.0.0.mux = select i1 %cmp113, float %currentLine.sroa.0.0, float 6.000000e+00
  br i1 %brmerge, label %if.end126, label %cond.true116

cond.true116:                                     ; preds = %_Z13ShuffleReduceIf8TCudaAddIfEET_iS2_iN2cl4sycl7nd_itemILi1EEET0_.exit378
  %sub = fsub float %currentLine.sroa.0.0, %27
  %div120 = fdiv float %sub, %add119
  br label %if.end126

if.end126:                                        ; preds = %_Z13ShuffleReduceIf8TCudaAddIfEET_iS2_iN2cl4sycl7nd_itemILi1EEET0_.exit378, %cond.true116
  %currentLine.sroa.0.1 = phi float [ %currentLine.sroa.0.0.mux, %_Z13ShuffleReduceIf8TCudaAddIfEET_iS2_iN2cl4sycl7nd_itemILi1EEET0_.exit378 ], [ %div120, %cond.true116 ]
  %cmp113.1 = icmp ne i32 %add25, %col.0399
  %cmp115.not413 = xor i1 %cmp115, true
  %brmerge414 = select i1 %cmp113.1, i1 true, i1 %cmp115.not413
  %currentLine.sroa.9.0.mux = select i1 %cmp113.1, float %currentLine.sroa.9.0, float 6.000000e+00
  br i1 %brmerge414, label %if.end126.1, label %cond.true116.1

cond.true116.1:                                   ; preds = %if.end126
  %sub.1 = fsub float %currentLine.sroa.9.0, %27
  %div120.1 = fdiv float %sub.1, %add119
  br label %if.end126.1

if.end126.1:                                      ; preds = %if.end126, %cond.true116.1
  %currentLine.sroa.9.1 = phi float [ %currentLine.sroa.9.0.mux, %if.end126 ], [ %div120.1, %cond.true116.1 ]
  tail call void @llvm.nvvm.bar.warp.sync(i32 -1) #6
  %inc130 = add nuw nsw i32 %col.0399, 1
  %exitcond.not = icmp eq i32 %inc130, %row.0401
  br i1 %exitcond.not, label %for.cond.cleanup43, label %for.body44, !llvm.loop !23

In the default (compiler-unrolled) case, notice how the updates to currentLine[0] and currentLine[1] (%currentLine.sroa.0.0.mux and %currentLine.sroa.9.0.mux, respectively) are always performed, since both _Z13ShuffleReduceIf8TCudaAddIfEET_iS2_iN2cl4sycl7nd_itemILi1EEET0_.exit378 and if.end126 execute unconditionally. In the manually unrolled snippet, by contrast, the corresponding values are updated conditionally and a single phi gathers the final value (%currentLine.sroa.0.1 and %currentLine.sroa.11.1).
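
To make that concrete, here is a C-like sketch of what the default, compiler-unrolled IR above computes for the k == 0 element. This is an assumed reconstruction, not the actual codegen; the function and parameter names are invented, and the constants follow the modified snippet (5.0f / 6.0f):

// Assumed reconstruction of the default (compiler-unrolled) IR for k == 0.
void default_unrolled_k0(float *currentLine, int x, int col, float sum, float ljj) {
  const bool  gt0 = ljj > 0.0f;                  // %cmp115, evaluated unconditionally
  const float den = ljj + 5.0f;                  // %add119, evaluated unconditionally
  const float mux = (x != col) ? currentLine[0]  // %currentLine.sroa.0.0.mux
                               : 6.0f;
  // Only the subtract/divide sit behind a branch (%cond.true116); the compare,
  // the add and the select are executed on every iteration and merged by a phi.
  currentLine[0] = (x == col && gt0) ? (currentLine[0] - sum) / den  // %div120
                                     : mux;                          // phi in %if.end126
}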

There are a couple of points worth mentioning here. Given what unrolling is meant to achieve, above all saving the jump back to the loop body, I'm not sure the compiler is in the wrong for not generating more complicated control flow.

This optimisation only really works because the trip count equals the unroll factor and the innermost if can be broken down into exactly two separate conditions (multiply by 0 or not; again, the number of cases equals the trip count). I'm not sure how easy it would be to generalize: it's almost as if we asked the compiler to recognise that a loop over 2 indices is equivalent to an if/else block, and there seem to be better ways of writing conditionals :)

Finally, this is not strictly oneAPI-specific, and any work that could be done here would end up upstream, so perhaps it would be better to create a synthetic example and submit it to the upstream bug tracker?
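
For what it's worth, a minimal sketch of such a synthetic example, assembled from the snippets above (plain C++, no SYCL; the function names are invented), might look like this:

// Hypothetical synthetic reproducer: the question for upstream would be whether
// the unrolled loop lowers to the same code as the hand-written ifs.
constexpr int N = 2;
constexpr int SUB_GROUP_SIZE = 32;

void loop_version(float *line, int x, int col, float sum, float ljj) {
#pragma unroll
  for (int k = 0; k < N; ++k) {
    const int colIdx = x + SUB_GROUP_SIZE * k;
    if (colIdx == col)
      line[k] = ljj > 0 ? (line[k] - sum) / (ljj + 1e-7f) : 0.0f;
  }
}

void manual_version(float *line, int x, int col, float sum, float ljj) {
  if (x == col)
    line[0] = ljj > 0 ? (line[0] - sum) / (ljj + 1e-7f) : 0.0f;
  if (x + SUB_GROUP_SIZE == col)
    line[1] = ljj > 0 ? (line[1] - sum) / (ljj + 1e-7f) : 0.0f;
}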

Would you be able to share how this was tackled for Intel GPUs, maybe there is something that could be generalized?

anton-v-gorshkov commented 2 years ago

Thanks @jchlanda! I'm not a compiler guy, so we need to talk to Intel compiler folks here. @bader and @AlexeySachkov may help. Also note that due to Intel operations suspension in Russia, local folks (including myself) may be unavailable.

bader commented 2 years ago

Would you be able to share how this was tackled for Intel GPUs, maybe there is something that could be generalized?

Thanks @jchlanda! I'm not a compiler guy, so we need to talk to Intel compiler folks here. @bader and @AlexeySachkov may help. Also note that due to Intel operations suspension in Russia, local folks (including myself) may be unavailable.

I found that the problem with compilation for Intel GPUs was incorrect handling of unroll metadata in the SPIR-V translator, which we fixed with https://github.com/KhronosGroup/SPIRV-LLVM-Translator/pull/1081/. NOTE: DPC++ leaves loop optimizations to IGC, so to debug the optimization process for Intel GPUs you should debug the IGC compiler: https://github.com/intel/intel-graphics-compiler
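
For context, a sketch of where that metadata comes from (illustrative only; the function and constant below are invented, and the metadata names are taken from the LLVM docs rather than from this PR):

// The pragma below is lowered by clang to "llvm.loop.unroll.*" metadata attached
// to the loop's latch branch; the SPIR-V translator has to round-trip that
// metadata so that IGC, which DPC++ relies on for loop optimizations, still sees
// the unroll request.
void scale(float *v, int n) {
#pragma unroll 2
  for (int i = 0; i < n; ++i)
    v[i] *= 2.0f;
}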

jchlanda commented 1 year ago

I've revisited this ticket and decided to close it. The problem is not LLVM failing to unroll, but the quality of the source code and the fact that the workaround exploits a peculiarity of this particular sample.

I don't see any other valid transformation of the loop that would improve the code. It's not a simple invariant-code-motion-plus-loop-versioning problem, because the predicate of the condition depends on the iteration. While I agree that such a transformation might be of value in this specific case, it would not improve code size, is unlikely to improve dynamic instruction count in the general case, and would ultimately just work around the inefficient control flow in the original code sample.
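
For context, a minimal sketch (illustrative only; names invented, not a transformation of this kernel) of the kind of invariant-test hoisting that loop unswitching/versioning performs, and why it does not apply to the iteration-dependent guard here:

// Loop unswitching/versioning can hoist a loop-invariant test such as ljj > 0
// and emit one loop per outcome of the condition.
void unswitched(float *line, int n, float sum, float ljj) {
  if (ljj > 0) {
    for (int k = 0; k < n; ++k)
      line[k] = (line[k] - sum) / (ljj + 1e-7f);
  } else {
    for (int k = 0; k < n; ++k)
      line[k] = 0.0f;
  }
}
// The guard in the kernel, (x + SUB_GROUP_SIZE * k) == col, depends on k itself,
// so it is not loop-invariant and cannot be hoisted this way.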