halide / Halide

a language for fast, portable data-parallel computation
https://halide-lang.org

Loops get duplicated in partition loops to "simplify" boundary conditions such as repeat_edge, which I believe is not always desirable. #7881

mcourteaux closed this issue 1 year ago

mcourteaux commented 1 year ago

The title describes what happens well, and I think most of us know that it does, but here is an example anyway. This loop computes a 5x5 convolution:

produce denoise_conv_noisy {
 denoise_conv_noisy[0] = 0.000000f
 for (denoise_conv_noisy.s1.rdom_conv_noisy$y, -2, 5) {
  for (denoise_conv_noisy.s1.rdom_conv_noisy$x, -2, 5) {
   denoise_conv_noisy[0] = denoise_conv_noisy[0] + ((blurring_filters_noisy[((blurring_filters_noisy.stride.1 + blurring_filters_noisy.stride.2)*2) + ((blurring_filters_noisy.stride.2*denoise_conv_noisy.s1.rdom_conv_noisy$y) + ((blurring_filters_noisy.stride.1*denoise_conv_noisy.s1.rdom_conv_noisy$x) + denoise_accum.s1.rdom_filt...float32(noisy[(min(likely(denoised.s0.ci), noisy.extent.2 + -1)*noisy.stride.2) + (max(min(likely((((denoised.s0.x.xi.bx.__block_id_x*8) + denoised.s0.x.xi.tx.__thread_id_x) + denoised.s0.x.xi.base) + denoise_conv_noisy.s1.rdom_conv_noisy$x), noisy.extent.0 + -1), 0) + (max(min(likely((((denoised.s0.y.yi.by.__block_id_y*8) + denoised.s0.y.yi.ty.__thread_id_y) + denoised.s0.y.yi.base) + denoise_conv_noisy.s1.rdom_conv_noisy$y), noisy.extent.1 + -1), 0)*noisy.stride.1))]))*0.000015f)
  }
 }
}

During lowering, this gets "loop partitioned" to:

produce denoise_conv_noisy {
 denoise_conv_noisy.3[0] = 0.000000f
 let denoise_conv_noisy.s1.rdom_conv_noisy$y.epilogue = max(min(noisy.extent.1 - (((denoised.s0.y.yi.by.__block_id_y*8) + denoised.s0.y.yi.ty.__thread_id_y) + denoised.s0.y.yi.base), 3), 0 - max(min(((denoised.s0.y.yi.by.__block_id_y*8) + denoised.s0.y.yi.ty.__thread_id_y) + denoised.s0.y.yi.base, 2), -3))
 for (denoise_conv_noisy.s1.rdom_conv_noisy$y, -2, 2 - max(min(((denoised.s0.y.yi.by.__block_id_y*8) + denoised.s0.y.yi.ty.__thread_id_y) + denoised.s0.y.yi.base, 2), -3)) {
  for (denoise_conv_noisy.s1.rdom_conv_noisy$x, -2, 5) {
   denoise_conv_noisy.3[0] = denoise_conv_noisy.3[0] + ((blurring_filters_noisy[((blurring_filters_noisy.stride.1 + blurring_filters_noisy.stride.2)*2) + ((blurring_filters_noisy.stride.2*denoise_conv_noisy.s1.rdom_conv_noisy$y) + ((blurring_filters_noisy.stride.1*denoise_conv_noisy.s1.rdom_conv_noisy$x) + denoise_accum.s1.rdom_...float32(noisy[(denoised.s0.ci*noisy.stride.2) + (max(min((((denoised.s0.x.xi.bx.__block_id_x*8) + denoised.s0.x.xi.tx.__thread_id_x) + denoised.s0.x.xi.base) + denoise_conv_noisy.s1.rdom_conv_noisy$x, noisy.extent.0 + -1), 0) + (max(min((((denoised.s0.y.yi.by.__block_id_y*8) + denoised.s0.y.yi.ty.__thread_id_y) + denoised.s0.y.yi.base) + denoise_conv_noisy.s1.rdom_conv_noisy$y, noisy.extent.1 + -1), 0)*noisy.stride.1))]))*0.000015f)
  }
 }
 for (denoise_conv_noisy.s1.rdom_conv_noisy$y, 0 - max(min(((denoised.s0.y.yi.by.__block_id_y*8) + denoised.s0.y.yi.ty.__thread_id_y) + denoised.s0.y.yi.base, 2), -3), max(min(((denoised.s0.y.yi.by.__block_id_y*8) + denoised.s0.y.yi.ty.__thread_id_y) + denoised.s0.y.yi.base, 2), -3) + denoise_conv_noisy.s1.rdom_conv_noisy$y.epilogue) {
  let denoise_conv_noisy.s1.rdom_conv_noisy$x.epilogue = max(min(noisy.extent.0 - (((denoised.s0.x.xi.bx.__block_id_x*8) + denoised.s0.x.xi.tx.__thread_id_x) + denoised.s0.x.xi.base), 3), 0 - max(min(((denoised.s0.x.xi.bx.__block_id_x*8) + denoised.s0.x.xi.tx.__thread_id_x) + denoised.s0.x.xi.base, 2), -3))
  for (denoise_conv_noisy.s1.rdom_conv_noisy$x, -2, 2 - max(min(((denoised.s0.x.xi.bx.__block_id_x*8) + denoised.s0.x.xi.tx.__thread_id_x) + denoised.s0.x.xi.base, 2), -3)) {
   denoise_conv_noisy.3[0] = denoise_conv_noisy.3[0] + ((blurring_filters_noisy[((blurring_filters_noisy.stride.1 + blurring_filters_noisy.stride.2)*2) + ((blurring_filters_noisy.stride.2*denoise_conv_noisy.s1.rdom_conv_noisy$y) + ((blurring_filters_noisy.stride.1*denoise_conv_noisy.s1.rdom_conv_noisy$x) + denoise_accum.s1.rdom_...float32(noisy[(denoised.s0.ci*noisy.stride.2) + (max(min((((denoised.s0.x.xi.bx.__block_id_x*8) + denoised.s0.x.xi.tx.__thread_id_x) + denoised.s0.x.xi.base) + denoise_conv_noisy.s1.rdom_conv_noisy$x, noisy.extent.0 + -1), 0) + (((((denoised.s0.y.yi.by.__block_id_y*8) + denoised.s0.y.yi.ty.__thread_id_y) + denoised.s0.y.yi.base) + denoise_conv_noisy.s1.rdom_conv_noisy$y)*noisy.stride.1))]))*0.000015f)
  }
  for (denoise_conv_noisy.s1.rdom_conv_noisy$x, 0 - max(min(((denoised.s0.x.xi.bx.__block_id_x*8) + denoised.s0.x.xi.tx.__thread_id_x) + denoised.s0.x.xi.base, 2), -3), max(min(((denoised.s0.x.xi.bx.__block_id_x*8) + denoised.s0.x.xi.tx.__thread_id_x) + denoised.s0.x.xi.base, 2), -3) + denoise_conv_noisy.s1.rdom_conv_noisy$x.epilogue) {
   denoise_conv_noisy.3[0] = denoise_conv_noisy.3[0] + ((blurring_filters_noisy[((blurring_filters_noisy.stride.1 + blurring_filters_noisy.stride.2)*2) + ((blurring_filters_noisy.stride.2*denoise_conv_noisy.s1.rdom_conv_noisy$y) + ((blurring_filters_noisy.stride.1*denoise_conv_noisy.s1.rdom_conv_noisy$x) + denoise_accum.s1.rdom_...float32(noisy[(denoised.s0.ci*noisy.stride.2) + ((((((denoised.s0.y.yi.by.__block_id_y*8) + denoised.s0.y.yi.ty.__thread_id_y) + denoised.s0.y.yi.base) + denoise_conv_noisy.s1.rdom_conv_noisy$y)*noisy.stride.1) + ((((denoised.s0.x.xi.bx.__block_id_x*8) + denoised.s0.x.xi.tx.__thread_id_x) + denoised.s0.x.xi.base) + denoise_conv_noisy.s1.rdom_conv_noisy$x))]))*0.000015f)
  }
  for (denoise_conv_noisy.s1.rdom_conv_noisy$x, denoise_conv_noisy.s1.rdom_conv_noisy$x.epilogue, 3 - denoise_conv_noisy.s1.rdom_conv_noisy$x.epilogue) {
   denoise_conv_noisy.3[0] = denoise_conv_noisy.3[0] + ((blurring_filters_noisy[((blurring_filters_noisy.stride.1 + blurring_filters_noisy.stride.2)*2) + ((blurring_filters_noisy.stride.2*denoise_conv_noisy.s1.rdom_conv_noisy$y) + ((blurring_filters_noisy.stride.1*denoise_conv_noisy.s1.rdom_conv_noisy$x) + denoise_accum.s1.rdom_...float32(noisy[(denoised.s0.ci*noisy.stride.2) + (max(min((((denoised.s0.x.xi.bx.__block_id_x*8) + denoised.s0.x.xi.tx.__thread_id_x) + denoised.s0.x.xi.base) + denoise_conv_noisy.s1.rdom_conv_noisy$x, noisy.extent.0 + -1), 0) + (((((denoised.s0.y.yi.by.__block_id_y*8) + denoised.s0.y.yi.ty.__thread_id_y) + denoised.s0.y.yi.base) + denoise_conv_noisy.s1.rdom_conv_noisy$y)*noisy.stride.1))]))*0.000015f)
  }
 }
 for (denoise_conv_noisy.s1.rdom_conv_noisy$y, denoise_conv_noisy.s1.rdom_conv_noisy$y.epilogue, 3 - denoise_conv_noisy.s1.rdom_conv_noisy$y.epilogue) {
  for (denoise_conv_noisy.s1.rdom_conv_noisy$x, -2, 5) {
   denoise_conv_noisy.3[0] = denoise_conv_noisy.3[0] + ((blurring_filters_noisy[((blurring_filters_noisy.stride.1 + blurring_filters_noisy.stride.2)*2) + ((blurring_filters_noisy.stride.2*denoise_conv_noisy.s1.rdom_conv_noisy$y) + ((blurring_filters_noisy.stride.1*denoise_conv_noisy.s1.rdom_conv_noisy$x) + denoise_accum.s1.rdom_...float32(noisy[(denoised.s0.ci*noisy.stride.2) + (max(min((((denoised.s0.x.xi.bx.__block_id_x*8) + denoised.s0.x.xi.tx.__thread_id_x) + denoised.s0.x.xi.base) + denoise_conv_noisy.s1.rdom_conv_noisy$x, noisy.extent.0 + -1), 0) + (max(min((((denoised.s0.y.yi.by.__block_id_y*8) + denoised.s0.y.yi.ty.__thread_id_y) + denoised.s0.y.yi.base) + denoise_conv_noisy.s1.rdom_conv_noisy$y, noisy.extent.1 + -1), 0)*noisy.stride.1))]))*0.000015f)
  }
 }
}

This partitioning isolates the iterations where the boundary condition has no effect, so the steady-state code can run at full speed without evaluating the min/max expressions. The optimization makes sense when processing vectorized on CPU, where significant work is needed near the boundaries to keep the code vectorized and to duplicate values into the lanes of the vector register. It especially makes sense for large regions, where the boundary cases are negligible.
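
For reference, a minimal sketch of the kind of algorithm that triggers this (hypothetical names; the real pipeline is larger): BoundaryConditions::repeat_edge produces clamp(likely(x), ...) indices, and it is the likely() marker that the lowering pass uses to decide where to partition.

    #include "Halide.h"
    using namespace Halide;

    int main() {
        // Hypothetical stand-in for the real pipeline's input.
        ImageParam noisy(Float(32), 2, "noisy");

        // repeat_edge clamps the coordinates and tags them with likely(),
        // which is what drives loop partitioning during lowering.
        Func clamped = BoundaryConditions::repeat_edge(noisy);

        // A 5x5 convolution over a reduction domain, as in the Stmt above
        // (a fixed weight stands in for the real filter taps).
        RDom r(-2, 5, -2, 5, "rdom_conv_noisy");
        Var x("x"), y("y");
        Func conv("denoise_conv_noisy");
        conv(x, y) = 0.f;
        conv(x, y) += clamped(x + r.x, y + r.y) * 0.000015f;

        // Inspect the lowered Stmt to see the partitioned loops.
        conv.compile_to_lowered_stmt("conv.stmt", {noisy});
        return 0;
    }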

However, I believe this can be (very) undesirable in several situations:

  1. When the code is not vectorized, the two min/max operations are cheap anyway.
  2. When code size needs to be small.
  3. When programming GPUs, this creates many divergent code paths that cannot execute simultaneously, if I understand correctly.
  4. When branch mispredictions start ruining performance.
  5. When the bounds are small, and the benefit is minimal.
  6. When working on the schedule, this causes a lot of duplication, which makes the Stmt code hard to digest.

Overall, after having thought about this for a while and having encountered it several times over the past years, I think this should be controllable, somehow as part of the schedule. Halide's scheduling mechanism exists precisely to control how the loops behave; this lowering pass is currently out of our control, and I believe it is not always useful.


Regarding point (1): even when the code is vectorized over x, it can still make sense NOT to loop partition the y for-loop. The cost of clamp(y, lowerbound_y, upperbound_y - 1) becomes negligible compared to the cost of the for (x) loop inside it.


For context, this became evident when looking at the conceptual GPU Stmt:

(screenshot of the conceptual GPU Stmt omitted)

This code does not look SIMT-friendly at all: a lot of divergent branches, and a lot of code.

mcourteaux commented 1 year ago

I implemented a new scheduling directive, .disallow_partitioning(), to test this. I will turn it into a PR for review: #7882.

I successfully got the IR back to the un-partitioned form (screenshot omitted) by adding:

        denoise_conv_noisy.update()
            .disallow_partitioning(rdom_conv_noisy.x)
            .disallow_partitioning(rdom_conv_noisy.y)
            ;

to my schedule.

abadams commented 1 year ago

As a workaround, this is something you can control in the algorithm by rolling your own boundary conditions that don't include the "likely" intrinsic. I know of some codebases that do this to get loop partitioning in x but not in y.
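
A sketch of that workaround (assuming the input is a 2D ImageParam named noisy; names are illustrative): clamp y by hand with no likely(), so y is never partitioned, while keeping a likely()-tagged clamp in x so that x still is.

    // Hand-rolled boundary condition: repeat_edge behavior, but only the
    // x clamp carries the likely() hint that triggers loop partitioning.
    Var x("x"), y("y");
    Func clamped("clamped");
    clamped(x, y) = noisy(clamp(likely(x), 0, noisy.dim(0).extent() - 1),
                          clamp(y, 0, noisy.dim(1).extent() - 1));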

I agree that it should be controllable in the schedule though.

abadams commented 1 year ago

As a further use for this: sometimes Halide picks the wrong loop to partition. E.g., if you have a tiled loop and you want to remove a boundary condition in y, you can partition the loop over tiles (the outer loop), or you can partition the loop over the rows of a tile (the inner loop). I think we currently just default to partitioning the outermost one that will simplify away the likely. You should be able to choose, though, and my "workaround" doesn't address this case at all.
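
To illustrate (a hypothetical schedule, not taken from this issue): with a tiling like the one below and a boundary condition in y, the partitioner could split either yo (the loop over tiles) or yi (the rows within a tile); today it defaults to the outermost loop where the likely() simplifies away.

    // Hypothetical tiled schedule for some output Func "out": either yo
    // or yi could be partitioned to isolate the rows where the y
    // boundary condition matters.
    Var x("x"), y("y"), xo("xo"), yo("yo"), xi("xi"), yi("yi");
    out.tile(x, y, xo, yo, xi, yi, 64, 64);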

abadams commented 1 year ago

Fixed by #7914
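
For reference, the merged directive exposes partitioning control per loop dimension; a minimal sketch of how the example above might look with it, assuming the never_partition() helper introduced in #7914:

    // Equivalent of the .disallow_partitioning() experiment above,
    // expressed with the scheduling directive that was merged.
    denoise_conv_noisy.update()
        .never_partition(rdom_conv_noisy.x)
        .never_partition(rdom_conv_noisy.y);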