I implemented a new scheduling directive, `.disallow_partitioning()`, to test this. I will pour this into a PR for review: #7882.
I successfully got the IR back to the unpartitioned form by adding:
```c++
denoise_conv_noisy.update()
    .disallow_partitioning(rdom_conv_noisy.x)
    .disallow_partitioning(rdom_conv_noisy.y);
```
to my schedule.
As a workaround, this is something you can control in the algorithm, by rolling your own boundary conditions that don't include the `likely` intrinsic. I know of some codebases that do this to get loop partitioning in x but not y.
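A minimal sketch of that workaround, assuming a 2D `ImageParam` and illustrative names (none of this is from the original post): clamp the coordinates yourself instead of using `BoundaryConditions::repeat_edge()`, whose clamps are wrapped in `likely()` and therefore trigger partitioning. Wrapping only the x clamp in `likely()` gets partitioning in x but not y.

```c++
#include "Halide.h"
using namespace Halide;

// Hypothetical input; names are illustrative.
ImageParam input(Float(32), 2);
Var x("x"), y("y");
Expr W = input.dim(0).extent(), H = input.dim(1).extent();

// No likely() anywhere: neither loop gets partitioned.
Func clamped("clamped");
clamped(x, y) = input(clamp(x, 0, W - 1), clamp(y, 0, H - 1));

// likely() on the x clamp only: Halide partitions the x loop but not y.
Func partial("partial");
partial(x, y) = input(clamp(likely(x), 0, W - 1), clamp(y, 0, H - 1));
```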
I agree that it should be controllable in the schedule though.
As a further use for this, sometimes Halide picks the wrong loop to partition. E.g. if you have a tiled loop and you want to remove a boundary condition in y, you can partition the loop over tiles (the outer loop), or you can partition the loop over the rows of a tile (the inner loop). I think currently we just default to partitioning the outermost one that will simplify away the `likely`. You should be able to choose, though, and my "workaround" doesn't address this case at all. A sketch of the tiled case follows below.
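For concreteness, a minimal sketch of that tiled case (illustrative names and expressions, not from the original discussion): after tiling, both the outer and the inner y loop are candidates for partitioning.

```c++
#include "Halide.h"
using namespace Halide;

ImageParam input(Float(32), 2);
Var x("x"), y("y"), xo("xo"), yo("yo"), xi("xi"), yi("yi");

// Boundary condition in y only; likely() makes y a partitioning candidate.
Func in_b("in_b");
in_b(x, y) = input(x, clamp(likely(y), 0, input.dim(1).extent() - 1));

Func f("f");
f(x, y) = in_b(x, y - 1) + in_b(x, y) + in_b(x, y + 1);
f.tile(x, y, xo, yo, xi, yi, 64, 64);
// After tiling, partitioning yo (the loop over tile rows) peels whole rows
// of tiles at the image border, while partitioning yi (rows within a tile)
// instead peels a few rows inside every tile. Halide picks one for you;
// there is currently no way to choose here.
```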
Fixed by #7914
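For reference, a sketch of the control that eventually landed, assuming the `partition`/`never_partition` API from #7914 (check the Halide docs for the exact signatures):

```c++
// Assumed form of the landed API: opt individual loops out of partitioning.
denoise_conv_noisy.update()
    .partition(rdom_conv_noisy.x, Partition::Never)
    .never_partition(rdom_conv_noisy.y);  // convenience equivalent
```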
The title describes well what happens, and I think most of us know that it happens, but here is an example anyway. Here is a loop that computes a 5x5 convolution:
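(The original snippet is not preserved here; this is a minimal source-level reconstruction of a pipeline that lowers to such a loop, with illustrative names.)

```c++
#include "Halide.h"
using namespace Halide;

ImageParam input(Float(32), 2);
Var x("x"), y("y");

// repeat_edge clamps coordinates and wraps them in likely(), which is
// what later triggers loop partitioning.
Func in_b = BoundaryConditions::repeat_edge(input);

// 5x5 convolution accumulated over a reduction domain.
RDom r(-2, 5, -2, 5, "r");
Func conv("conv");
conv(x, y) = 0.0f;
conv(x, y) += in_b(x + r.x, y + r.y);
```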
Which, during lowering, gets "loop partitioned" to:
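(Again, the original Stmt dump is not preserved; this sketch only shows the characteristic three-way split that partitioning produces, with simplified loop names. Halide loops are written `for (name, min, extent)`.)

```
// Illustrative shape only, not the actual dump: the y loop is split into
// a prologue/epilogue that keep the boundary clamps, and a steady state
// where the likely() branches have been simplified away.
for (y, 0, 2)          { ...clamped loads (top edge)... }
for (y, 2, extent - 4) { ...direct loads, no min/max... }
for (y, extent - 2, 2) { ...clamped loads (bottom edge)... }
```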
This isolates the cases where the boundary condition doesn't impact anything, so the code can run at full speed without evaluating the min/max expressions. These optimizations make sense when you are processing vectorized on CPU, where significant work is needed near the boundaries to keep the code vectorized and duplicate edge values into the lanes of the vector register. This partitioning especially makes sense for large regions, where the boundary portions are negligible.
However, I believe this can be (very) undesirable in several situations:
Overall, after having thought about this for a while, and having run into these situations several times over the past years, I think this should be controllable and somehow part of the schedule. Halide's scheduling mechanics exist exactly to control how the loops behave; this lowering pass is out of our control right now, and I believe it is not always useful.
Regarding point (1), even when code is vectorized over `x`, it still makes sense to NOT loop partition the `y` for-loop. The cost of the `clamp(y, lowerbound_y, upperbound_y - 1)` becomes super cheap compared to the `for (x)` loop inside of it.

For context, this became evident when looking at the GPU conceptual Stmt, which shows this:
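(The actual Stmt dump is not preserved; the sketch below only illustrates the shape being described: on the GPU, the partitioned cases end up as branches inside the thread loops rather than as separate loops, so every thread carries all the variants.)

```
// Illustrative shape only, not the actual dump.
gpu_block (y.block_id_y) {
  gpu_thread (x.thread_id_x) {
    if (near_top_boundary) {
      ...clamped loads...
    } else if (in_steady_state) {
      ...fast path, no clamps...
    } else {
      ...clamped loads (other edge)...
    }
  }
}
```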
This code does not seem SIMT-friendly at all: many divergent branches, and a lot of code.