iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.
http://iree.dev/
Apache License 2.0
2.78k stars 603 forks

Scheduling multiple linalg.depthwise_conv_2d_nhwc_hwc and non-trivial tensor.pad ops into one flow dispatch region. #11549

Open dpackwood opened 1 year ago

dpackwood commented 1 year ago

Hi

I am interested in potentially grouping multiple convolutions and non-trivial pads into the same flow dispatch region.

The simplest example would be something like this: conv_fuse_test_case.mlir.txt. This represents an implementation of a separable depthwise convolution, as might be used in many image-processing tasks.
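For context, separability means a KxK filter that factors into a 1xK horizontal pass followed by a Kx1 vertical pass per channel. A minimal pure-Python model of the "valid" (unpadded) case, purely illustrative and unrelated to any IREE API:

```python
def conv1d_valid(row, kernel):
    """'Valid' 1-D convolution (really correlation): no padding, output shrinks."""
    k = len(kernel)
    return [sum(row[i + j] * kernel[j] for j in range(k))
            for i in range(len(row) - k + 1)]

def separable_conv2d(image, kh, kv):
    """Horizontal 1xK pass, then vertical Kx1 pass, modeling one channel."""
    horiz = [conv1d_valid(row, kh) for row in image]
    cols = list(zip(*horiz))
    vert_cols = [conv1d_valid(list(col), kv) for col in cols]
    return [list(row) for row in zip(*vert_cols)]

def full_conv2d(image, kernel2d):
    """Direct 2-D 'valid' convolution for comparison."""
    kh_len, kw_len = len(kernel2d), len(kernel2d[0])
    out_h = len(image) - kh_len + 1
    out_w = len(image[0]) - kw_len + 1
    return [[sum(image[y + dy][x + dx] * kernel2d[dy][dx]
                 for dy in range(kh_len) for dx in range(kw_len))
             for x in range(out_w)]
            for y in range(out_h)]
```

When the 2-D kernel is the outer product of the two 1-D kernels, both formulations produce identical results, which is exactly why the two chained depthwise convs in the test case are equivalent to one larger conv.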

I understand the transform dialect might help here and I intend to try it myself; any help is welcome. The above also motivates a more complex example: pad_conv_fuse_test_case.mlir.txt. In this case we add non-trivial padding, where edge pixels are replicated into the padding region.
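Edge-replicate padding like this clamps every out-of-bounds coordinate to the nearest valid pixel. A tiny pure-Python model of that semantics (illustrative only, not IREE or MLIR API):

```python
def pad_replicate(image, pad):
    """Edge-replicate pad of a 2-D image: each padded element reads the
    source at the nearest in-bounds coordinate. This is a model of the op's
    meaning, not of how the compiler lowers it."""
    h, w = len(image), len(image[0])

    def clamp(v, lo, hi):
        return max(lo, min(v, hi))

    return [[image[clamp(y - pad, 0, h - 1)][clamp(x - pad, 0, w - 1)]
             for x in range(w + 2 * pad)]
            for y in range(h + 2 * pad)]
```

Because every output element is an independent gather from the source, this is naturally expressible as an elementwise map over the padded shape rather than as something that needs unrolling.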

In this case my observation is that IREE actually merges the pads into the same dispatch regions as the following convs, but doesn't apply workgroup tiling, and eventually the pad seems to be completely unrolled across the entire image (it's hard to spot in stderr because so much IR is printed so rapidly).

I think one approach here is to break the image into multiple dispatch regions where boundary and bulk are handled separately. So for this particular example one might have something like 9 dispatch regions (4 corners, 4 sides, bulk).
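As a sketch of that decomposition, the 9 regions can be computed from the image size and the halo (border) width the pad/conv needs special-cased; plain Python for illustration, not a proposed IREE pass:

```python
def split_regions(height, width, halo):
    """Split an image into 9 rectangles: 4 corners, 4 edge strips, and the
    interior bulk. `halo` is the border width that needs boundary handling.
    Each region is (y0, y1, x0, x1) in half-open coordinates."""
    assert height > 2 * halo and width > 2 * halo
    ys = [0, halo, height - halo, height]
    xs = [0, halo, width - halo, width]
    return [(ys[i], ys[i + 1], xs[j], xs[j + 1])
            for i in range(3) for j in range(3)]
```

The regions tile the image exactly, so each could in principle become its own dispatch (or, per the reply below, its own code path within one dispatch).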

I wonder if such an approach makes sense? I have a rather hacky implementation of a pass that does the separation for one pad/conv combination, but I haven't looked at the tensor.pad unrolling issue, so my final generated .vmfb is not performant anyway.

benvanik commented 1 year ago

If you're seeing some crazy unrolling, that's bad - we really shouldn't even have that lever in the compiler and should fall back to scalar loops instead. I feel like this comes up about once a week, and it'd be really good to fix :) /cc @dcaballe

Before going too far with that test case it'd be good to look at a larger one - it's really hard to make decisions based on microbenchmarks. What may be a small win in a microbenchmark (slightly different consumer padding fusion behavior to handle unpadded loads, masking/swizzling, boundary conditions, etc.) usually masks what is a massive win hiding in real applications (fusing padding with producers such that the entire net emission is a scalar add to an output buffer pointer).

As a rule of thumb, whenever more than a single dispatch is involved it's really hard to use microbenchmarks for development unless they've been carefully extracted from larger representative programs. Even with single dispatches it can be difficult, as the compiler may choose fallbacks in those cases that it wouldn't otherwise (consumer padding fusion kicks in if there's no producer, so you'll see a single dispatch, but it's not the same as if it were a single dispatch in a full program).

A single dispatch region with multiple code paths based on the workgroup ID is preferred here - if we split into multiple dispatches we would have a much harder time running them concurrently as we can't analyze the memory hazards. It's not impossible to analyze and would likely entail slicing out the boundaries, doing the work, and then inserting everything into a target tensor - and that has a lot of issues of its own.

In classic kernel coding this would be something like:

```c
if (workgroup_id.x == 0)                       { /* left */ }
else if (workgroup_id.y == 0)                  { /* top */ }
else if (workgroup_id.x == workgroup_count.x - 1) { /* right */ }
else if (workgroup_id.y == workgroup_count.y - 1) { /* bottom */ }
```

Or y/z as the image space if x is used for threading, etc. I've seen people do things like treat the entire z=0 plane as the center contents and z!=0 as borders by restructuring their workgroup mapping, but that's save-a-single-scalar-instruction territory, so not as critical as getting things into the right dispatch region. Note that this is good on CPU as well as GPU, but the programming model mostly originated on the GPU side out of necessity.
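The kernel-style branching above can be modeled in plain Python to make the role selection concrete (a sketch, not IREE code; note that the last workgroup along an axis has id count - 1, not count, and that the branch order means corner workgroups take the first matching role):

```python
def workgroup_role(wg_x, wg_y, count_x, count_y):
    """Classify a workgroup by its position in the grid so a single dispatch
    can select a boundary or bulk code path. Branch order matches the
    pseudocode: left, top, right, bottom, then bulk."""
    if wg_x == 0:
        return "left"
    if wg_y == 0:
        return "top"
    if wg_x == count_x - 1:
        return "right"
    if wg_y == count_y - 1:
        return "bottom"
    return "bulk"
```

In a real dispatch the id/count values would come from the workgroup intrinsics rather than function arguments; the point is only that one grid can serve several code paths.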

Codegen may already support this but there may be rough edges. To use this kind of pattern you'd emit `flow.dispatch.workgroup.id`/`flow.dispatch.workgroup.count` ops in your dispatch regions and then `scf.if` and other control flow ops to decide the behavior. I suspect things with the automatic workgroup count inference will break, but you can explicitly specify that instead.

A big area to look into (🤞 in 2023) is vertical tiling which would let us fuse much more and improve locality while retaining concurrency.

dpackwood commented 1 year ago

Thanks for your reply. I will attach the overall pipeline we are interested in here: bokeh_linalg.mlir.txt

But it has a huge caveat: here we are working around the pad issue by building the padded regions separately and then concatting them back onto the bulk image. This is a workaround for the issues with tensor.pad, and is something I am looking to get rid of, hence the question about fusing these things into a single workgroup, and about the padding unrolling.

Thanks for the pointer about the workgroup ID; that approach hadn't occurred to me and it makes a lot of sense. I was thinking in a paradigm where each workgroup element would be doing exactly the same work, but now I realise that is not strictly necessary.

dpackwood commented 1 year ago

Just a small note to say that I spotted that later on the tensor.pad gets converted to a linalg.map anyway. If I change my test case to use linalg.map directly, the unrolling goes away, so I am proceeding with that as a working direction.