iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.
http://iree.dev/

Identify DMA-compatible dispatch ops and turn them into the corresponding flow ops #5105

Closed: benvanik closed this issue 3 years ago

benvanik commented 3 years ago

(continuing from https://github.com/google/iree/pull/5102#discussion_r593777760, mostly here to track/discuss vs. being a single issue - we can split it up as desired :)

Getting on the DMA path when possible is very important on non-CPU targets (GPU, TPU, remote/sandbox, etc.) as it allows transfer operations to be overlapped with execution and keeps transfers from stealing execution resources from the actual work we want to perform.

Currently we look for specific HLO ops inside the compiler, like mhlo.dynamic_update_slice, so that we can turn them into DMA operations: https://github.com/google/iree/blob/4e21f99d4d1aad4d7c79783c89d5fcab852c371e/iree/compiler/Dialect/Flow/Conversion/HLOToFlow/ConvertHLOToFlow.cpp#L42-L59 This is a bad way of doing it for many reasons: it brings in an HLO dependency, prevents HLO->linalg from running earlier, doesn't handle TOSA or other input dialects at all, and prevents folding, since the ops then can't get combined with other linalg ops that may make them much more efficient, etc.
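
For reference, the rewrite that conversion performs is essentially the following (a hand-written sketch: the SSA names are hypothetical and the exact flow.tensor.update assembly format is approximate):

// Before: the input program expresses the update with an HLO op.
%updated = "mhlo.dynamic_update_slice"(%target, %update, %i, %j)
    : (tensor<4x4xf32>, tensor<1x4xf32>, tensor<i32>, tensor<i32>) -> tensor<4x4xf32>
// After the HLO-specific conversion: a pure data-movement op (the converter
// first extracts the scalar start indices as index values).
%updated_flow = flow.tensor.update %update, %target[%i_index, %j_index]
    : tensor<1x4xf32> -> tensor<4x4xf32>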

Instead we could have patterns registered as canonicalizations on flow.dispatch.workgroups, or run as part of a dedicated post-dispatch-formation pass, that look for the small set of DMA-compatible behavior we want to handle. Then we can perform the rewrite regardless of input dialect and in a uniform way that does not prevent potential fusions (since it runs after them). With the recent addition of tied results on dispatch ops this will also allow for more chances at copy elision: we aggressively fuse these transfer ops and only look at the ones left standalone after dispatch region formation, which therefore could not benefit from fusion.

Specific ops we'll want to look for are the ones whose behavior maps directly onto the existing flow.tensor.* transfer ops.

By using the flow.tensor.* ops we'll be able to better allocate memory, better schedule execution (especially across streams), and hoist operations into variables. The original set of tensor ops was picked at random and can be refined as part of this: we likely don't need flow.tensor.reshape anymore, for example, while there may be other kinds of ops we want - particularly those related to partial host->device transfers or readbacks.

For small non-contiguous sets we can insert loops or unroll - for example, an update of 4 partial rows of a larger tensor could become 4 flow.tensor.update ops. There's of course a limit to this: a partial scatter/gather across many non-contiguous regions would be better suited to execution, as otherwise you may need hundreds or thousands of DMA operations to emulate it. For the initial implementation it'd be fine to restrict this to bounded cases we know about; a future improvement could be to use scf.if on dynamic values to either do a bounded loop of DMA operations or switch to an execution-based approach past a threshold. We can also turn flow.tensor.update ops on non-contiguous tensors into flow.dispatch.workgroups ops: this would let us handle these cases more generally, as the simple cases remain flow.tensor.update all the way down to the HAL (where they become the hal.command_buffer.copy_buffer DMA op or placed allocations) and the more complex ones move to execution. Lots of fun design space to explore.
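
As a sketch of the unrolled case (shapes, indices, and SSA names are illustrative, and the flow.tensor.update assembly format is approximate), updating part of rows 3..6 of a 16x128 tensor becomes four chained row copies:

// Four DMA-style copies, one per partial row; each consumes the previous result.
%c3 = constant 3 : index
%c4 = constant 4 : index
%c5 = constant 5 : index
%c6 = constant 6 : index
%c8 = constant 8 : index
%u0 = flow.tensor.update %row0, %target[%c3, %c8] : tensor<1x32xf32> -> tensor<16x128xf32>
%u1 = flow.tensor.update %row1, %u0[%c4, %c8] : tensor<1x32xf32> -> tensor<16x128xf32>
%u2 = flow.tensor.update %row2, %u1[%c5, %c8] : tensor<1x32xf32> -> tensor<16x128xf32>
%u3 = flow.tensor.update %row3, %u2[%c6, %c8] : tensor<1x32xf32> -> tensor<16x128xf32>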

While we do have tied results now that allow for in-place operations, performing partial in-place updates is often a pessimization for concurrency and allocation, as the entire allocation must be kept live and synchronized with respect to any other operation that may be sharing the underlying buffer. This means that a dispatch that updates a single byte of a 100MB buffer would prevent all other dispatches from concurrently touching that 100MB buffer, or may even cause that 100MB to be cloned in order to preserve the semantic timeline of the original program! This issue will help avoid many of those situations, as the flow.tensor.* ops can be used to perform better placement (for example, produce a single byte as part of execution and then copy that byte into its final destination). One of the canonicalizations here may be to take dispatches with subtensor_insert and change them to produce only the inserted value, then feed the dispatch result into a flow.tensor.update - the allocator will then be able to ensure placement and elide any copies. This would only work if the target location of the insert could be hoisted as well and wasn't, for example, a scatter indirection. That may happen by construction during dispatch region formation, but we should verify. More fun designs to play with :)
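
A rough sketch of that canonicalization's result, assuming the insert offsets can be hoisted (the names and assembly format here are hypothetical):

// %tile is now the only result of the (smaller) dispatch; the placement into
// %dest becomes an explicit transfer that the allocator can later elide by
// writing the dispatch result directly into its final location.
%placed = flow.tensor.update %tile, %dest[%row, %col]
    : tensor<1x64xf32> -> tensor<128x64xf32>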

As another example, this fill-only function could become a flow.tensor.splat:

func @splat() -> tensor<1x225x225x3xf32> {
  %cst = constant 0.000000e+00 : f32
  %0 = linalg.init_tensor [1, 225, 225, 3] : tensor<1x225x225x3xf32>
  %1 = linalg.fill(%0, %cst) : tensor<1x225x225x3xf32>, f32 -> tensor<1x225x225x3xf32>
  return %1 : tensor<1x225x225x3xf32>
}
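
After such a rewrite the whole function would collapse to roughly the following (a sketch; the exact flow.tensor.splat assembly format may differ):

func @splat() -> tensor<1x225x225x3xf32> {
  %cst = constant 0.000000e+00 : f32
  %0 = flow.tensor.splat %cst : tensor<1x225x225x3xf32>
  return %0 : tensor<1x225x225x3xf32>
}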

We can poke through some IR for our more complex models coming out of the linalg-on-tensors path and see what other ones we can spot.

MaheshRavishankar commented 3 years ago

This looks like a really nice starter task if someone outside the regular folks wants to pick it up.

MaheshRavishankar commented 3 years ago

@julianwa FYI in case there is someone you know of that is looking for such starter tasks.

benvanik commented 3 years ago

This looks like a really nice starter task if someone outside the regular folks wants to pick it up.

For sure!

antiagainst commented 3 years ago

I've a patch to fold a pure fill into TensorSplatOp. I'll send it out after landing a few dependencies.

MaheshRavishankar commented 3 years ago

This is mostly done now. Maybe done too aggressively. Closing this.