iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.
http://iree.dev/
Apache License 2.0

Using an operand's `iree.abi.tied` attribute to `MaterializeCopyOnWrite` in a `func::CallOp` #13003

Open vmurali opened 1 year ago

vmurali commented 1 year ago

What happened?

MaterializeCopyOnWrite ignores the `iree.abi.tied` attribute tying a call operand to a result: it assumes the operand is not modified by the call, so the (now-updated) operand is incorrectly reused downstream where the original value is required.

In the following example, `Triple.function` receives the updated value of a previous `tensor.empty` instead of a fresh `tensor.empty`.

module @example {
  func.func private @Double.function(tensor<?x?xi32>, tensor<?x?xi32>) -> (tensor<?x?xi32> {iree.abi.tied = 1 : index}) attributes {
    iree.abi.model = "coarse-fences",
    nosideeffects
  }

  func.func private @Triple.function(tensor<?x?xi32>, tensor<?x?xi32>) -> (tensor<?x?xi32> {iree.abi.tied = 1 : index}) attributes {
    iree.abi.model = "coarse-fences",
    nosideeffects
  }

  func.func @main(%arg0: tensor<?x?xi32>) -> tensor<?x?xi32> {

    %c0 = arith.constant 0: index
    %c1 = arith.constant 1: index
    %dim0 = tensor.dim %arg0, %c0: tensor<?x?xi32>
    %dim1 = tensor.dim %arg0, %c1: tensor<?x?xi32>
    %0 = arith.muli %arg0, %arg0 : tensor<?x?xi32>
    %init = tensor.empty(%dim0, %dim1) : tensor<?x?xi32>
    %p1 = call @Double.function(%0, %init) : (tensor<?x?xi32>, tensor<?x?xi32>) -> tensor<?x?xi32>
    %init2 = tensor.empty(%dim0, %dim1) : tensor<?x?xi32>
    %1 = call @Triple.function(%p1, %init2) : (tensor<?x?xi32>, tensor<?x?xi32>) -> tensor<?x?xi32>

    %2 = arith.muli %1, %1 : tensor<?x?xi32>

    return %2 : tensor<?x?xi32>
  }

}

Steps to reproduce your issue

$ iree-compile --iree-execution-model=async-external --iree-hal-target-backends=llvm-cpu samples/custom_module/async/test/example.mlir --compile-to=stream

This produces a single stream allocation shared by both `tensor.empty` ops:

    %10 = stream.resource.alloc uninitialized : !stream.resource<external>{%3} <--------------- allocation
    stream.timepoint.chain_external %9 => (%fence : !hal.fence)
    %fence_0 = hal.fence.create device(%device : !hal.device) flags("None") : !hal.fence
    %11 = stream.tensor.export %6 : tensor<?x?xi32>{%0, %1} in !stream.resource<external>{%3} -> !hal.buffer_view
    %12 = stream.tensor.export %10 : tensor<?x?xi32>{%0, %1} in !stream.resource<external>{%3} -> !hal.buffer_view <------------- export
    %13 = call @Double.function(%11, %12, %fence, %fence_0) : (!hal.buffer_view, !hal.buffer_view, !hal.fence, !hal.fence) -> !hal.buffer_view <--------------- first use
    %14 = hal.buffer_view.dim<%13 : !hal.buffer_view>[0] : index
    %15 = hal.buffer_view.dim<%13 : !hal.buffer_view>[1] : index
    hal.buffer_view.assert<%13 : !hal.buffer_view> message("tensor") shape([%14, %15]) type(%c268435488_i32) encoding(%c1_i32)
    %16 = stream.timepoint.import %fence_0 : (!hal.fence) => !stream.timepoint
    %fence_1 = hal.fence.create device(%device : !hal.device) flags("None") : !hal.fence
    stream.timepoint.chain_external %16 => (%fence_1 : !hal.fence)
    %fence_2 = hal.fence.create device(%device : !hal.device) flags("None") : !hal.fence
    %17 = call @Triple.function(%13, %12, %fence_1, %fence_2) : (!hal.buffer_view, !hal.buffer_view, !hal.fence, !hal.fence) -> !hal.buffer_view <------------------ second use

What component(s) does this issue relate to?

Compiler, Runtime

vmurali commented 1 year ago

A (bad, but functional) workaround is for the external C(++) function to allocate a new buffer and return it, instead of updating the operand in place.

benvanik commented 1 year ago

Returning a new allocation has been the assumption for this kind of custom work (hence why there's no support for this yet :), but it can definitely be improved. I think we may be able to hold off on actually improving it, though, if we can get you a workaround.

The external function can use iree_hal_device_queue_alloca to schedule an async allocation; that's effectively what the compiler would be doing if this were fixed.

External functions in general have a lot of issues around memory management: without a tremendous amount of annotation we'll never be able to do the right thing for each scenario, as there's an infinite space of "right". For example, we don't know whether the external function is going to use the memory on the host and needs it mapped into host memory, or whether the memory will live longer than a few seconds and shouldn't come from our high-frequency pools. If an external function returns 4 values we also don't know whether those need 4 separate allocations or could be 4 slices into one, or even whether those allocations are from the same device as the values passed in (which themselves may come from multiple devices). This is what streamable calls are meant to help with: the mechanism adds some limits so that we can assume the memory is transient and device-local, etc.
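[Editor's note: for context, a queue-ordered allocation is roughly what the compiler itself emits at the stream level. The sketch below is hypothetical and the exact syntax may differ across IREE versions; the operand names are illustrative only.]

```mlir
// Hypothetical sketch: a queue-ordered (async) allocation at the stream level.
// The alloca waits on %wait_tp and yields a timepoint that later work can
// chain on, mirroring what iree_hal_device_queue_alloca does at the HAL level.
%resource, %ready_tp = stream.resource.alloca uninitialized await(%wait_tp)
    => !stream.resource<transient>{%size} => !stream.timepoint
```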

So if you want to unblock I'd allocate inside the function as samples/custom_module/async/ does: https://github.com/openxla/iree/blob/df7522b098f9be1252a114d76a80c5fba9b19d42/samples/custom_module/async/module.cc#L211-L235 That's known to work today and doing the allocation on the compiler side is mostly just an optimization (though sometimes/often a pessimization!) that also avoids some native code goo.

For this issue I think we'd need to figure out everything that may impact the behavior and ensure it's all modeled. Conceptually, registering TiedOpInterface on func::CallOp would make materialize-copy-on-write happy, but it gets real messy: there's no verification, and with things like outlining/inlining each func.call would need the correct information, which would then need to stay in sync across dialect conversion/signature rewriting, etc. I'm pretty sure it's something we would not be able to land, but it may be fine for local hacking. A longer-term approach is to have a util.func/util.call that implements all our fancy interfaces and properly tracks/verifies, but that's a larger bit of work (similar to #12797, which wasn't a quick fix).
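[Editor's note: to make the "each func.call would need the correct information" point concrete, a per-call-site annotation mirroring the declaration's tied attribute might hypothetically look like the sketch below. This is not valid or supported IR today; it only illustrates what a TiedOpInterface registration on func::CallOp would need to consume.]

```mlir
// Hypothetical: tied-operand info repeated at the call site, kept in sync
// with @Triple.function's {iree.abi.tied = 1 : index} result attribute.
%1 = call @Triple.function(%p1, %init2) {iree.abi.tied = 1 : index}
    : (tensor<?x?xi32>, tensor<?x?xi32>) -> tensor<?x?xi32>
```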

It's never going to be good to treat func.func/func.call as anything more than generic calls where performance is a total coin-flip. I think we're probably ok with that for now until we have more running and we can show what the issues are. I'm leaning towards suggesting the workaround of doing what samples/custom_module/async/ does with its queue-ordered allocation and waiting to see what we want here (possibly an entirely new mechanism).

vmurali commented 1 year ago

So if you want to unblock I'd allocate inside the function

Yes, this is what I meant by "A (bad, but functional) workaround is for the external C(++) function to allocate a new buffer and return that instead of updating in place."

The reasons this is bad are as follows:

a) One needs to pass in the shapes of the output tensors. This requires allocating a buffer to store the shapes and sending it across, when an output operand buffer would already carry all of this information. Moreover, the shape information must be available to the host, which means the buffer holding the shapes cannot be allocated on the device. This is tricky to do in the compiler: it requires creating two different buffer types, one for shapes and one for tensors.

b) Alternatively, one can pass an output buffer to avoid allocating a buffer for shapes, but then one has to pass in a dummy tensor with the correct output shape (using tensor.empty). This results in an additional allocation for the tensor.empty. Ignoring the extra allocation, though, this would work.

vmurali commented 1 year ago

MaterializeCopyOnWrite is also not ideal, because it will literally be copying undefined values to a new tensor. Is there an equivalent of bufferization.alloc_tensor (semantics here) for stream allocations? I am asking this because of the following discord discussion:

Murali Vijayaraghavan: @benvanik is there an equivalent of bufferization.alloc_tensor meant for stream resource allocation? (here)

Ben Vanik: maybe, but you're still asking ... (here)
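[Editor's note: for readers unfamiliar with the op being referenced, bufferization.alloc_tensor materializes a tensor that is guaranteed to bufferize to a fresh allocation, with no copy. A minimal sketch (the dimension values are illustrative):]

```mlir
// bufferization.alloc_tensor yields an uninitialized tensor that bufferizes
// to a new allocation, so no copy-on-write materialization is needed.
%d0 = arith.constant 4 : index
%d1 = arith.constant 8 : index
%t = bufferization.alloc_tensor(%d0, %d1) : tensor<?x?xi32>
```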