iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.
http://iree.dev/
Apache License 2.0

[DT] Plans for the buffer allocation in data-tiling #17924

Open hanhanW opened 3 months ago

hanhanW commented 3 months ago

How I think about buffer allocation in data-tiling

  1. The default path currently materializes encodings at a very early stage (i.e., GlobalOpt), while we want to build the late materialization path.
  2. The early materialization path uses query_upper_bound + pad to get the actual size, while the late materialization uses max_padding.
  3. One of my assumptions is that we will move the SetEncoding pass and MaterializeEncoding pass from GlobalOpt to preprocessing after we build the late materialization path.
  4. The other assumption I have is that we will no longer use the query_upper_bound op and pad op once we're in a better position.
  5. Then there are two paths. One is doing everything in preprocessing, and the other is what we're building now.
  6. For the preprocessing one, we don't limit inner tile sizes because they will be materialized to actual pack/unpack ops way before Stream. It happens at the preprocessing level!
  7. For the late materialization, we set encodings at the flow dispatch level and introduce max_padding for later stream allocation.
  8. This is why I'm saying that we can use max_padding to drive the tile size limitation in the materialization patterns.

What we can get from here is:

  1. This unblocks the multi-device work, so we will get to the stage that Ben showed us on the Seattle trip.
  2. The further work is that we'll learn how to propagate inner tile sizes to Stream or query the tile sizes from encodings. I assume that we will know the potential targets for each dispatch. This is where we use the LCM to compute buffer sizes at Stream (see the sketch after this list). The max_padding is no longer important at this phase.
  3. Given that the early materialization is moved to preprocessing, we will be able to insert those hints as well (multi-device things, but only one CPU target in the list). So we should be able to use the same mechanism here.
  4. If we can't use the same mechanism at preprocessing and assume that max_padding is not used at all, we can discard max_padding and just materialize them to whatever pack/unpack/mmt4d ops we want. It means that max_padding will no longer be needed. The materialization patterns ignore max_padding, which makes me think that we can use the execution plan below for now.
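To make the LCM idea in item 2 concrete, below is a minimal C++ sketch, assuming we already know the inner tile size each potential target would pick for a dimension (the helper name is illustrative, not an existing IREE API):

#include <cstdint>
#include <numeric>
#include <vector>

// Illustrative only: an allocation that must serve several potential targets
// has to be valid for every target's inner tile size, so we round the
// dimension up to the least common multiple of those tile sizes.
int64_t alignDimForTargets(int64_t dimSize,
                           const std::vector<int64_t> &tileSizesPerTarget) {
  int64_t alignment = 1;
  for (int64_t tileSize : tileSizesPerTarget)
    alignment = std::lcm(alignment, tileSize);
  // Round up to the next multiple of the combined alignment.
  return (dimSize + alignment - 1) / alignment * alignment;
}

For example, if one target tiles a dimension by 8 and another by 12, the LCM is 24, so a dimension of size 100 would be allocated as 120.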

Execution Plan

Retire the query_upper_bound op and CPUMaterializeUpperBoundTileSize pass.

Goal: remove the old operations and decouple the dependency between HAL and the CPU-specific pass.

Plan: Update the max_padding semantics in the encoding. If it is set, the backend should take it into account and select appropriate inner tile sizes (to avoid out-of-bounds access). If it is not set, the backend can pick whatever inner tile sizes it wants. In the current default path (which will eventually be moved to preprocessing), we do not set the max_padding attribute. In the path that we're building, we set the max_padding attribute to hint the actual buffer size for Stream.
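To illustrate the intended semantics, here is a hedged C++ sketch of how a backend could pick an inner tile size; the helper is hypothetical and only mirrors the rule above (respect the hint if present, otherwise choose freely):

#include <algorithm>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical helper: pick the largest candidate tile size whose worst-case
// padding stays within the max_padding hint. Without a hint, any candidate is
// allowed. A tile of size N pads a dimension by at most N - 1 elements.
int64_t selectInnerTileSize(const std::vector<int64_t> &candidates,
                            std::optional<int64_t> maxPadding) {
  int64_t best = 1;
  for (int64_t candidate : candidates) {
    if (!maxPadding || candidate - 1 <= *maxPadding)
      best = std::max(best, candidate);
  }
  return best;
}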

Finish the data-tiling fusion and basic functional GPU data-tiling

See https://github.com/iree-org/iree/issues/17722 for more details. Basically we want to enable fusion for mmt4d ops on the CPU side, and build the data-tiling path for GPU. There are some changes needed in the CPU backend, because mmt4d fusion is new. It is scoped in the #17722 issue.

Outcome: we'll be able to flip data-tiling to the fusion path and use data-tiling in the multi-device project.

Move SetEncoding and MaterializeEncoding from GlobalOpt to preprocessing

Learn buffer allocation for multi-device (i.e., LCM?)

More items: TBD

cc @MaheshRavishankar @benvanik @bjacob @Max191 @pashu123 @lialan

hanhanW commented 1 week ago

@benvanik I implemented the AffinityAnalysisDialectInterface interface and a pass that attaches the list of executable targets to the targets field. I'll switch to EncodingSolver later; it is more like a prototype.

I'm going to look at SpecializeEncodingsPass tomorrow. Just in case I misunderstood our discussion, could you skim through the IR or the implementation when you're available? The implementation only modifies the encodings in util.func, but not the executables. My understanding is that the change to the executables will happen in SpecializeEncodingsPass, which is not implemented yet.

Snippet of the IR dump:

// Before the pass
%10 = stream.tensor.sizeof on(#hal.device.affinity<@__device_0>) tensor<?x?xf32, #iree_encoding.encoding<operand_index = 0 : index, op_type =  matmul, element_types = [f32, f32, f32], user_indexing_maps = [#map, #map1, #map2],
    round_dims_to = array<i64: 32, 32, 32>>>{%0, %1} : index

// After the pass
%10 = stream.tensor.sizeof on(#hal.device.affinity<@__device_0>) tensor<?x?xf32, #iree_encoding.encoding<operand_index = 0 : index, op_type =  matmul, element_types = [f32, f32, f32], user_indexing_maps = [#map, #map1, #map2],
    round_dims_to = array<i64: 32, 32, 32>, targets = [#executable_target_vmvx_bytecode_fb]>>{%0, %1} : index

benvanik commented 1 week ago

Looks promising! Some initial notes:

hanhanW commented 1 week ago

Thanks for the note, it is helpful!

We can't have the IR be in an inconsistent state where the host and device don't agree on the encodings.

I see, makes sense!

We will need to modify all tensor types on any stream tensor op (there's a IREE::Stream::TensorPhaseOp op trait indicating which ones are tensor ops) so that the types remain consistent.

I found that TensorPhaseOp is just a trait; all the stream tensor ops have an argument with "TypeAttr" type (e.g., encoding, result_encoding, target_encoding, etc.). I think we only need to update those TypeAttr arguments. Do we need to update other tensor types?

benvanik commented 1 week ago

the trait would allow for filtering to just those ops that may have tensors we want to change, instead of all ops in the program - so your update code, instead of isa TensorSizeOfOp, would be hasTrait TensorPhaseOp, then walk the op and change any type attrs

hanhanW commented 1 week ago

so your update code, instead of isa TensorSizeOfOp, would be hasTrait TensorPhaseOp, then walk the op and change any type attrs

SG! I'm using the tensor.sizeof op for the prototype now; I'll switch it to TensorPhaseOp later. (I'd like to see at least one e2e workflow working.) But this is exactly what I'm looking for! In my prototype, I filter the ops with the Stream_AffinityOp interface; I'll switch to TensorPhaseOp.
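For reference, the update I have in mind looks roughly like the sketch below; the trait namespace spelling and the updateEncoding() helper are assumptions, not the final code:

// Sketch only: filter to stream tensor ops via the TensorPhaseOp trait and
// rewrite every TypeAttr attribute (encoding, result_encoding,
// target_encoding, ...) instead of matching individual op types.
mlir::Type updateEncoding(mlir::Type type);  // computes the specialized type

void updateStreamTensorEncodings(mlir::ModuleOp moduleOp) {
  moduleOp.walk([&](mlir::Operation *op) {
    // The trait namespace is a guess at the IREE code layout.
    if (!op->hasTrait<mlir::OpTrait::IREE::Stream::TensorPhaseOp>())
      return;
    // Copy the attribute list since setAttr replaces the op's dictionary.
    for (mlir::NamedAttribute attr : llvm::to_vector(op->getAttrs())) {
      auto typeAttr = llvm::dyn_cast<mlir::TypeAttr>(attr.getValue());
      if (!typeAttr)
        continue;
      mlir::Type newType = updateEncoding(typeAttr.getValue());
      op->setAttr(attr.getName(), mlir::TypeAttr::get(newType));
    }
  });
}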

hanhanW commented 1 week ago

I made some progress and got stuck in specialization. The issues are mostly about how we gather affinities, clone dispatches, and update encodings, especially for the multi-device concept. I was going to ping @benvanik on Discord, then I realized that it is Friday afternoon! So I'm leaving messages here, and hopefully we can walk through an example next Monday.

Progress update and potential issue in EncodingSolver

I moved the Stream dialect interface from analysis/ to IR/ and verified that there are no dependency issues. I finished the backend encoding solver prototype (using VMVX), and found that there is a duplication issue when we create the solver. The difficulty is that the solver needs to access the target config (like cpu_features, iree_gpu configurations, etc.). We can either (a) pass the dictionary through the interface method (i.e., calculateStorageElementCountInBytes) or (b) store the information in a parameter (like the snippet below).

The issue with (a) is that we need to hold the dictionary somewhere until we resolve all the encoding information. It would make the EncodingAttr's targets field hold a list of (Solver, either_target_executable_or_dictionary_configuration) pairs. I can't find a pair attribute, so we would likely need to introduce one in the encoding dialect.

The issue with (b) is that we duplicate the config in the IR: one copy is in the solver, and the other is in the ExecutableTargets.

def IREECPU_VMVXEncodingSolverAttr :
    AttrDef<IREECPU_Dialect, "VMVXEncodingSolver", [
  DeclareAttrInterfaceMethods<IREEEncoding_EncodingSolverInterfaceAttr, [
    "calculateStorageElementCountInBytes",
  ]>
]> {
  let mnemonic = "vmvx_encoding_solver";
  let summary = "The encoding solver for VMVX backend";

  let assemblyFormat = "`<` struct(params) `>`";

  // The target configuration (from HAL::ExecutableTargetAttr) needs to be
  // initialized. Otherwise, it is not able to resolve encodings.
  let parameters = (ins
    AttrParameter<"DictionaryAttr", "">:$target_configuration
  );
}

Both solutions look bad to me. I think we need (c): let ExecutableTargetAttr inherit from an Encoding attribute interface. It is something similar to what we discussed in the note:

  ExecutableTargetAttr : EncodingAttrInterface
    if (config.contains("encoding")) return encoding;
    return nullptr;
#hal.executable.target<{
  unified
  encoding_solver = #iree_gpu.amdgpu_encoding_solver // additional field
}>

In the getExecutableTarget methods, we can populate the attribute and store it in encoding_solver. We keep the ExecutableTarget attributes in the list and resolve the encodings in the specialization. Notably, the HAL dialect only depends on the Encoding dialect. IMO, it is a cleaner way.

The prototype currently takes approach (b). It does not matter which one is implemented in the prototype; I'm not worried about it because it is solvable. I just need some input about which path I should go with. I like (c) better; what do you think?
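For completeness, here is a minimal sketch of what (c) could look like; the "encoding_solver" key and the free function are placeholders rather than a proposed final API:

// Sketch of option (c): the HAL executable target hands out the encoding
// solver from its own configuration, so nothing else needs a second copy of
// the target config.
mlir::Attribute lookupEncodingSolver(IREE::HAL::ExecutableTargetAttr targetAttr) {
  mlir::DictionaryAttr config = targetAttr.getConfiguration();
  if (!config)
    return {};
  // Each backend's getExecutableTarget() would populate this entry.
  return config.get("encoding_solver");
}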

Specialization Issue

This part is hard to work out without an example. I'm at a state where I can produce the required encoding attribute, so I'd like to look at the IR details together with @benvanik. My first step is creating the input IR and studying the multi-device concept. The writeup is good, btw.

I learned that a device could refer to a list of available devices. An AffinityAttr indicates a device (which can be a list, and the device is selected from the list). My understanding was wrong because I thought that it included all the devices.

Inlining the note below; I need @benvanik to help me unpack more context from it. I don't understand the meaning of the terms export, execution affinity, resource affinities, and dispatch site. There are two export ops: one is the stream.executable.export op, and the other is the stream.tensor.export op. Which one is the op that you mentioned in the note? Is it the executable.export op?

3. SpecializeEncodingsPass
  a. gather per-export [execution affinity -> [resource affinities]] map
  b. duplicate executable for each unique set of resource affinities
  c. update dispatch site to new executable
  d. update encoding attrs at all dispatch sites to executable targets
  e. update encoding attrs in all bindings to executable targets

// export -> [affinity -> array per resource of affinities PVS]
DenseMap<ExecutableExportOp, SetVector<std::pair<AffinityAttr, ArrayAttr>>> exportDispatchSites;

per dispatch site:
  each tensor has affinities per execution
  tryLookupExecutionAffinity(dispatch)
  tryLookupResourceAffinity(operand) / result
  (may want to expose ValueConsumerAffinityPVS)
  export key: [when executing on A, arg0=[A, B], arg1=[A]]
    "when executing on A then need arg0=[A, B], arg1=[A]"
  assume one execution affinity for now; no-op encoding if multiple
  if export is opaque (hal.executable/etc) no-op encoding
per export:
  f.e. unique site affinity
    duplicate executable
    update site to relevant duplicate executable (@ex1_a, @ex1_b)
  f.e. dispatch site per affinity:
    f.e. operand:
      union affinities from all sites
      get required targets
      update encoding attr

The snippet below is inlined from the output of the current MakeEncodingSolvable pass. Let's take %11 as an example. The affinity is @__device_0, which has two device targets.

#executable_target_embedded_elf_x86_64_ = #hal.executable.target<"llvm-cpu", "embedded-elf-x86_64", {cpu = "generic", cpu_features = "", data_layout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128", native_vector_size = 16 : i64, target_triple = "x86_64-unknown-unknown-eabi-elf"}>
#executable_target_vmvx_bytecode_fb = #hal.executable.target<"vmvx", "vmvx-bytecode-fb", {ukernels = "none"}>
#map = affine_map<(d0, d1, d2) -> (d0, d2)>
#map1 = affine_map<(d0, d1, d2) -> (d2, d1)>
#map2 = affine_map<(d0, d1, d2) -> (d0, d1)>
#device_target_local = #hal.device.target<"local", [#executable_target_embedded_elf_x86_64_]> : !hal.device
#device_target_local1 = #hal.device.target<"local", [#executable_target_vmvx_bytecode_fb]> : !hal.device
module attributes {stream.affinity.default = #hal.device.affinity<@__device_0>} {
  util.global private @__device_0 = #hal.device.select<[#device_target_local, #device_target_local1]> : !hal.device
  stream.executable private @foo_dispatch_0 {
    stream.executable.export public @foo_dispatch_0_set_encoding_LHS_DxD workgroups(%arg0: index, %arg1: index) -> (index, index, index) {
      %x, %y, %z = flow.dispatch.workgroup_count_from_slice %arg0, %arg1
      stream.return %x, %y, %z : index, index, index
    }
    builtin.module {
      func.func @foo_dispatch_0_set_encoding_LHS_DxD(%arg0: !stream.binding, %arg1: index, %arg2: index, %arg3: !stream.binding) {
// ...
        return
      }
    }
// ...
  util.func public @foo(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub, iree.reflection = {iree.abi.declaration = "sync func @foo(%input0: tensor<?x?xf32>, %input1: tensor<?x?xf32>) -> (%output0: tensor<?x?xf32>)"}} {
    %c0 = arith.constant 0 : index
    %0 = hal.buffer_view.dim<%arg0 : !hal.buffer_view>[0] : index
    %1 = hal.buffer_view.dim<%arg0 : !hal.buffer_view>[1] : index
    %element_type_f32 = hal.element_type<f32> : i32
    %dense_row_major = hal.encoding_type<dense_row_major> : i32
    hal.buffer_view.assert<%arg0 : !hal.buffer_view> message("input0") shape([%0, %1]) type(%element_type_f32) encoding(%dense_row_major)
    %2 = stream.tensor.sizeof on(#hal.device.affinity<@__device_0>) tensor<?x?xf32>{%0, %1} : index
    %3 = stream.tensor.import on(#hal.device.affinity<@__device_0>) %arg0 : !hal.buffer_view -> tensor<?x?xf32>{%0, %1} in !stream.resource<external>{%2}
    %4 = stream.async.transfer %3 : !stream.resource<external>{%2} from(#hal.device.affinity<@__device_0>) -> to(#hal.device.affinity<@__device_0>) !stream.resource<*>{%2}
    %5 = hal.buffer_view.dim<%arg1 : !hal.buffer_view>[0] : index
    %6 = hal.buffer_view.dim<%arg1 : !hal.buffer_view>[1] : index
    hal.buffer_view.assert<%arg1 : !hal.buffer_view> message("input1") shape([%5, %6]) type(%element_type_f32) encoding(%dense_row_major)
    %7 = stream.tensor.sizeof on(#hal.device.affinity<@__device_0>) tensor<?x?xf32>{%5, %6} : index
    %8 = stream.tensor.import on(#hal.device.affinity<@__device_0>) %arg1 : !hal.buffer_view -> tensor<?x?xf32>{%5, %6} in !stream.resource<external>{%7}
    %9 = stream.async.transfer %8 : !stream.resource<external>{%7} from(#hal.device.affinity<@__device_0>) -> to(#hal.device.affinity<@__device_0>) !stream.resource<*>{%7}
    %10 = stream.tensor.sizeof on(#hal.device.affinity<@__device_0>) tensor<?x?xf32, #iree_encoding.encoding<operand_index = 0 : index, op_type =  matmul, element_types = [f32, f32, f32], user_indexing_maps = [#map, #map1, #map2], round_dims_to = array<i64: 32, 32, 32>>>{%0, %1} : index
    %11 = stream.async.dispatch on(#hal.device.affinity<@__device_0>) @foo_dispatch_0::@foo_dispatch_0_set_encoding_LHS_DxD[%0, %1](%4[%c0 to %2 for %2], %0, %1) : (!stream.resource<*>{%2}, index, index) -> !stream.resource<*>{%10}
    %12 = stream.tensor.sizeof on(#hal.device.affinity<@__device_0>) tensor<?x?xf32, #iree_encoding.encoding<operand_index = 1 : index, op_type =  matmul, element_types = [f32, f32, f32], user_indexing_maps = [#map, #map1, #map2], round_dims_to = array<i64: 32, 32, 32>>>{%5, %6} : index
    %13 = stream.async.dispatch on(#hal.device.affinity<@__device_0>) @foo_dispatch_1::@foo_dispatch_1_set_encoding_RHS_DxD[%5, %6](%9[%c0 to %7 for %7], %5, %6) : (!stream.resource<*>{%7}, index, index) -> !stream.resource<*>{%12}
    %14 = stream.tensor.sizeof on(#hal.device.affinity<@__device_0>) tensor<?x?xf32, #iree_encoding.encoding<operand_index = 2 : index, op_type =  matmul, element_types = [f32, f32, f32], user_indexing_maps = [#map, #map1, #map2], round_dims_to = array<i64: 32, 32, 32>>>{%0, %5} : index
    %15 = stream.async.dispatch on(#hal.device.affinity<@__device_0>) @foo_dispatch_2::@foo_dispatch_2_matmul_DxDxD_f32[%1, %6, %0, %5](%11[%c0 to %10 for %10], %13[%c0 to %12 for %12], %1, %6, %0, %5) : (!stream.resource<*>{%10}, !stream.resource<*>{%12}, index, index, index, index) -> !stream.resource<*>{%14}
    %16 = stream.tensor.sizeof on(#hal.device.affinity<@__device_0>) tensor<?x?xf32>{%0, %5} : index
    %17 = stream.async.dispatch on(#hal.device.affinity<@__device_0>) @foo_dispatch_3::@foo_dispatch_3_unset_encoding_RESULT_DxD[%0, %5](%15[%c0 to %14 for %14], %0, %5) : (!stream.resource<*>{%14}, index, index) -> !stream.resource<*>{%16}
    %18 = stream.async.transfer %17 : !stream.resource<*>{%16} from(#hal.device.affinity<@__device_0>) -> to(#hal.device.affinity<@__device_0>) !stream.resource<external>{%16}
    %19 = stream.tensor.export on(#hal.device.affinity<@__device_0>) %18 : tensor<?x?xf32>{%0, %5} in !stream.resource<external>{%16} -> !hal.buffer_view
    util.return %19 : !hal.buffer_view
  }
}

What does "dispatch site" mean? There are ops like stream.*.dispatch @foo_dispatch_0::@foo_dispatch_0_set_encoding_LHS_DxD. Is it foo_dispatch_0_set_encoding_LHS_DxD in the example?

%11 = stream.async.dispatch
  on(#hal.device.affinity<@__device_0>)
  @foo_dispatch_0::@foo_dispatch_0_set_encoding_LHS_DxD
    [%0, %1](%4[%c0 to %2 for %2], %0, %1)
  : (!stream.resource<*>{%2}, index, index) -> !stream.resource<*>{%10}

What do we do when we duplicate an executable? Does it mean that we are cloning more func.func ops in stream.executable(builtin.module(...))? E.g.,

  stream.executable private @foo_dispatch_0 {
    stream.executable.export public @foo_dispatch_0_set_encoding_LHS_DxD workgroups(%arg0: index, %arg1: index) -> (index, index, index) {
      %x, %y, %z = flow.dispatch.workgroup_count_from_slice %arg0, %arg1
      stream.return %x, %y, %z : index, index, index
    }
    builtin.module {
      func.func @foo_dispatch_0_set_encoding_LHS_DxD(%arg0: !stream.binding, %arg1: index, %arg2: index, %arg3: !stream.binding) {
// ...
        return
      }

becomes

  stream.executable private @foo_dispatch_0 {
    stream.executable.export public @foo_dispatch_0_set_encoding_LHS_DxD workgroups(%arg0: index, %arg1: index) -> (index, index, index) {
      %x, %y, %z = flow.dispatch.workgroup_count_from_slice %arg0, %arg1
      stream.return %x, %y, %z : index, index, index
    }
    builtin.module {
      // The device name could be a suffix; I make it a prefix for readability.
      func.func @llvmcpu_foo_dispatch_0_set_encoding_LHS_DxD {
         // the target field in the encoding becomes [#executable_target_embedded_elf_x86_64_ ]
      }
      func.func @vmvx_foo_dispatch_0_set_encoding_LHS_DxD {
         // the target field in the encoding becomes [#executable_target_vmvx_bytecode_fb ]
      }
      func.func @foo_dispatch_0_set_encoding_LHS_DxD(%arg0: !stream.binding, %arg1: index, %arg2: index, %arg3: !stream.binding) {
// ...
        return
      }

Is that correct? If so, what should the main function look like? It was a single stream.async.dispatch op. Where do I add the logic for deciding which dispatch site should be used? Do I add some if-else conditions in the main function, or do I create an entry function in the executable and make the decision there?

Also, how do I get the "execution affinity"? I assumed that it means the actual device that we're going to run on; is that correct?

%11 = stream.async.dispatch
  on(#hal.device.affinity<@__device_0>)
  @foo_dispatch_0::@foo_dispatch_0_set_encoding_LHS_DxD
    [%0, %1](%4[%c0 to %2 for %2], %0, %1)
  : (!stream.resource<*>{%2}, index, index) -> !stream.resource<*>{%10}

It would be very helpful if we could look at the IR and modify a few of these manually sometime next week!

benvanik commented 1 week ago

Let's chat next week, but I'm confused about what a solver is and why it needs anything at all on it. We shouldn't have any duplication. The solver is just a way to reference a function, essentially, and doesn't need any information of its own (unless there is encoding-specific information). Maybe we also need to change the name "solver" - that may be causing the confusion.

benvanik commented 1 week ago

(epic progress, though! it's really coming together :)

hanhanW commented 21 hours ago

I have a prototype that addresses the duplicated config issue. One of the challenges is that the attribute is not mutable, so we cannot update a field once we create it. The other challenge is that the interface can't have parameters (which is fair). So my solution is to declare an interface method to get the config.

The prototype wraps the whole dictionary config into "encoding". In HAL::ExecutableTargetAttr, I renamed configuration to wrap_configuration and implemented a new getConfiguration method (just a quick workaround for existing code). If there is an encoding, the method returns the config wrapped in the attribute. Otherwise, it returns the config directly.

Without this commit, the IR is:

#executable_target_vmvx_bytecode_fb =
  #hal.executable.target<
    "vmvx",
    "vmvx-bytecode-fb",
    {encoding_solver = #iree_cpu.vmvx_encoding_solver<target_configuration = {ukernels = "none"}>, {ukernels = "none"} }
>

With the commit, the IR is:

#executable_target_vmvx_bytecode_fb =
  #hal.executable.target<
    "vmvx",
    "vmvx-bytecode-fb",
    {encoding_solver = #iree_cpu.vmvx_encoding_solver<target_configuration = {ukernels = "none"}>}
>
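
A rough C++ sketch of the getConfiguration() workaround described above; the accessor names (getWrapConfiguration, getTargetConfiguration) are approximations of the prototype, not the final API:

// Sketch: return the config wrapped by the encoding_solver attribute when one
// is present, so existing callers of getConfiguration() keep working.
mlir::DictionaryAttr ExecutableTargetAttr::getConfiguration() const {
  mlir::DictionaryAttr config = getWrapConfiguration();  // the renamed field
  if (!config)
    return config;
  if (auto solver = config.getAs<IREE::CPU::VMVXEncodingSolverAttr>(
          "encoding_solver"))
    return solver.getTargetConfiguration();
  return config;
}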

======

Side note:

I found a bug about dialect registration in a few passes during prototyping. The overridden getDependentTarget (from TargetBackend) is not used in the AssignLegacyTargetDevices pass and the ResolveDeviceAliases pass. So they are ignored, and only the HAL dialect (and the dialects that HAL depends on) are loaded at the pass level. I'll send the fix later.