iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.
http://iree.dev/
Apache License 2.0

[DT] Plans for the buffer allocation in data-tiling #17924

Open hanhanW opened 3 months ago

hanhanW commented 3 months ago

How I think about buffer allocation in data-tiling

  1. The default path currently materializes encodings at a very early stage (i.e., GlobalOpt), while we want to build the late materialization path.
  2. The early materialization path uses query_upper_bound + pad to get the actual size, while the late materialization uses max_padding.
  3. One of my assumptions is that we will move the SetEncoding pass and MaterializeEncoding pass from GlobalOpt to preprocessing after we build the late materialization path.
  4. The other assumption I have is that we will no longer use the query_upper_bound op and pad op once we're in a better position.
  5. Then there are two paths. One is doing everything in preprocessing, and the other is what we're building now.
  6. For the preprocessing one, we don't limit inner tile sizes because they will be materialized to actual pack/unpack ops way before Stream. It happens at the preprocessing level!
  7. For the late materialization, we set encodings at the flow dispatch level and introduce max_padding for later stream allocation.
  8. This is why I'm saying that we can use max_padding to drive the tile size limitation in the materialization patterns.

What we can get from here is:

  1. This unblocks the multi-device work, so we will get to the stage that Ben showed us on the Seattle trip.
  2. The further work is that we'll learn how to propagate inner tile sizes to Stream or query the tile sizes from encodings. I assume that we will know the potential targets for each dispatch. This is where we use the LCM to compute buffer sizes at Stream (see the sketch after this list). The max_padding is no longer important at this phase.
  3. Given that the early materialization is moved to preprocessing, we will be able to insert those hints as well (multi-device things, but only one CPU target in the list). So we should be able to use the same mechanism here.
  4. If we can't use the same mechanism at preprocessing and assume that max_padding is not used at all, we can discard max_padding and just materialize them to whatever pack/unpack/mmt4d ops we want. It means that max_padding will no longer be needed. The materialization patterns ignore max_padding, which makes me think that we can use the execution plan below for now.
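To make the LCM idea in item 2 concrete, below is a minimal C++ sketch, assuming we already know the inner tile size each potential target would pick for a dimension (the helper name is illustrative, not an existing IREE API):

#include <cstdint>
#include <numeric>
#include <vector>

// Illustrative only: an allocation that must serve several potential targets
// has to be valid for every target's inner tile size, so we round the
// dimension up to the least common multiple of those tile sizes.
int64_t alignDimForTargets(int64_t dimSize,
                           const std::vector<int64_t> &tileSizesPerTarget) {
  int64_t alignment = 1;
  for (int64_t tileSize : tileSizesPerTarget)
    alignment = std::lcm(alignment, tileSize);
  // Round up to the next multiple of the combined alignment.
  return (dimSize + alignment - 1) / alignment * alignment;
}

For example, if one target tiles a dimension by 8 and another by 12, the LCM is 24, so a dimension of size 100 would be allocated as 120.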

Execution Plan

Retire the query_upper_bound op and CPUMaterializeUpperBoundTileSize pass.

Goal: remove the old operations and decouple the dependency between HAL and the CPU-specific pass.

Plan: Update the max_padding semantics in the encoding. If it is set, the backend should take it into account and select appropriate inner tile sizes (to avoid out-of-bounds access). If it is not set, the backend can pick whatever inner tile sizes it wants. In the current default path (which will eventually be moved to preprocessing), we do not set the max_padding attribute. In the path that we're building, we set the max_padding attribute to hint the actual buffer size for Stream.
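To illustrate the intended semantics, here is a hedged C++ sketch of how a backend could pick an inner tile size; the helper is hypothetical and only mirrors the rule above (respect the hint if present, otherwise choose freely):

#include <algorithm>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical helper: pick the largest candidate tile size whose worst-case
// padding stays within the max_padding hint. Without a hint, any candidate is
// allowed. A tile of size N pads a dimension by at most N - 1 elements.
int64_t selectInnerTileSize(const std::vector<int64_t> &candidates,
                            std::optional<int64_t> maxPadding) {
  int64_t best = 1;
  for (int64_t candidate : candidates) {
    if (!maxPadding || candidate - 1 <= *maxPadding)
      best = std::max(best, candidate);
  }
  return best;
}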

Finish the data-tiling fusion and basic functional GPU data-tiling

See https://github.com/iree-org/iree/issues/17722 for more details. Basically we want to enable fusion for mmt4d ops on the CPU side, and build the data-tiling path for GPU. There are some changes needed in the CPU backend, because mmt4d fusion is new. It is scoped in the #17722 issue.

Outcome: we'll be able to flip data-tiling to the fusion path and use data-tiling in the multi-device project.

Move SetEncoding and MaterializeEncoding from GlobalOpt to preprocessing

Learn buffer allocation for multi-device (i.e., LCM?)

More items: TBD

cc @MaheshRavishankar @benvanik @bjacob @Max191 @pashu123 @lialan

hanhanW commented 1 week ago

@benvanik I implemented the AffinityAnalysisDialectInterface interface and a pass that attaches the list of executable targets to the targets field. I'll switch to EncodingSolver later; it is more like a prototype.

I'm going to look at SpecializeEncodingsPass tomorrow. Just in case I misunderstood our discussion, could you skim through the IR or the implementation when you're available? The implementation only modifies the encodings in util.func, but not the executables. My understanding is that the change to the executables will happen in SpecializeEncodingsPass, which is not implemented yet.

Snippet of the IR dump:

// Before the pass
%10 = stream.tensor.sizeof on(#hal.device.affinity<@__device_0>) tensor<?x?xf32, #iree_encoding.encoding<operand_index = 0 : index, op_type =  matmul, element_types = [f32, f32, f32], user_indexing_maps = [#map, #map1, #map2],
    round_dims_to = array<i64: 32, 32, 32>>>{%0, %1} : index

// After the pass
%10 = stream.tensor.sizeof on(#hal.device.affinity<@__device_0>) tensor<?x?xf32, #iree_encoding.encoding<operand_index = 0 : index, op_type =  matmul, element_types = [f32, f32, f32], user_indexing_maps = [#map, #map1, #map2],
    round_dims_to = array<i64: 32, 32, 32>, targets = [#executable_target_vmvx_bytecode_fb]>>{%0, %1} : index

benvanik commented 1 week ago

Looks promising! Some initial notes:

hanhanW commented 1 week ago

Thanks for the note, it is helpful!

We can't have the IR be in an inconsistent state where the host and device don't agree on the encodings.

I see, makes sense!

We will need to modify all tensor types on any stream tensor op (there's a IREE::Stream::TensorPhaseOp op trait indicating which ones are tensor ops) so that the types remain consistent.

I found that TensorPhaseOp is just a trait; all the stream tensor ops have an argument with "TypeAttr" type (e.g., encoding, result_encoding, target_encoding, etc.). I think we only need to update those TypeAttr arguments. Do we need to update other tensor types?

benvanik commented 1 week ago

the trait would allow for filtering to just those ops that may have tensors we want to change, instead of all ops in the program - so your update code, instead of isa TensorSizeOfOp, would be hasTrait TensorPhaseOp, then walk the op and change any type attrs

hanhanW commented 1 week ago

so your update code, instead of isa TensorSizeOfOp, would be hasTrait TensorPhaseOp, then walk the op and change any type attrs

SG! I'm using the tensor.sizeof op for the prototype now; I'll switch it to TensorPhaseOp later. (I'd like to see at least one e2e workflow working.) But this is exactly what I'm looking for! In my prototype, I filter the ops with the Stream_AffinityOp interface; I'll switch to TensorPhaseOp.
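For reference, the update I have in mind looks roughly like the sketch below; the trait namespace spelling and the updateEncoding() helper are assumptions, not the final code:

// Sketch only: filter to stream tensor ops via the TensorPhaseOp trait and
// rewrite every TypeAttr attribute (encoding, result_encoding,
// target_encoding, ...) instead of matching individual op types.
mlir::Type updateEncoding(mlir::Type type);  // computes the specialized type

void updateStreamTensorEncodings(mlir::ModuleOp moduleOp) {
  moduleOp.walk([&](mlir::Operation *op) {
    // The trait namespace is a guess at the IREE code layout.
    if (!op->hasTrait<mlir::OpTrait::IREE::Stream::TensorPhaseOp>())
      return;
    // Copy the attribute list since setAttr replaces the op's dictionary.
    for (mlir::NamedAttribute attr : llvm::to_vector(op->getAttrs())) {
      auto typeAttr = llvm::dyn_cast<mlir::TypeAttr>(attr.getValue());
      if (!typeAttr)
        continue;
      mlir::Type newType = updateEncoding(typeAttr.getValue());
      op->setAttr(attr.getName(), mlir::TypeAttr::get(newType));
    }
  });
}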

hanhanW commented 1 week ago

I made some progress and got stuck in specialization. The issues are mostly about how we gather affinities, clone dispatches, and update encodings, especially for the multi-device concept. I was going to ping @benvanik on Discord, then I realized that it is Friday afternoon! So I'm leaving messages here, and hopefully we can walk through an example next Monday.

Progress update and potential issue in EncodingSolver

I moved the Stream dialect interface from analysis/ to IR/ and verified that there are no dependency issues. I finished the backend encoding solver prototype (using VMVX), and found that there is a duplication issue when we create the solver. The difficulty is that the solver needs to access the target config (like cpu_features, iree_gpu configurations, etc.). We can either (a) pass the dictionary through the interface method (i.e., calculateStorageElementCountInBytes) or (b) store the information in a parameter (like the snippet below).

The issue with (a) is that we need to hold the dictionary somewhere until we resolve all the encoding information. It would make the EncodingAttr's targets field hold a list of (Solver, either_target_executable_or_dictionary_configuration) pairs. I can't find a pair attribute, so we would likely need to introduce one in the encoding dialect.

The issue with (b) is that we duplicate the config in the IR: one copy is in the solver, and the other is in the ExecutableTargets.

def IREECPU_VMVXEncodingSolverAttr :
    AttrDef<IREECPU_Dialect, "VMVXEncodingSolver", [
  DeclareAttrInterfaceMethods<IREEEncoding_EncodingSolverInterfaceAttr, [
    "calculateStorageElementCountInBytes",
  ]>
]> {
  let mnemonic = "vmvx_encoding_solver";
  let summary = "The encoding solver for VMVX backend";

  let assemblyFormat = "`<` struct(params) `>`";

  // The target configuration (from HAL::ExecutableTargetAttr) needs to be
  // initialized. Otherwise, it is not able to resolve encodings.
  let parameters = (ins
    AttrParameter<"DictionaryAttr", "">:$target_configuration
  );
}

Both solutions look bad to me. I think we need (c): let ExecutableTargetAttr inherit from an Encoding attribute interface. It is something similar to what we discussed in the note:

  ExecutableTargetAttr : EncodingAttrInterface
    if (config.contains("encoding")) return encoding;
    return nullptr;
#hal.executable.target<{
  unified
  encoding_solver = #iree_gpu.amdgpu_encoding_solver // additional field
}>

In the getExecutableTarget methods, we can populate the attribute and store it in encoding_solver. We keep the ExecutableTarget attributes in the list and resolve the encodings in the specialization. Notably, the HAL dialect only depends on the Encoding dialect. IMO, it is a cleaner way.

The prototype currently takes approach (b). It does not matter which one is implemented in the prototype; I'm not worried about it because it is solvable. I just need some input about which path I should go with. I like (c) better; what do you think?
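For completeness, here is a minimal sketch of what (c) could look like; the "encoding_solver" key and the free function are placeholders rather than a proposed final API:

// Sketch of option (c): the HAL executable target hands out the encoding
// solver from its own configuration, so nothing else needs a second copy of
// the target config.
mlir::Attribute lookupEncodingSolver(IREE::HAL::ExecutableTargetAttr targetAttr) {
  mlir::DictionaryAttr config = targetAttr.getConfiguration();
  if (!config)
    return {};
  // Each backend's getExecutableTarget() would populate this entry.
  return config.get("encoding_solver");
}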

Specialization Issue

This part is hard to work out without an example. I'm at a state where I can produce the required encoding attribute, so I'd like to look at the IR details together with @benvanik. My first step is creating the input IR and studying the multi-device concept. The writeup is good, btw.

I learned that a device could refer to a list of available devices. An AffinityAttr indicates a device (which can be a list, and the device is selected from the list). My understanding was wrong because I thought that it included all the devices.

Inlining the note below; I need @benvanik to help me unpack more context from it. I don't understand the meaning of the terms export, execution affinity, resource affinities, and dispatch site. There are two export ops: one is the stream.executable.export op, and the other is the stream.tensor.export op. Which one is the op that you mentioned in the note? Is it the executable.export op?

3. SpecializeEncodingsPass
  a. gather per-export [execution affinity -> [resource affinities]] map
  b. duplicate executable for each unique set of resource affinities
  c. update dispatch site to new executable
  d. update encoding attrs at all dispatch sites to executable targets
  e. update encoding attrs in all bindings to executable targets

// export -> [affinity -> array per resource of affinities PVS]
DenseMap<ExecutableExportOp, SetVector<std::pair<AffinityAttr, ArrayAttr>>> exportDispatchSites;

per dispatch site:
  each tensor has affinities per execution
  tryLookupExecutionAffinity(dispatch)
  tryLookupResourceAffinity(operand) / result
  (may want to expose ValueConsumerAffinityPVS)
  export key: [when executing on A, arg0=[A, B], arg1=[A]]
    "when executing on A then need arg0=[A, B], arg1=[A]"
  assume one execution affinity for now; no-op encoding if multiple
  if export is opaque (hal.executable/etc) no-op encoding
per export:
  f.e. unique site affinity
    duplicate executable
    update site to relevant duplicate executable (@ex1_a, @ex1_b)
  f.e. dispatch site per affinity:
    f.e. operand:
      union affinities from all sites
      get required targets
      update encoding attr

The snippet below is inlined from the output of the current MakeEncodingSolvable pass. Let's take %11 as an example. The affinity is @__device_0, which has two device targets.

#executable_target_embedded_elf_x86_64_ = #hal.executable.target<"llvm-cpu", "embedded-elf-x86_64", {cpu = "generic", cpu_features = "", data_layout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128", native_vector_size = 16 : i64, target_triple = "x86_64-unknown-unknown-eabi-elf"}>
#executable_target_vmvx_bytecode_fb = #hal.executable.target<"vmvx", "vmvx-bytecode-fb", {ukernels = "none"}>
#map = affine_map<(d0, d1, d2) -> (d0, d2)>
#map1 = affine_map<(d0, d1, d2) -> (d2, d1)>
#map2 = affine_map<(d0, d1, d2) -> (d0, d1)>
#device_target_local = #hal.device.target<"local", [#executable_target_embedded_elf_x86_64_]> : !hal.device
#device_target_local1 = #hal.device.target<"local", [#executable_target_vmvx_bytecode_fb]> : !hal.device
module attributes {stream.affinity.default = #hal.device.affinity<@__device_0>} {
  util.global private @__device_0 = #hal.device.select<[#device_target_local, #device_target_local1]> : !hal.device
  stream.executable private @foo_dispatch_0 {
    stream.executable.export public @foo_dispatch_0_set_encoding_LHS_DxD workgroups(%arg0: index, %arg1: index) -> (index, index, index) {
      %x, %y, %z = flow.dispatch.workgroup_count_from_slice %arg0, %arg1
      stream.return %x, %y, %z : index, index, index
    }
    builtin.module {
      func.func @foo_dispatch_0_set_encoding_LHS_DxD(%arg0: !stream.binding, %arg1: index, %arg2: index, %arg3: !stream.binding) {
// ...
        return
      }
    }
// ...
  util.func public @foo(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub, iree.reflection = {iree.abi.declaration = "sync func @foo(%input0: tensor<?x?xf32>, %input1: tensor<?x?xf32>) -> (%output0: tensor<?x?xf32>)"}} {
    %c0 = arith.constant 0 : index
    %0 = hal.buffer_view.dim<%arg0 : !hal.buffer_view>[0] : index
    %1 = hal.buffer_view.dim<%arg0 : !hal.buffer_view>[1] : index
    %element_type_f32 = hal.element_type<f32> : i32
    %dense_row_major = hal.encoding_type<dense_row_major> : i32
    hal.buffer_view.assert<%arg0 : !hal.buffer_view> message("input0") shape([%0, %1]) type(%element_type_f32) encoding(%dense_row_major)
    %2 = stream.tensor.sizeof on(#hal.device.affinity<@__device_0>) tensor<?x?xf32>{%0, %1} : index
    %3 = stream.tensor.import on(#hal.device.affinity<@__device_0>) %arg0 : !hal.buffer_view -> tensor<?x?xf32>{%0, %1} in !stream.resource<external>{%2}
    %4 = stream.async.transfer %3 : !stream.resource<external>{%2} from(#hal.device.affinity<@__device_0>) -> to(#hal.device.affinity<@__device_0>) !stream.resource<*>{%2}
    %5 = hal.buffer_view.dim<%arg1 : !hal.buffer_view>[0] : index
    %6 = hal.buffer_view.dim<%arg1 : !hal.buffer_view>[1] : index
    hal.buffer_view.assert<%arg1 : !hal.buffer_view> message("input1") shape([%5, %6]) type(%element_type_f32) encoding(%dense_row_major)
    %7 = stream.tensor.sizeof on(#hal.device.affinity<@__device_0>) tensor<?x?xf32>{%5, %6} : index
    %8 = stream.tensor.import on(#hal.device.affinity<@__device_0>) %arg1 : !hal.buffer_view -> tensor<?x?xf32>{%5, %6} in !stream.resource<external>{%7}
    %9 = stream.async.transfer %8 : !stream.resource<external>{%7} from(#hal.device.affinity<@__device_0>) -> to(#hal.device.affinity<@__device_0>) !stream.resource<*>{%7}
    %10 = stream.tensor.sizeof on(#hal.device.affinity<@__device_0>) tensor<?x?xf32, #iree_encoding.encoding<operand_index = 0 : index, op_type =  matmul, element_types = [f32, f32, f32], user_indexing_maps = [#map, #map1, #map2], round_dims_to = array<i64: 32, 32, 32>>>{%0, %1} : index
    %11 = stream.async.dispatch on(#hal.device.affinity<@__device_0>) @foo_dispatch_0::@foo_dispatch_0_set_encoding_LHS_DxD[%0, %1](%4[%c0 to %2 for %2], %0, %1) : (!stream.resource<*>{%2}, index, index) -> !stream.resource<*>{%10}
    %12 = stream.tensor.sizeof on(#hal.device.affinity<@__device_0>) tensor<?x?xf32, #iree_encoding.encoding<operand_index = 1 : index, op_type =  matmul, element_types = [f32, f32, f32], user_indexing_maps = [#map, #map1, #map2], round_dims_to = array<i64: 32, 32, 32>>>{%5, %6} : index
    %13 = stream.async.dispatch on(#hal.device.affinity<@__device_0>) @foo_dispatch_1::@foo_dispatch_1_set_encoding_RHS_DxD[%5, %6](%9[%c0 to %7 for %7], %5, %6) : (!stream.resource<*>{%7}, index, index) -> !stream.resource<*>{%12}
    %14 = stream.tensor.sizeof on(#hal.device.affinity<@__device_0>) tensor<?x?xf32, #iree_encoding.encoding<operand_index = 2 : index, op_type =  matmul, element_types = [f32, f32, f32], user_indexing_maps = [#map, #map1, #map2], round_dims_to = array<i64: 32, 32, 32>>>{%0, %5} : index
    %15 = stream.async.dispatch on(#hal.device.affinity<@__device_0>) @foo_dispatch_2::@foo_dispatch_2_matmul_DxDxD_f32[%1, %6, %0, %5](%11[%c0 to %10 for %10], %13[%c0 to %12 for %12], %1, %6, %0, %5) : (!stream.resource<*>{%10}, !stream.resource<*>{%12}, index, index, index, index) -> !stream.resource<*>{%14}
    %16 = stream.tensor.sizeof on(#hal.device.affinity<@__device_0>) tensor<?x?xf32>{%0, %5} : index
    %17 = stream.async.dispatch on(#hal.device.affinity<@__device_0>) @foo_dispatch_3::@foo_dispatch_3_unset_encoding_RESULT_DxD[%0, %5](%15[%c0 to %14 for %14], %0, %5) : (!stream.resource<*>{%14}, index, index) -> !stream.resource<*>{%16}
    %18 = stream.async.transfer %17 : !stream.resource<*>{%16} from(#hal.device.affinity<@__device_0>) -> to(#hal.device.affinity<@__device_0>) !stream.resource<external>{%16}
    %19 = stream.tensor.export on(#hal.device.affinity<@__device_0>) %18 : tensor<?x?xf32>{%0, %5} in !stream.resource<external>{%16} -> !hal.buffer_view
    util.return %19 : !hal.buffer_view
  }
}

What does "dispatch site" mean? There are ops like stream.*.dispatch @foo_dispatch_0::@foo_dispatch_0_set_encoding_LHS_DxD. Is it foo_dispatch_0_set_encoding_LHS_DxD in the example?

%11 = stream.async.dispatch
  on(#hal.device.affinity<@__device_0>)
  @foo_dispatch_0::@foo_dispatch_0_set_encoding_LHS_DxD
    [%0, %1](%4[%c0 to %2 for %2], %0, %1)
  : (!stream.resource<*>{%2}, index, index) -> !stream.resource<*>{%10}

What do we do when we duplicate an executable? Does it mean that we are cloning more func.func ops in stream.executable(builtin.module(...))? E.g.,

  stream.executable private @foo_dispatch_0 {
    stream.executable.export public @foo_dispatch_0_set_encoding_LHS_DxD workgroups(%arg0: index, %arg1: index) -> (index, index, index) {
      %x, %y, %z = flow.dispatch.workgroup_count_from_slice %arg0, %arg1
      stream.return %x, %y, %z : index, index, index
    }
    builtin.module {
      func.func @foo_dispatch_0_set_encoding_LHS_DxD(%arg0: !stream.binding, %arg1: index, %arg2: index, %arg3: !stream.binding) {
// ...
        return
      }

becomes

  stream.executable private @foo_dispatch_0 {
    stream.executable.export public @foo_dispatch_0_set_encoding_LHS_DxD workgroups(%arg0: index, %arg1: index) -> (index, index, index) {
      %x, %y, %z = flow.dispatch.workgroup_count_from_slice %arg0, %arg1
      stream.return %x, %y, %z : index, index, index
    }
    builtin.module {
      // The device name could be a suffix; I make it a prefix for readability.
      func.func @llvmcpu_foo_dispatch_0_set_encoding_LHS_DxD {
         // the target field in the encoding becomes [#executable_target_embedded_elf_x86_64_ ]
      }
      func.func @vmvx_foo_dispatch_0_set_encoding_LHS_DxD {
         // the target field in the encoding becomes [#executable_target_vmvx_bytecode_fb ]
      }
      func.func @foo_dispatch_0_set_encoding_LHS_DxD(%arg0: !stream.binding, %arg1: index, %arg2: index, %arg3: !stream.binding) {
// ...
        return
      }

Is that correct? If so, what should the main function look like? It was a single stream.async.dispatch op. Where do I add the logic for deciding which dispatch site should be used? Do I add some if-else conditions in the main function, or do I create an entry function in the executable and make the decision there?

Also, how do I get the "execution affinity"? I assumed that it means the actual device that we're going to run on; is that correct?

%11 = stream.async.dispatch
  on(#hal.device.affinity<@__device_0>)
  @foo_dispatch_0::@foo_dispatch_0_set_encoding_LHS_DxD
    [%0, %1](%4[%c0 to %2 for %2], %0, %1)
  : (!stream.resource<*>{%2}, index, index) -> !stream.resource<*>{%10}

It would be very helpful if we could look at the IR and modify a few of these manually sometime next week!

benvanik commented 1 week ago

Let's chat next week, but I'm confused about what a solver is and why it needs anything at all on it. We shouldn't have any duplication. The solver is just a way to reference a function, essentially, and doesn't need any information of its own (unless there is encoding-specific information). Maybe we also need to change the name "solver" - that may be causing the confusion.

benvanik commented 1 week ago

(epic progress, though! it's really coming together :)

hanhanW commented 21 hours ago

I have a prototype that addresses the duplicated config issue. One of the challenges is that the attribute is not mutable, so we cannot update a field once we create it. The other challenge is that the interface can't have parameters (which is fair). So my solution is to declare an interface method to get the config.

The prototype wraps the whole dictionary config into "encoding". In HAL::ExecutableTargetAttr, I renamed configuration to wrap_configuration and implemented a new getConfiguration method (just a quick workaround for existing code). If there is an encoding, the method returns the config wrapped in the attribute. Otherwise, it returns the config directly.

Without this commit, the IR is:

#executable_target_vmvx_bytecode_fb =
  #hal.executable.target<
    "vmvx",
    "vmvx-bytecode-fb",
    {encoding_solver = #iree_cpu.vmvx_encoding_solver<target_configuration = {ukernels = "none"}>, {ukernels = "none"} }
>

With the commit, the IR is:

#executable_target_vmvx_bytecode_fb =
  #hal.executable.target<
    "vmvx",
    "vmvx-bytecode-fb",
    {encoding_solver = #iree_cpu.vmvx_encoding_solver<target_configuration = {ukernels = "none"}>}
>
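
A rough C++ sketch of the getConfiguration() workaround described above; the accessor names (getWrapConfiguration, getTargetConfiguration) are approximations of the prototype, not the final API:

// Sketch: return the config wrapped by the encoding_solver attribute when one
// is present, so existing callers of getConfiguration() keep working.
mlir::DictionaryAttr ExecutableTargetAttr::getConfiguration() const {
  mlir::DictionaryAttr config = getWrapConfiguration();  // the renamed field
  if (!config)
    return config;
  if (auto solver = config.getAs<IREE::CPU::VMVXEncodingSolverAttr>(
          "encoding_solver"))
    return solver.getTargetConfiguration();
  return config;
}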

======

Side note:

I found a bug about dialect registration in a few passes during prototyping. The overridden getDependentTarget (from TargetBackend) is not used in the AssignLegacyTargetDevices pass and the ResolveDeviceAliases pass. So they are ignored, and only the HAL dialect (and the dialects that HAL depends on) are loaded at the pass level. I'll send the fix later.