hanhanW opened this issue 4 months ago
@benvanik I implemented the AffinityAnalysisDialectInterface interface and a pass that attaches the list of executable targets to the encoding's targets field. I'll switch to EncodingSolver later; this is more of a prototype.
I'm going to look at SpecializeEncodingsPass tomorrow. In case I misunderstood our discussion, could you skim through the IR or the implementation when you're available? The implementation only modifies the encodings in util.func, not in executables. My understanding is that the change to executables will happen in SpecializeEncodingsPass, which is not implemented yet.
Snippet of the IR dump:
// Before the pass
%10 = stream.tensor.sizeof on(#hal.device.affinity<@__device_0>) tensor<?x?xf32, #iree_encoding.encoding<operand_index = 0 : index, op_type = matmul, element_types = [f32, f32, f32], user_indexing_maps = [#map, #map1, #map2],
round_dims_to = array<i64: 32, 32, 32>>>{%0, %1} : index
// After the pass
%10 = stream.tensor.sizeof on(#hal.device.affinity<@__device_0>) tensor<?x?xf32, #iree_encoding.encoding<operand_index = 0 : index, op_type = matmul, element_types = [f32, f32, f32], user_indexing_maps = [#map, #map1, #map2],
round_dims_to = array<i64: 32, 32, 32>, targets = [#executable_target_vmvx_bytecode_fb]>>{%0, %1} : index
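Roughly, the prototype does the following for each stream.tensor.sizeof op (a simplified sketch; resolveExecutableTargets stands in for the AffinityAnalysisDialectInterface query and cloneWithTargets is a placeholder helper, not the actual code):

```cpp
// Attach the executable targets reachable from the op's affinity to the
// encoding's targets field.
moduleOp.walk([&](IREE::Stream::TensorSizeOfOp sizeOfOp) {
  auto tensorType = dyn_cast<RankedTensorType>(sizeOfOp.getEncoding());
  if (!tensorType) return;
  auto encoding = dyn_cast_or_null<IREE::Encoding::EncodingAttr>(
      tensorType.getEncoding());
  if (!encoding) return;
  SmallVector<Attribute> targets = resolveExecutableTargets(sizeOfOp);
  auto newEncoding = encoding.cloneWithTargets(targets);  // hypothetical helper
  auto newType = RankedTensorType::get(
      tensorType.getShape(), tensorType.getElementType(), newEncoding);
  sizeOfOp.setEncodingAttr(TypeAttr::get(newType));
});
```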
Looks promising! Some initial notes:
- AffinityAnalysisDialectInterface should be in StreamInterfaces.td
- We don't want MakeEncodingSolvable to be separate from specialization - they need to be updated together in the SpecializeEncodingsPass. We can't have the IR be in an inconsistent state where the host and device don't agree on the encodings.
- We will need to modify all tensor types on any stream tensor op (there's an IREE::Stream::TensorPhaseOp op trait indicating which ones are tensor ops) so that the types remain consistent.
Thanks for the note, it is helpful!
> We can't have the IR be in an inconsistent state where the host and device don't agree on the encodings.
I see, makes sense!
> We will need to modify all tensor types on any stream tensor op (there's an IREE::Stream::TensorPhaseOp op trait indicating which ones are tensor ops) so that the types remain consistent.
I found that TensorPhaseOp is just a trait; all the stream tensor ops have an argument of TypeAttr type (e.g., encoding, result_encoding, target_encoding, etc.). I think we only need to update those TypeAttr arguments. Do we need to update other tensor types?
The trait would allow for filtering to just those ops that may have tensors we want to change, instead of all ops in the program - so your update code, instead of isa TensorSizeOfOp, would be hasTrait TensorPhaseOp, then walk the op and change any type attrs.
> so your update code instead of isa TensorSizeOfOp would be hasTrait TensorPhaseOp then walk the op change any type attrs
SG! I'm using the tensor.sizeof op for the prototype now; I'll switch it to TensorPhaseOp later. (I'd like to see at least one e2e workflow working first.) But this is exactly what I'm looking for! In my prototype, I filter the ops with the Stream_AffinityOp interface; I'll switch to TensorPhaseOp.
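For reference, here is roughly the filtering I have in mind (a sketch only; the exact trait namespace is from memory, and updateTensorType is a placeholder for whatever rewrites the encoding inside the tensor type):

```cpp
// Filter to ops carrying the TensorPhaseOp trait instead of matching
// individual ops like stream.tensor.sizeof.
moduleOp.walk([&](Operation *op) {
  if (!op->hasTrait<OpTrait::IREE::Stream::TensorPhaseOp>()) return;
  // Stream tensor ops carry their tensor types as TypeAttr arguments
  // (encoding, result_encoding, target_encoding, ...), so walking the
  // attributes covers all of them.
  for (NamedAttribute namedAttr : op->getAttrs()) {
    auto typeAttr = dyn_cast<TypeAttr>(namedAttr.getValue());
    if (!typeAttr) continue;
    if (Type newType = updateTensorType(typeAttr.getValue())) {
      op->setAttr(namedAttr.getName(), TypeAttr::get(newType));
    }
  }
});
```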
I made some progress and got stuck in specialization. The issues are mostly about how we gather affinities, clone dispatches, and update encodings, especially with the multi-device concept. I was going to ping @benvanik on Discord, then I realized that it is Friday afternoon! So I'm leaving messages here, and hopefully we can walk through an example next Monday.
I moved the Stream dialect interface from analysis/ to IR/ and verified that there are no dependency issues. I finished the backend encoding solver prototype (using VMVX), and found that there is a duplication issue when we create the solver. The difficulty is that the solver needs to access the target config (cpu_features, iree_gpu configurations, etc.). We can either (a) pass the dictionary through the interface method (i.e., calculateStorageElementCountInBytes) or (b) store the information in a parameter (like the snippet below).
The issue with (a) is that we need to hold the dictionary somewhere until we resolve all the encoding information. It would make the EncodingAttr's targets field hold a list of (solver, either_target_executable_or_dictionary_configuration) pairs. I don't find a pair attribute, so we would likely need to introduce one in the Encoding dialect.
The issue with (b) is that we duplicate the config twice in the IR: one copy lives in the solver and the other in the ExecutableTargets.
def IREECPU_VMVXEncodingSolverAttr :
AttrDef<IREECPU_Dialect, "VMVXEncodingSolver", [
DeclareAttrInterfaceMethods<IREEEncoding_EncodingSolverInterfaceAttr, [
"calculateStorageElementCountInBytes",
]>
]> {
let mnemonic = "vmvx_encoding_solver";
let summary = "The encoding solver for VMVX backend";
let assemblyFormat = "`<` struct(params) `>`";
// The target configuration (from HAL::ExecutableTargetAttr) needs to be
// initialized. Otherwise, the solver is not able to resolve encodings.
let parameters = (ins
AttrParameter<"DictionaryAttr", "">:$target_configuration
);
}
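For context, this is roughly how I imagine approach (b) being queried when resolving a stream.tensor.sizeof; the interface method name comes from the prototype, but its signature and the max-across-targets policy here are placeholders:

```cpp
// Resolve the encoded storage size by asking each solver attached to the
// encoding's targets field. Each solver reads cpu_features/etc. from its own
// target_configuration parameter, which is where the duplication comes from.
Value calculateStorageSize(Location loc, OpBuilder &builder,
                           RankedTensorType type, ValueRange dynamicDims,
                           IREE::Encoding::EncodingAttr encoding) {
  Value maxSize;
  for (Attribute target : encoding.getTargets().getValue()) {
    auto solver = dyn_cast<IREE::Encoding::EncodingSolverInterfaceAttr>(target);
    if (!solver) continue;
    Value size = solver.calculateStorageElementCountInBytes(builder, loc, type,
                                                            dynamicDims);
    // Taking the max across targets is just one possible policy.
    maxSize = maxSize
                  ? builder.create<arith::MaxUIOp>(loc, maxSize, size).getResult()
                  : size;
  }
  return maxSize;
}
```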
Both solutions look bad to me. I think we need (c): let ExecutableTargetAttr implement an Encoding attribute interface. It is similar to what we discussed in the note:
ExecutableTargetAttr : EncodingAttrInterface
if (config.contains("encoding")) return encoding;
return nullptr;
#hal.executable.target<{
unified
encoding_solver = #iree_gpu.amdgpu_encoding_solver // additional field
}>
In the getExecutableTarget methods, we can populate the attribute and store it in the encoding_solver field. We keep the ExecutableTarget attributes in the list and resolve the encodings during specialization. Notably, the HAL dialect would only depend on the Encoding dialect. IMO, it is a cleaner way.
The prototype currently goes with approach (b). It does not really matter which one the prototype implements; I'm not worried about it because it is solvable. I just need some input about which path I should go with. I like (c) better, what do you think?
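A rough sketch of what (c) could look like on the HAL side; the method name and the "encoding_solver" key mirror the pseudo-code above, and both are hypothetical:

```cpp
// ExecutableTargetAttr implements the Encoding attribute interface by
// forwarding to an optional "encoding_solver" entry in its configuration.
Attribute ExecutableTargetAttr::getEncodingSolver() {
  if (DictionaryAttr config = getConfiguration())
    return config.get("encoding_solver");
  return {};
}
```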
This part is hard to work out without an example. I'm at a state where I can produce the required encoding attribute, so I'd like to look at the IR details together with @benvanik. My first step is creating the input IR and studying the multi-device concept. The writeup is good, btw.
I learned that a device global could refer to a list of available devices. An AffinityAttr indicates a device (which can be a list, and the device is selected from the list). My understanding was wrong because I thought that it includes all the devices.
Inlining the note below, and I need @benvanik to help me unpack more context from it. I don't understand the meaning of the terms export, execution affinity, resource affinities, and dispatch site. There are two export ops: the stream.executable.export op and the stream.tensor.export op. Which one is the op that you mentioned in the note? Is it the executable.export op?
3. SpecializeEncodingsPass
a. gather per-export [execution affinity -> [resource affinities]] map
b. duplicate executable for each unique set of resource affinities
c. update dispatch site to new executable
d. update encoding attrs at all dispatch sites to executable targets
e. update encoding attrs in all bindings to executable targets
// export -> [affinity -> array per resource of affinities PVS]
DenseMap<ExecutableExportOp, SetVector<std::pair<AffinityAttr, ArrayAttr>>> exportDispatchSites;
per dispatch site:
  each tensor has affinities per execution
  tryLookupExecutionAffinity(dispatch)
  tryLookupResourceAffinity(operand) / result
  (may want to expose ValueConsumerAffinityPVS)
  export key: [when executing on A, arg0=[A, B], arg1=[A]]
    "when executing on A then need arg0=[A, B], arg1=[A]"
assume one execution affinity for now; no-op encoding if multiple
if export is opaque (hal.executable/etc) no-op encoding
per export:
  f.e. unique site affinity
    duplicate executable
    update site to relevant duplicate executable (@ex1_a, @ex1_b)
  f.e. dispatch site per affinity:
    f.e. operand:
      union affinities from all sites
      get required targets
      update encoding attr
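If I read step (a) correctly, the gathering could look roughly like this; symbol resolution and the PVS plumbing are elided, and the exact analysis API shapes are from memory:

```cpp
// One entry per export, keyed by the dispatch's execution affinity plus the
// per-resource affinity array observed at that dispatch site.
// affinityAnalysis is an IREE::Stream::AffinityAnalysis that has already run.
DenseMap<IREE::Stream::ExecutableExportOp,
         SetVector<std::pair<IREE::Stream::AffinityAttr, ArrayAttr>>>
    exportDispatchSites;
moduleOp.walk([&](IREE::Stream::AsyncDispatchOp dispatchOp) {
  auto executionAffinity =
      affinityAnalysis.tryLookupExecutionAffinity(dispatchOp);
  SmallVector<Attribute> resourceAffinities;
  for (Value operand : dispatchOp.getOperands()) {
    if (!isa<IREE::Stream::ResourceType>(operand.getType())) continue;
    resourceAffinities.push_back(
        affinityAnalysis.tryLookupResourceAffinity(operand));
  }
  // ... resolve each entry-point symbol to its stream.executable.export op
  // (exportOp) and record the site:
  // exportDispatchSites[exportOp].insert(
  //     {executionAffinity, ArrayAttr::get(ctx, resourceAffinities)});
});
```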
The snippet below is inlined from the output of the current MakeEncodingSolvable pass. Let's take %11 as an example. Its affinity is @__device_0, which has two device targets.
#executable_target_embedded_elf_x86_64_ = #hal.executable.target<"llvm-cpu", "embedded-elf-x86_64", {cpu = "generic", cpu_features = "", data_layout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128", native_vector_size = 16 : i64, target_triple = "x86_64-unknown-unknown-eabi-elf"}>
#executable_target_vmvx_bytecode_fb = #hal.executable.target<"vmvx", "vmvx-bytecode-fb", {ukernels = "none"}>
#map = affine_map<(d0, d1, d2) -> (d0, d2)>
#map1 = affine_map<(d0, d1, d2) -> (d2, d1)>
#map2 = affine_map<(d0, d1, d2) -> (d0, d1)>
#device_target_local = #hal.device.target<"local", [#executable_target_embedded_elf_x86_64_]> : !hal.device
#device_target_local1 = #hal.device.target<"local", [#executable_target_vmvx_bytecode_fb]> : !hal.device
module attributes {stream.affinity.default = #hal.device.affinity<@__device_0>} {
util.global private @__device_0 = #hal.device.select<[#device_target_local, #device_target_local1]> : !hal.device
stream.executable private @foo_dispatch_0 {
stream.executable.export public @foo_dispatch_0_set_encoding_LHS_DxD workgroups(%arg0: index, %arg1: index) -> (index, index, index) {
%x, %y, %z = flow.dispatch.workgroup_count_from_slice %arg0, %arg1
stream.return %x, %y, %z : index, index, index
}
builtin.module {
func.func @foo_dispatch_0_set_encoding_LHS_DxD(%arg0: !stream.binding, %arg1: index, %arg2: index, %arg3: !stream.binding) {
// ...
return
}
}
// ...
util.func public @foo(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub, iree.reflection = {iree.abi.declaration = "sync func @foo(%input0: tensor<?x?xf32>, %input1: tensor<?x?xf32>) -> (%output0: tensor<?x?xf32>)"}} {
%c0 = arith.constant 0 : index
%0 = hal.buffer_view.dim<%arg0 : !hal.buffer_view>[0] : index
%1 = hal.buffer_view.dim<%arg0 : !hal.buffer_view>[1] : index
%element_type_f32 = hal.element_type<f32> : i32
%dense_row_major = hal.encoding_type<dense_row_major> : i32
hal.buffer_view.assert<%arg0 : !hal.buffer_view> message("input0") shape([%0, %1]) type(%element_type_f32) encoding(%dense_row_major)
%2 = stream.tensor.sizeof on(#hal.device.affinity<@__device_0>) tensor<?x?xf32>{%0, %1} : index
%3 = stream.tensor.import on(#hal.device.affinity<@__device_0>) %arg0 : !hal.buffer_view -> tensor<?x?xf32>{%0, %1} in !stream.resource<external>{%2}
%4 = stream.async.transfer %3 : !stream.resource<external>{%2} from(#hal.device.affinity<@__device_0>) -> to(#hal.device.affinity<@__device_0>) !stream.resource<*>{%2}
%5 = hal.buffer_view.dim<%arg1 : !hal.buffer_view>[0] : index
%6 = hal.buffer_view.dim<%arg1 : !hal.buffer_view>[1] : index
hal.buffer_view.assert<%arg1 : !hal.buffer_view> message("input1") shape([%5, %6]) type(%element_type_f32) encoding(%dense_row_major)
%7 = stream.tensor.sizeof on(#hal.device.affinity<@__device_0>) tensor<?x?xf32>{%5, %6} : index
%8 = stream.tensor.import on(#hal.device.affinity<@__device_0>) %arg1 : !hal.buffer_view -> tensor<?x?xf32>{%5, %6} in !stream.resource<external>{%7}
%9 = stream.async.transfer %8 : !stream.resource<external>{%7} from(#hal.device.affinity<@__device_0>) -> to(#hal.device.affinity<@__device_0>) !stream.resource<*>{%7}
%10 = stream.tensor.sizeof on(#hal.device.affinity<@__device_0>) tensor<?x?xf32, #iree_encoding.encoding<operand_index = 0 : index, op_type = matmul, element_types = [f32, f32, f32], user_indexing_maps = [#map, #map1, #map2], round_dims_to = array<i64: 32, 32, 32>>>{%0, %1} : index
%11 = stream.async.dispatch on(#hal.device.affinity<@__device_0>) @foo_dispatch_0::@foo_dispatch_0_set_encoding_LHS_DxD[%0, %1](%4[%c0 to %2 for %2], %0, %1) : (!stream.resource<*>{%2}, index, index) -> !stream.resource<*>{%10}
%12 = stream.tensor.sizeof on(#hal.device.affinity<@__device_0>) tensor<?x?xf32, #iree_encoding.encoding<operand_index = 1 : index, op_type = matmul, element_types = [f32, f32, f32], user_indexing_maps = [#map, #map1, #map2], round_dims_to = array<i64: 32, 32, 32>>>{%5, %6} : index
%13 = stream.async.dispatch on(#hal.device.affinity<@__device_0>) @foo_dispatch_1::@foo_dispatch_1_set_encoding_RHS_DxD[%5, %6](%9[%c0 to %7 for %7], %5, %6) : (!stream.resource<*>{%7}, index, index) -> !stream.resource<*>{%12}
%14 = stream.tensor.sizeof on(#hal.device.affinity<@__device_0>) tensor<?x?xf32, #iree_encoding.encoding<operand_index = 2 : index, op_type = matmul, element_types = [f32, f32, f32], user_indexing_maps = [#map, #map1, #map2], round_dims_to = array<i64: 32, 32, 32>>>{%0, %5} : index
%15 = stream.async.dispatch on(#hal.device.affinity<@__device_0>) @foo_dispatch_2::@foo_dispatch_2_matmul_DxDxD_f32[%1, %6, %0, %5](%11[%c0 to %10 for %10], %13[%c0 to %12 for %12], %1, %6, %0, %5) : (!stream.resource<*>{%10}, !stream.resource<*>{%12}, index, index, index, index) -> !stream.resource<*>{%14}
%16 = stream.tensor.sizeof on(#hal.device.affinity<@__device_0>) tensor<?x?xf32>{%0, %5} : index
%17 = stream.async.dispatch on(#hal.device.affinity<@__device_0>) @foo_dispatch_3::@foo_dispatch_3_unset_encoding_RESULT_DxD[%0, %5](%15[%c0 to %14 for %14], %0, %5) : (!stream.resource<*>{%14}, index, index) -> !stream.resource<*>{%16}
%18 = stream.async.transfer %17 : !stream.resource<*>{%16} from(#hal.device.affinity<@__device_0>) -> to(#hal.device.affinity<@__device_0>) !stream.resource<external>{%16}
%19 = stream.tensor.export on(#hal.device.affinity<@__device_0>) %18 : tensor<?x?xf32>{%0, %5} in !stream.resource<external>{%16} -> !hal.buffer_view
util.return %19 : !hal.buffer_view
}
}
What does "dispatch site" mean? There are ops like stream.*.dispatch @foo_dispatch_0::@foo_dispatch_0_set_encoding_LHS_DxD
. Is it foo_dispatch_0_set_encoding_LHS_DxD
in the example?
%11 = stream.async.dispatch
on(#hal.device.affinity<@__device_0>)
@foo_dispatch_0::@foo_dispatch_0_set_encoding_LHS_DxD
[%0, %1](%4[%c0 to %2 for %2], %0, %1)
: (!stream.resource<*>{%2}, index, index) -> !stream.resource<*>{%10}
What do we do when we duplicate an executable? Does it mean that we clone more func.func ops in stream.executable(builtin.module(...))? E.g.,
stream.executable private @foo_dispatch_0 {
stream.executable.export public @foo_dispatch_0_set_encoding_LHS_DxD workgroups(%arg0: index, %arg1: index) -> (index, index, index) {
%x, %y, %z = flow.dispatch.workgroup_count_from_slice %arg0, %arg1
stream.return %x, %y, %z : index, index, index
}
builtin.module {
func.func @foo_dispatch_0_set_encoding_LHS_DxD(%arg0: !stream.binding, %arg1: index, %arg2: index, %arg3: !stream.binding) {
// ...
return
}
becomes
stream.executable private @foo_dispatch_0 {
stream.executable.export public @foo_dispatch_0_set_encoding_LHS_DxD workgroups(%arg0: index, %arg1: index) -> (index, index, index) {
%x, %y, %z = flow.dispatch.workgroup_count_from_slice %arg0, %arg1
stream.return %x, %y, %z : index, index, index
}
builtin.module {
// The device name can be in suffix. I make it prefix for readability.
func.func @llvmcpu_foo_dispatch_0_set_encoding_LHS_DxD {
// the target field in the encoding becomes [#executable_target_embedded_elf_x86_64_ ]
}
func.func @vmvx_foo_dispatch_0_set_encoding_LHS_DxD {
// the target field in the encoding becomes [#executable_target_vmvx_bytecode_fb ]
}
func.func @foo_dispatch_0_set_encoding_LHS_DxD(%arg0: !stream.binding, %arg1: index, %arg2: index, %arg3: !stream.binding) {
// ...
return
}
Is that correct? If so, what should the main function look like? It was a single stream.async.dispatch op. Where do I add the logic for deciding which dispatch site should be used? Do I add some if-else conditions in the main function, or do I create an entry function in the executable and make the decision there?
Also, how do I get the "execution affinity"? I assumed that it means the actual device that we're going to run on; is that correct?
%11 = stream.async.dispatch
on(#hal.device.affinity<@__device_0>)
@foo_dispatch_0::@foo_dispatch_0_set_encoding_LHS_DxD
[%0, %1](%4[%c0 to %2 for %2], %0, %1)
: (!stream.resource<*>{%2}, index, index) -> !stream.resource<*>{%10}
It would be very helpful if we can look at the IR and modify a few of these manually sometime next week!
Let's chat next week, but I'm confused about what a solver is and why it needs anything at all on it. We shouldn't have any duplication. The solver is just a way to reference a function, essentially, and doesn't need any information of its own (unless there is encoding-specific information). Maybe we also need to change the name "solver" - that may be causing the confusion.
(epic progress, though! it's really coming together :)
I have a prototype that addresses the duplicated-config issue. One of the challenges is that attributes are not mutable, so we cannot update a field once the attribute is created. The other challenge is that the interface can't have parameters (which is fair). So my solution is to declare an interface method that returns the config.
The prototype wraps the whole dictionary config into the "encoding" attribute. In HAL::ExecutableTargetAttr, I renamed configuration to wrap_configuration and implemented a new getConfiguration method (just a quick workaround for existing code). If there is an encoding, the method returns the config wrapped inside the attribute; otherwise, it returns the config as-is.
Without this commit, the IR is:
#executable_target_vmvx_bytecode_fb =
#hal.executable.target<
"vmvx",
"vmvx-bytecode-fb",
{encoding_solver = #iree_cpu.vmvx_encoding_solver<target_configuration = {ukernels = "none"}>, {ukernels = "none"} }
>
With the commit, the IR is:
#executable_target_vmvx_bytecode_fb =
#hal.executable.target<
"vmvx",
"vmvx-bytecode-fb",
{encoding_solver = #iree_cpu.vmvx_encoding_solver<target_configuration = {ukernels = "none"}>}
>
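Sketch of that workaround; the accessor names mirror the prototype, but the exact shapes are from memory:

```cpp
// getConfiguration() unwraps the solver when present, so existing callers
// keep seeing a plain DictionaryAttr.
DictionaryAttr ExecutableTargetAttr::getConfiguration() {
  DictionaryAttr config = getWrapConfiguration();
  if (!config) return config;
  // When an encoding solver is attached, the real target config lives inside
  // its target_configuration parameter.
  if (auto solver =
          config.getAs<IREE::CPU::VMVXEncodingSolverAttr>("encoding_solver")) {
    return solver.getTargetConfiguration();
  }
  return config;
}
```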
======
Side note:
I found a bug about dialect registration in a few passes while prototyping. The overridden getDependentDialects (from TargetBackend) is not used in the AssignLegacyTargetDevices pass and the ResolveDeviceAliases pass. So those dialects are ignored, and only the HAL dialect (and the dialects that HAL depends on) are loaded at the pass level. I'll send the fix later.
I wrote an example that runs one matmul on device_a and the same matmul on device_b; it gives me the multi-device IR that we want to solve in the SpecializeEncoding pass.
I put some of the critical IR in the snippet below, and now I think I understand what we want to duplicate for executables. The set_encoding_LHS dispatch is used by both devices while referring to the same function. We need to duplicate the executable (i.e., the functions inside an export) and update each dispatch site to the relevant duplicate executable.
#executable_target_embedded_elf_x86_64_ = #hal.executable.target<"llvm-cpu", "embedded-elf-x86_64", {cpu = "generic", cpu_features = "", data_layout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128", native_vector_size = 16 : i64, target_triple = "x86_64-unknown-unknown-eabi-elf"}>
#executable_target_vmvx_bytecode_fb = #hal.executable.target<"vmvx", "vmvx-bytecode-fb", {encoding_solver = #iree_cpu.vmvx_encoding_solver<target_configuration = {ukernels = "none"}>}>
#map = affine_map<(d0, d1, d2) -> (d0, d2)>
#map1 = affine_map<(d0, d1, d2) -> (d2, d1)>
#map2 = affine_map<(d0, d1, d2) -> (d0, d1)>
#map3 = affine_map<(d0, d1) -> (d0, d1)>
#device_target_local = #hal.device.target<"local", [#executable_target_vmvx_bytecode_fb]> : !hal.device
#device_target_local1 = #hal.device.target<"local", [#executable_target_embedded_elf_x86_64_]> : !hal.device
stream.executable private @foo_dispatch_0 {
stream.executable.export public @foo_dispatch_0_set_encoding_LHS_DxD workgroups(%arg0: index, %arg1: index) -> (index, index, index) {
%x, %y, %z = flow.dispatch.workgroup_count_from_slice %arg0, %arg1
stream.return %x, %y, %z : index, index, index
}
builtin.module {
func.func @foo_dispatch_0_set_encoding_LHS_DxD(%arg0: !stream.binding, %arg1: index, %arg2: index, %arg3: !stream.binding) {
%c0 = arith.constant 0 : index
%0 = flow.dispatch.workload.ordinal %arg1, 0 : index
%1 = flow.dispatch.workload.ordinal %arg2, 1 : index
%2 = stream.binding.subspan %arg0[%c0] : !stream.binding -> !flow.dispatch.tensor<readonly:tensor<?x?xf32>>{%0, %1}
%3 = stream.binding.subspan %arg3[%c0] : !stream.binding -> !flow.dispatch.tensor<writeonly:tensor<?x?xf32, #iree_encoding.encoding<operand_index = 0 : index, op_type = matmul, element_types = [f32, f32, f32], user_indexing_maps = [#map, #map1, #map2], round_dims_to = array<i64: 32, 32, 32>>>>{%0, %1}
%4 = flow.dispatch.tensor.load %2, offsets = [0, 0], sizes = [%0, %1], strides = [1, 1] : !flow.dispatch.tensor<readonly:tensor<?x?xf32>>{%0, %1} -> tensor<?x?xf32>
%5 = iree_encoding.set_encoding %4 : tensor<?x?xf32> -> tensor<?x?xf32, #iree_encoding.encoding<operand_index = 0 : index, op_type = matmul, element_types = [f32, f32, f32], user_indexing_maps = [#map, #map1, #map2], round_dims_to = array<i64: 32, 32, 32>>>
flow.dispatch.tensor.store %5, %3, offsets = [0, 0], sizes = [%0, %1], strides = [1, 1] : tensor<?x?xf32, #iree_encoding.encoding<operand_index = 0 : index, op_type = matmul, element_types = [f32, f32, f32], user_indexing_maps = [#map, #map1, #map2], round_dims_to = array<i64: 32, 32, 32>>> -> !flow.dispatch.tensor<writeonly:tensor<?x?xf32, #iree_encoding.encoding<operand_index = 0 : index, op_type = matmul, element_types = [f32, f32, f32], user_indexing_maps = [#map, #map1, #map2], round_dims_to = array<i64: 32, 32, 32>>>>{%0, %1}
return
}
}
}
// ...
util.func public @foo(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view, %arg2: !hal.fence, %arg3: !hal.fence) -> !hal.buffer_view attributes {iree.abi.stub, iree.reflection = {iree.abi.declaration = "async func @foo(%input0: tensor<?x?xf32> {iree.abi.affinity = #hal.device.promise<@device_a>}, %input1: tensor<?x?xf32> {iree.abi.affinity = #hal.device.promise<@device_a>}) -> (%output0: tensor<?x?xf32> {iree.abi.affinity = #hal.device.promise<@device_a>})", iree.abi.model = "coarse-fences"}} {
// ...
%14 = stream.async.dispatch on(#hal.device.affinity<@device_a>) @foo_dispatch_0::@foo_dispatch_0_set_encoding_LHS_DxD[%0, %1](%6[%c0 to %2 for %2], %0, %1) : (!stream.resource<*>{%2}, index, index) -> !stream.resource<*>{%13}
// ...
%25 = stream.async.dispatch on(#hal.device.affinity<@device_b>) @foo_dispatch_0::@foo_dispatch_0_set_encoding_LHS_DxD[%0, %1](%22[%c0 to %2 for %2], %0, %1) : (!stream.resource<*>{%2}, index, index) -> !stream.resource<*>{%24}
// ....
The previous case (e.g., device = [#device_a, #device_b]), where a device has multiple targets, is something we need to treat like the no-op encoding case. I think this is the scope of the work.
I'll start implementing the rest of the SpecializeEncoding pass and share the update.
Note: here is the example input that I used in the prototype. What the IR does is run a matmul on device_a, transfer the inputs and result to device_b, run the same matmul there, add the two results, and transfer the sum back to device_a:
module {
func.func @foo(%arg0: tensor<?x?xf32> {iree.abi.affinity = #hal.device.promise<@device_a>}, %arg1: tensor<?x?xf32> {iree.abi.affinity = #hal.device.promise<@device_a>}) -> (tensor<?x?xf32> {iree.abi.affinity = #hal.device.promise<@device_a>}) {
%c0 = arith.constant 0 : index
%c1 = arith.constant 1 : index
%dim = tensor.dim %arg0, %c0 : tensor<?x?xf32>
%dim_0 = tensor.dim %arg0, %c1 : tensor<?x?xf32>
%dim_1 = tensor.dim %arg1, %c1 : tensor<?x?xf32>
%cst = arith.constant 0.000000e+00 : f32
%0 = tensor.empty(%dim, %dim_1) : tensor<?x?xf32>
%1 = linalg.fill ins(%cst : f32) outs(%0 : tensor<?x?xf32>) -> tensor<?x?xf32>
%2 = linalg.matmul ins(%arg0, %arg1 : tensor<?x?xf32>, tensor<?x?xf32>) outs(%1 : tensor<?x?xf32>) -> tensor<?x?xf32>
%3 = flow.tensor.transfer %2 : tensor<?x?xf32>{%dim, %dim_1} to #hal.device.promise<@device_b>
%4 = flow.tensor.transfer %arg0 : tensor<?x?xf32>{%dim, %dim_0} to #hal.device.promise<@device_b>
%5 = flow.tensor.transfer %arg1 : tensor<?x?xf32>{%dim_0, %dim_1} to #hal.device.promise<@device_b>
%6 = tensor.empty(%dim, %dim_1) : tensor<?x?xf32>
%7 = linalg.fill ins(%cst : f32) outs(%6 : tensor<?x?xf32>) -> tensor<?x?xf32>
%8 = linalg.matmul ins(%4, %5 : tensor<?x?xf32>, tensor<?x?xf32>) outs(%7 : tensor<?x?xf32>) -> tensor<?x?xf32>
%9 = arith.addf %3, %8 : tensor<?x?xf32>
%10 = flow.tensor.transfer %9 : tensor<?x?xf32>{%dim, %dim_1} to #hal.device.promise<@device_a>
return %10 : tensor<?x?xf32>
}
}
Command to generate the IR: build/tools/iree-compile --iree-execution-model=async-external --iree-hal-target-device="device_a=vmvx" --iree-hal-target-device="device_b=llvm-cpu" --iree-hal-local-target-device-backends=vmvx --iree-hal-local-target-device-backends=llvm-cpu --iree-global-opt-enable-early-materialization=false ~/repro.mlir -o /tmp/z.vmfb --mlir-print-ir-after-all --mlir-disable-threading 2> ~/log2. (There might be redundant CLI flags, but that's fine IMO; I just want to get the IR before the SpecializeEncoding pass.)
I have a second take on the duplicated-config issue without HAL attribute changes. It still creates an additional level of wrapping, but it is scoped within the Codegen directory. I.e., I create a new method (getTargetConfig(HAL::ExecutableTargetAttr)) in Codegen/Utils that resolves the optional encoding solver wrapper. Instead of using targetAttr.getConfiguration(), we'll need to use getTargetConfig(targetAttr) in Codegen. I'll chat with @MaheshRavishankar in tomorrow's meeting.
It looks cleaner because the host/HAL side does not really care about the configuration field (IMO). Those are target features and should(?) only be used by Codegen backends.
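Roughly, the helper would look like this (a sketch; the encoding-solver interface name and its accessor are placeholders):

```cpp
// Codegen-side helper: resolve the optional encoding_solver wrapper without
// touching the HAL attribute itself.
DictionaryAttr getTargetConfig(IREE::HAL::ExecutableTargetAttr targetAttr) {
  DictionaryAttr config = targetAttr.getConfiguration();
  if (!config) return config;
  if (auto solver = config.getAs<IREE::Encoding::EncodingSolverInterfaceAttr>(
          "encoding_solver")) {
    return solver.getTargetConfiguration();  // hypothetical interface method
  }
  return config;
}
```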
I haven't finished updating the cloned executables yet, but it looks like I'm doing something wrong, so I'm posting the update here and looking for feedback.
I have a commit which collects the "export -> affinities" map and duplicates the stream.executable ops; the commit also updates the entry points of the stream.async.dispatch ops. However, it crashes in the HAL::MaterializeInterfaces pass, which makes me feel that I'm doing something wrong. The error is an invalid cast in BindingLayoutAnalysis.
decltype(auto) llvm::dyn_cast(From *) [To = mlir::iree_compiler::IREE::Stream::ExecutableExportOp, From = mlir::Operation]: Assertion `detail::isPresent(Val) && "dyn_cast on a non-existent value"' failed.
IR before my SpecializedEncoding pass:
#executable_target_embedded_elf_x86_64_ = #hal.executable.target<"llvm-cpu", "embedded-elf-x86_64", {cpu = "generic", cpu_features = "", data_layout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128", native_vector_size = 16 : i64, target_triple = "x86_64-unknown-unknown-eabi-elf"}>
#executable_target_vmvx_bytecode_fb = #hal.executable.target<"vmvx", "vmvx-bytecode-fb", {encoding_solver = #iree_cpu.vmvx_encoding_solver<target_configuration = {ukernels = "none"}>}>
#map = affine_map<(d0, d1, d2) -> (d0, d2)>
#map1 = affine_map<(d0, d1, d2) -> (d2, d1)>
#map2 = affine_map<(d0, d1, d2) -> (d0, d1)>
#map3 = affine_map<(d0, d1) -> (d0, d1)>
#device_target_local = #hal.device.target<"local", [#executable_target_vmvx_bytecode_fb]> : !hal.device
#device_target_local1 = #hal.device.target<"local", [#executable_target_embedded_elf_x86_64_]> : !hal.device
stream.executable private @foo_dispatch_0 {
stream.executable.export public @foo_dispatch_0_set_encoding_LHS_DxD workgroups(%arg0: index, %arg1: index) -> (index, index, index) {
%x, %y, %z = flow.dispatch.workgroup_count_from_slice %arg0, %arg1
stream.return %x, %y, %z : index, index, index
}
builtin.module {
func.func @foo_dispatch_0_set_encoding_LHS_DxD(%arg0: !stream.binding, %arg1: index, %arg2: index, %arg3: !stream.binding) {
%c0 = arith.constant 0 : index
%0 = flow.dispatch.workload.ordinal %arg1, 0 : index
%1 = flow.dispatch.workload.ordinal %arg2, 1 : index
%2 = stream.binding.subspan %arg0[%c0] : !stream.binding -> !flow.dispatch.tensor<readonly:tensor<?x?xf32>>{%0, %1}
%3 = stream.binding.subspan %arg3[%c0] : !stream.binding -> !flow.dispatch.tensor<writeonly:tensor<?x?xf32, #iree_encoding.encoding<operand_index = 0 : index, op_type = matmul, element_types = [f32, f32, f32], user_indexing_maps = [#map, #map1, #map2], round_dims_to = array<i64: 32, 32, 32>>>>{%0, %1}
%4 = flow.dispatch.tensor.load %2, offsets = [0, 0], sizes = [%0, %1], strides = [1, 1] : !flow.dispatch.tensor<readonly:tensor<?x?xf32>>{%0, %1} -> tensor<?x?xf32>
%5 = iree_encoding.set_encoding %4 : tensor<?x?xf32> -> tensor<?x?xf32, #iree_encoding.encoding<operand_index = 0 : index, op_type = matmul, element_types = [f32, f32, f32], user_indexing_maps = [#map, #map1, #map2], round_dims_to = array<i64: 32, 32, 32>>>
flow.dispatch.tensor.store %5, %3, offsets = [0, 0], sizes = [%0, %1], strides = [1, 1] : tensor<?x?xf32, #iree_encoding.encoding<operand_index = 0 : index, op_type = matmul, element_types = [f32, f32, f32], user_indexing_maps = [#map, #map1, #map2], round_dims_to = array<i64: 32, 32, 32>>> -> !flow.dispatch.tensor<writeonly:tensor<?x?xf32, #iree_encoding.encoding<operand_index = 0 : index, op_type = matmul, element_types = [f32, f32, f32], user_indexing_maps = [#map, #map1, #map2], round_dims_to = array<i64: 32, 32, 32>>>>{%0, %1}
return
}
}
}
// ...
util.func public @foo(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view, %arg2: !hal.fence, %arg3: !hal.fence) -> !hal.buffer_view attributes {iree.abi.stub, iree.reflection = {iree.abi.declaration = "async func @foo(%input0: tensor<?x?xf32> {iree.abi.affinity = #hal.device.promise<@device_a>}, %input1: tensor<?x?xf32> {iree.abi.affinity = #hal.device.promise<@device_a>}) -> (%output0: tensor<?x?xf32> {iree.abi.affinity = #hal.device.promise<@device_a>})", iree.abi.model = "coarse-fences"}} {
// ...
%14 = stream.async.dispatch on(#hal.device.affinity<@device_a>) @foo_dispatch_0::@foo_dispatch_0_set_encoding_LHS_DxD[%0, %1](%6[%c0 to %2 for %2], %0, %1) : (!stream.resource<*>{%2}, index, index) -> !stream.resource<*>{%13}
// ...
%25 = stream.async.dispatch on(#hal.device.affinity<@device_b>) @foo_dispatch_0::@foo_dispatch_0_set_encoding_LHS_DxD[%0, %1](%22[%c0 to %2 for %2], %0, %1) : (!stream.resource<*>{%2}, index, index) -> !stream.resource<*>{%24}
// ....
IR after my SpecializedEncoding pass:
stream.executable private @foo_dispatch_0 { ... }
// cloned executable
stream.executable private @foo_dispatch_0_0 {
// The body is the same, no changed.
stream.executable.export public @foo_dispatch_0_set_encoding_LHS_DxD workgroups(%arg0: index, %arg1: index) -> (index, index, index) {
%x, %y, %z = flow.dispatch.workgroup_count_from_slice %arg0, %arg1
stream.return %x, %y, %z : index, index, index
}
builtin.module {
func.func @foo_dispatch_0_set_encoding_LHS_DxD(%arg0: !stream.binding, %arg1: index, %arg2: index, %arg3: !stream.binding) {
%c0 = arith.constant 0 : index
%0 = flow.dispatch.workload.ordinal %arg1, 0 : index
%1 = flow.dispatch.workload.ordinal %arg2, 1 : index
%2 = stream.binding.subspan %arg0[%c0] : !stream.binding -> !flow.dispatch.tensor<readonly:tensor<?x?xf32>>{%0, %1}
%3 = stream.binding.subspan %arg3[%c0] : !stream.binding -> !flow.dispatch.tensor<writeonly:tensor<?x?xf32, #iree_encoding.encoding<operand_index = 0 : index, op_type = matmul, element_types = [f32, f32, f32], user_indexing_maps = [#map, #map1, #map2], round_dims_to = array<i64: 32, 32, 32>>>>{%0, %1}
%4 = flow.dispatch.tensor.load %2, offsets = [0, 0], sizes = [%0, %1], strides = [1, 1] : !flow.dispatch.tensor<readonly:tensor<?x?xf32>>{%0, %1} -> tensor<?x?xf32>
%5 = iree_encoding.set_encoding %4 : tensor<?x?xf32> -> tensor<?x?xf32, #iree_encoding.encoding<operand_index = 0 : index, op_type = matmul, element_types = [f32, f32, f32], user_indexing_maps = [#map, #map1, #map2], round_dims_to = array<i64: 32, 32, 32>>>
flow.dispatch.tensor.store %5, %3, offsets = [0, 0], sizes = [%0, %1], strides = [1, 1] : tensor<?x?xf32, #iree_encoding.encoding<operand_index = 0 : index, op_type = matmul, element_types = [f32, f32, f32], user_indexing_maps = [#map, #map1, #map2], round_dims_to = array<i64: 32, 32, 32>>> -> !flow.dispatch.tensor<writeonly:tensor<?x?xf32, #iree_encoding.encoding<operand_index = 0 : index, op_type = matmul, element_types = [f32, f32, f32], user_indexing_maps = [#map, #map1, #map2], round_dims_to = array<i64: 32, 32, 32>>>>{%0, %1}
return
}
}
}
// ...
%14 = stream.async.dispatch on(#hal.device.affinity<@device_a>) @foo_dispatch_0::@foo_dispatch_0_set_encoding_LHS_DxD[%0, %1](%6[%c0 to %2 for %2], %0, %1) : (!stream.resource<*>{%2}, index, index) -> !stream.resource<*>{%13}
// The updated stream.async.dispatch op now has @foo_dispatch_0_0::@... entry point.
%25 = stream.async.dispatch on(#hal.device.affinity<@device_b>) @foo_dispatch_0_0::@foo_dispatch_0_set_encoding_LHS_DxD[%0, %1](%22[%c0 to %2 for %2], %0, %1) : (!stream.resource<*>{%2}, index, index) -> !stream.resource<*>{%24}
@benvanik do I clone the stream.executable for each unique "operand affinities list"? Or do I add a new stream.executable.export public op to the original stream.executable region and duplicate the func.func op instead?
Adding the status update to the issue:
This PR has the prototype: https://github.com/iree-org/iree/pull/18738, and you can find the design doc at https://hackmd.io/@hwPnnvLBTB-JGVMeh-bCEA/Sy9nvDhb1e
We had a great brainstorm yesterday, and here are the required items to land the prototype on the main branch:
We also chatted about the case where a device has several executable targets. IMO, we're able to specialize that case in my prototype. It will be the next TODO after I land my prototype on the main branch.
The other topic on my mind is how to cancel encodings properly. Ben suggested I look at it at the flow level and turn them into flow.clone ops when we know that the potential targets do not implement encodings.
The information can be queried by the same analysis -- but it's fuzzier. It's on my TODO list.
I have a prototype for compressing the encoding information. It still carries the whole config (as an intermediate step) when we populate the attributes from the HALAffinityAnalysisDialectInterface implementation. The main difference is that we introduce a cloneWithSimplifiedConfig interface method and call it when updating the encodings. In the final IR, the encoding_solver attributes carry the serialized MaterializeEncodingInfo, which defines the layouts in data-tiling. See the snippet below for the final IR dump.
On the codegen side, there are two groups of encodings: one has an encoding_solver and the other does not. The boundary operations (e.g., hal.binding, flow.dispatch.tensor.load/store, etc.) have solver attributes, which describe the incoming layout. The other operations do not have solver attributes; they are all compute ops, which means they will be executed on the device attached to the hal.executable.variant. In the materialization, we'll need to update the logic for boundary operations: if they have a different layout, we'll need to undo the relayout from device_a and reapply the relayout for device_b. Those undo-relayout operations need to be queried from the encoding solver attribute. The plan is to cancel the encodings when the layouts mismatch; it is just not implemented in my prototype yet. It's fixable. This approach avoids encoding propagation and reduces the difficulty that Mahesh pointed out.
@benvanik @bjacob @qedawkins @MaheshRavishankar Does the way that I define layouts look good to you?
#encoding_solver = #iree_cpu.cpu_encoding_solver<>
#encoding_solver1 = #iree_cpu.vmvx_encoding_solver<>
#encoding_solver2 = #iree_cpu.cpu_encoding_solver<target_configuration = {innerDimsPos = [0, 1], innerTileSizes = [16, 1], outerDimsPerm = [0, 1]}>
#encoding_solver3 = #iree_cpu.vmvx_encoding_solver<target_configuration = {innerDimsPos = [0, 1], innerTileSizes = [-9223372036854775808, -9223372036854775808], outerDimsPerm = [0, 1]}>
#encoding_solver4 = #iree_cpu.cpu_encoding_solver<target_configuration = {innerDimsPos = [1, 0], innerTileSizes = [16, 1], outerDimsPerm = [1, 0]}>
#encoding_solver5 = #iree_cpu.vmvx_encoding_solver<target_configuration = {innerDimsPos = [1, 0], innerTileSizes = [-9223372036854775808, -9223372036854775808], outerDimsPerm = [1, 0]}>
#encoding_solver6 = #iree_cpu.cpu_encoding_solver<target_configuration = {innerDimsPos = [0, 1], innerTileSizes = [16, 16], outerDimsPerm = [0, 1]}>
#map = affine_map<(d0, d1, d2) -> (d0, d2)>
#map1 = affine_map<(d0, d1, d2) -> (d2, d1)>
#map2 = affine_map<(d0, d1, d2) -> (d0, d1)>
#map3 = affine_map<(d0, d1) -> (d0, d1)>
#encoding = #iree_encoding.encoding<operand_index = 0 : index, op_type = matmul, element_types = [f32, f32, f32], user_indexing_maps = [#map, #map1, #map2], targets = [#encoding_solver2]>
#encoding1 = #iree_encoding.encoding<operand_index = 0 : index, op_type = matmul, element_types = [f32, f32, f32], user_indexing_maps = [#map, #map1, #map2], round_dims_to = array<i64: 32, 32, 32>>
#encoding2 = #iree_encoding.encoding<operand_index = 0 : index, op_type = matmul, element_types = [f32, f32, f32], user_indexing_maps = [#map, #map1, #map2], targets = [#encoding_solver3]>
#encoding3 = #iree_encoding.encoding<operand_index = 1 : index, op_type = matmul, element_types = [f32, f32, f32], user_indexing_maps = [#map, #map1, #map2], targets = [#encoding_solver4]>
#encoding4 = #iree_encoding.encoding<operand_index = 1 : index, op_type = matmul, element_types = [f32, f32, f32], user_indexing_maps = [#map, #map1, #map2], round_dims_to = array<i64: 32, 32, 32>>
#encoding5 = #iree_encoding.encoding<operand_index = 1 : index, op_type = matmul, element_types = [f32, f32, f32], user_indexing_maps = [#map, #map1, #map2], targets = [#encoding_solver5]>
#encoding6 = #iree_encoding.encoding<operand_index = 2 : index, op_type = matmul, element_types = [f32, f32, f32], user_indexing_maps = [#map, #map1, #map2], targets = [#encoding_solver6]>
#encoding7 = #iree_encoding.encoding<operand_index = 2 : index, op_type = matmul, element_types = [f32, f32, f32], user_indexing_maps = [#map, #map1, #map2], round_dims_to = array<i64: 32, 32, 32>>
#encoding8 = #iree_encoding.encoding<operand_index = 2 : index, op_type = matmul, element_types = [f32, f32, f32], user_indexing_maps = [#map, #map1, #map2], targets = [#encoding_solver3]>
Okay, I have verified that e2e is working with the new changes. I still don't have a good name for the attribute interface; perhaps I'll just call it EncodingAttrInterface for now, and we can always rename it later. So I'm going to start breaking down my prototype and landing it on the main branch.
I think we can name it EncodingLayoutAttrInterface. All the methods are about layouts, e.g., storage size calculation, materialized layout shape, generating operations for the device layout, etc. I also want to replace the new targets field with layouts in the EncodingAttr.
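A hypothetical sketch of how the renamed interface and the layouts field could be consumed; the method name echoes the storage-size query discussed earlier, but the signature is a placeholder:

```cpp
// Each layout attribute on the encoding answers layout questions (storage
// size, materialized shape, relayout op generation) for one target.
auto encoding =
    dyn_cast_or_null<IREE::Encoding::EncodingAttr>(tensorType.getEncoding());
if (encoding && encoding.getLayouts()) {
  for (Attribute layout : encoding.getLayouts().getValue()) {
    auto layoutAttr = cast<IREE::Encoding::EncodingLayoutAttrInterface>(layout);
    Value size = layoutAttr.calculateStorageElementCountInBytes(
        builder, loc, tensorType, dynamicDims);
    (void)size;
  }
}
```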
The branch demonstrates how data-tiling + heterogeneous computing run together in IREE: https://github.com/iree-org/iree/pull/18738
Design Doc: https://hackmd.io/@hwPnnvLBTB-JGVMeh-bCEA/Sy9nvDhb1e
IR dump: https://gist.github.com/hanhanW/5029dc652aec1379102e43e702aaf15b
How I think about buffer allocation in data-tiling
What we can get from here is:
Execution Plan
Retire the query_upper_bound op and the CPUMaterializeUpperBoundTileSize pass.
Goal: remove old operations and decouple the dependency between HAL and the CPU-specific pass.
Plan: update the max_padding semantics in the encoding. If it is set, the backend should take it into account and select appropriate inner tile sizes (to avoid out-of-bounds access). If it is not set, the backend can pick whatever inner tile sizes it wants. In the current default path (which will eventually be moved to preprocessing), we do not set the max_padding attribute. In the path that we're building, we set the max_padding attribute to hint the actual buffer size for Stream.
Finish the data-tiling fusion and basic functional GPU data-tiling.
See https://github.com/iree-org/iree/issues/17722 for more details. Basically we want to enable fusion for mmt4d ops on the CPU side and build the data-tiling path for GPU. There are some changes needed in the CPU backend, because mmt4d fusion is new. It is scoped in the #17722 issue.
Outcome: we'll be able to flip data-tiling to the fusion path and use data-tiling in the multi-device project.
Move SetEncoding and MaterializeEncoding from GlobalOpt to preprocessing.
Learn buffer allocation for multi-device (i.e., LCM?)
More items: TBD
cc @MaheshRavishankar @benvanik @bjacob @Max191 @pashu123 @lialan