monorimet opened this issue 5 months ago (status: Open)
--iree-vulkan-target-triple is missing, so this compiles against the most restrictive SwiftShader target. I don't think we would prioritize fixing that target for now. Could you add RDNA targets to try out?
Ran into the same issue with the following two CLI inputs:
With SRT, using all flags normally used for this config in SHARK:
iree-compile.exe C:\V\SHARK\apps\shark_studio\web\shark_tmp\vae_decode.torch.tempfile --iree-input-type=torch --iree-vm-bytecode-module-output-format=flatbuffer-binary --iree-hal-target-backends=vulkan --mlir-print-debuginfo --mlir-print-op-on-diagnostic=false --mlir-pass-pipeline-crash-reproducer=./shark_tmp/core-reproducer.mlir --iree-llvmcpu-target-cpu-features=host --iree-stream-resource-max-allocation-size=3221225472 --iree-vulkan-target-env='#vk.target_env<v1.3, r(120), [VK_KHR_16bit_storage, VK_KHR_8bit_storage, VK_KHR_shader_float16_int8, VK_KHR_spirv_1_4, VK_KHR_storage_buffer_storage_class, VK_KHR_variable_pointers, VK_EXT_subgroup_size_control, VK_KHR_cooperative_matrix], AMD:DiscreteGPU, #vk.caps< maxComputeSharedMemorySize = 65536, maxComputeWorkGroupInvocations = 1024, maxComputeWorkGroupSize = dense<[1024, 1024, 1024]>: vector<3xi32>, subgroupSize = 64, subgroupFeatures = 255: i32, minSubgroupSize = 32, maxSubgroupSize = 64, shaderFloat16 = unit, shaderFloat64 = unit, shaderInt8 = unit, shaderInt16 = unit, shaderInt64 = unit, storageBuffer16BitAccess = unit, storagePushConstant16 = unit, uniformAndStorageBuffer16BitAccess = unit, storageBuffer8BitAccess = unit, storagePushConstant8 = unit, uniformAndStorageBuffer8BitAccess = unit, variablePointers = unit, variablePointersStorageBuffer = unit, cooperativeMatrixPropertiesKHR = [#vk.coop_matrix_props<mSize = 16, nSize = 16, kSize = 16, aType = f16, bType = f16, cType = f16, resultType = f16, accSat = false, scope = #vk.scope<Subgroup>>], shaderIntegerDotProduct = unit >>' --iree-vm-bytecode-module-strip-source-map=true --iree-util-zero-fill-elided-attrs --iree-opt-strip-assertions=true --verify=false --iree-flow-collapse-reduction-dims --iree-opt-const-expr-hoisting=False --iree-codegen-linalg-max-constant-fold-elements=9223372036854775807 
--iree-preprocessing-pass-pipeline='builtin.module(func.func(iree-global-opt-detach-elementwise-from-named-ops,iree-global-opt-convert-1x1-filter-conv2d-to-matmul,iree-preprocessing-convert-conv2d-to-img2col,iree-preprocessing-pad-linalg-ops{pad-size=32},iree-linalg-ext-convert-conv2d-to-winograd))' --iree-preprocessing-pass-pipeline='builtin.module(func.func(iree-preprocessing-pad-linalg-ops{pad-size=16}))' --iree-vulkan-target-triple=rdna3-7900-windows-msvc
Minimized, with the target triple flag:
iree-compile.exe C:\V\SHARK\apps\shark_studio\web\shark_tmp\vae_decode.torch.tempfile --iree-input-type=torch --iree-vulkan-target-triple=rdna3-7900-windows --iree-vm-bytecode-module-output-format=flatbuffer-binary --iree-hal-target-backends=vulkan -o vae.vmfb
I managed to prevent the failure on torch.aten.convolution by cleaning up some preprocessing flags:
--iree-preprocessing-pass-pipeline='builtin.module(func.func(iree-global-opt-detach-elementwise-from-named-ops,iree-preprocessing-convert-conv2d-to-img2col,iree-global-opt-convert-1x1-filter-conv2d-to-matmul,iree-preprocessing-pad-linalg-ops{pad-size=32},iree-linalg-ext-convert-conv2d-to-winograd))'
but I'm still seeing issues with:
Diagnostics:
<unknown>:0: error: failed to legalize operation 'arith.constant'
C:\V\SHARK\apps\shark_studio\web\shark_tmp\vae_encode.torch.tempfile:2874:13: error: Failures have been detected while processing an MLIR pass pipeline
%1210 = torch.prims.convert_element_type %1209, %int6_1284 : !torch.vtensor<[1,32,4,262144],f16>, !torch.int -> !torch.vtensor<[1,32,4,262144],f32>
^
C:\V\SHARK\apps\shark_studio\web\shark_tmp\vae_encode.torch.tempfile:250:3: note: called from
func.func @main(%arg0: tensor<1x4x64x64xf16>) -> tensor<1x3x512x512xf16> attributes {torch.args_schema = "[1, {\22type\22: \22builtins.tuple\22, \22context\22: \22null\22, \22children_spec\22: [{\22type\22: \22builtins.list\22, \22context\22: \22null\22, \22children_spec\22: [{\22type\22: null, \22context\22: null, \22children_spec\22: []}]}, {\22type\22: \22builtins.dict\22, \22context\22: \22[]\22, \22children_spec\22: []}]}]", torch.return_schema = "[1, {\22type\22: null, \22context\22: null, \22children_spec\22: []}]"} {
^
C:\V\SHARK\apps\shark_studio\web\shark_tmp\vae_encode.torch.tempfile:2874:13: note: Pipeline failed while executing [`TranslateExecutablesPass` on 'hal.executable' operation: @main_dispatch_220, `TranslateTargetExecutableVariantsPass` on 'hal.executable.variant' operation: @vulkan_spirv_fb, `TranslateExecutablesPass` on 'hal.executable' operation: @main_dispatch_229, `TranslateTargetExecutableVariantsPass` on 'hal.executable.variant' operation: @vulkan_spirv_fb, `TranslateExecutablesPass` on 'hal.executable' operation: @main_dispatch_237, `TranslateTargetExecutableVariantsPass` on 'hal.executable.variant' operation: @vulkan_spirv_fb, `TranslateExecutablesPass` on 'hal.executable' operation: @main_dispatch_245, `TranslateTargetExecutableVariantsPass` on 'hal.executable.variant' operation: @vulkan_spirv_fb, `TranslateExecutablesPass` on 'hal.executable' operation: @main_dispatch_253, `TranslateTargetExecutableVariantsPass` on 'hal.executable.variant' operation: @vulkan_spirv_fb, `TranslateExecutablesPass` on 'hal.executable' operation: @main_dispatch_261, `TranslateTargetExecutableVariantsPass` on 'hal.executable.variant' operation: @vulkan_spirv_fb, `FlattenMemRefSubspan` on 'builtin.module' operation, `FlattenMemRefSubspan` on 'builtin.module' operation, `FlattenMemRefSubspan` on 'builtin.module' operation, `FlattenMemRefSubspan` on 'builtin.module' operation, `ConvertToSPIRV` on 'builtin.module' operation, `ConvertToSPIRV` on 'builtin.module' operation]: reproducer generated at `./shark_tmp/core-reproducer.mlir`
%1210 = torch.prims.convert_element_type %1209, %int6_1284 : !torch.vtensor<[1,32,4,262144],f16>, !torch.int -> !torch.vtensor<[1,32,4,262144],f32>
^
C:\V\SHARK\apps\shark_studio\web\shark_tmp\vae_encode.torch.tempfile:2874:13: error: failed to run translation of source executable to target executable for backend #hal.executable.target<"vulkan", "vulkan-spirv-fb", {spirv.target_env = #spirv.target_env<#spirv.vce<v1.6, [Shader, Float64, Float16, Int64, Int16, Int8, StorageBuffer16BitAccess, StorageUniform16, StoragePushConstant16, StorageBuffer8BitAccess, UniformAndStorageBuffer8BitAccess, StoragePushConstant8, GroupNonUniform, GroupNonUniformVote, GroupNonUniformArithmetic, GroupNonUniformBallot, GroupNonUniformShuffle, GroupNonUniformShuffleRelative, GroupNonUniformClustered, GroupNonUniformQuad, VariablePointers, VariablePointersStorageBuffer, DotProduct, DotProductInputAll, DotProductInput4x8BitPacked, DotProductInput4x8Bit, CooperativeMatrixKHR], [SPV_KHR_16bit_storage, SPV_KHR_8bit_storage, SPV_KHR_storage_buffer_storage_class, SPV_KHR_variable_pointers, SPV_KHR_cooperative_matrix]>, api=Vulkan, AMD:DiscreteGPU, #spirv.resource_limits<max_compute_shared_memory_size = 65536, max_compute_workgroup_invocations = 1024, max_compute_workgroup_size = [1024, 1024, 1024], subgroup_size = 64, min_subgroup_size = 32, max_subgroup_size = 64, cooperative_matrix_properties_khr = [#spirv.coop_matrix_props_khr<m_size = 16, n_size = 16, k_size = 16, a_type = f16, b_type = f16, c_type = f16, result_type = f16, acc_sat = false, scope = <Subgroup>>]>>}>
%1210 = torch.prims.convert_element_type %1209, %int6_1284 : !torch.vtensor<[1,32,4,262144],f16>, !torch.int -> !torch.vtensor<[1,32,4,262144],f32>
^
C:\V\SHARK\apps\shark_studio\web\shark_tmp\vae_encode.torch.tempfile:250:3: note: called from
func.func @main(%arg0: tensor<1x4x64x64xf16>) -> tensor<1x3x512x512xf16> attributes {torch.args_schema = "[1, {\22type\22: \22builtins.tuple\22, \22context\22: \22null\22, \22children_spec\22: [{\22type\22: \22builtins.list\22, \22context\22: \22null\22, \22children_spec\22: [{\22type\22: null, \22context\22: null, \22children_spec\22: []}]}, {\22type\22: \22builtins.dict\22, \22context\22: \22[]\22, \22children_spec\22: []}]}]", torch.return_schema = "[1, {\22type\22: null, \22context\22: null, \22children_spec\22: []}]"} {
^
C:\V\SHARK\apps\shark_studio\web\shark_tmp\vae_encode.torch.tempfile:2874:13: error: failed to translate executables
%1210 = torch.prims.convert_element_type %1209, %int6_1284 : !torch.vtensor<[1,32,4,262144],f16>, !torch.int -> !torch.vtensor<[1,32,4,262144],f32>
^
C:\V\SHARK\apps\shark_studio\web\shark_tmp\vae_encode.torch.tempfile:250:3: note: called from
func.func @main(%arg0: tensor<1x4x64x64xf16>) -> tensor<1x3x512x512xf16> attributes {torch.args_schema = "[1, {\22type\22: \22builtins.tuple\22, \22context\22: \22null\22, \22children_spec\22: [{\22type\22: \22builtins.list\22, \22context\22: \22null\22, \22children_spec\22: [{\22type\22: null, \22context\22: null, \22children_spec\22: []}]}, {\22type\22: \22builtins.dict\22, \22context\22: \22[]\22, \22children_spec\22: []}]}]", torch.return_schema = "[1, {\22type\22: null, \22context\22: null, \22children_spec\22: []}]"} {
^
<unknown>:0: error: failed to legalize operation 'arith.constant'
C:\V\SHARK\apps\shark_studio\web\shark_tmp\vae_encode.torch.tempfile:2987:13: error: failed to run translation of source executable to target executable for backend #hal.executable.target<"vulkan", "vulkan-spirv-fb", {spirv.target_env = #spirv.target_env<#spirv.vce<v1.6, [Shader, Float64, Float16, Int64, Int16, Int8, StorageBuffer16BitAccess, StorageUniform16, StoragePushConstant16, StorageBuffer8BitAccess, UniformAndStorageBuffer8BitAccess, StoragePushConstant8, GroupNonUniform, GroupNonUniformVote, GroupNonUniformArithmetic, GroupNonUniformBallot, GroupNonUniformShuffle, GroupNonUniformShuffleRelative, GroupNonUniformClustered, GroupNonUniformQuad, VariablePointers, VariablePointersStorageBuffer, DotProduct, DotProductInputAll, DotProductInput4x8BitPacked, DotProductInput4x8Bit, CooperativeMatrixKHR], [SPV_KHR_16bit_storage, SPV_KHR_8bit_storage, SPV_KHR_storage_buffer_storage_class, SPV_KHR_variable_pointers, SPV_KHR_cooperative_matrix]>, api=Vulkan, AMD:DiscreteGPU, #spirv.resource_limits<max_compute_shared_memory_size = 65536, max_compute_workgroup_invocations = 1024, max_compute_workgroup_size = [1024, 1024, 1024], subgroup_size = 64, min_subgroup_size = 32, max_subgroup_size = 64, cooperative_matrix_properties_khr = [#spirv.coop_matrix_props_khr<m_size = 16, n_size = 16, k_size = 16, a_type = f16, b_type = f16, c_type = f16, result_type = f16, acc_sat = false, scope = <Subgroup>>]>>}>
%1262 = torch.prims.convert_element_type %1261, %int6_1340 : !torch.vtensor<[1,32,4,262144],f16>, !torch.int -> !torch.vtensor<[1,32,4,262144],f32>
^
C:\V\SHARK\apps\shark_studio\web\shark_tmp\vae_encode.torch.tempfile:250:3: note: called from
func.func @main(%arg0: tensor<1x4x64x64xf16>) -> tensor<1x3x512x512xf16> attributes {torch.args_schema = "[1, {\22type\22: \22builtins.tuple\22, \22context\22: \22null\22, \22children_spec\22: [{\22type\22: \22builtins.list\22, \22context\22: \22null\22, \22children_spec\22: [{\22type\22: null, \22context\22: null, \22children_spec\22: []}]}, {\22type\22: \22builtins.dict\22, \22context\22: \22[]\22, \22children_spec\22: []}]}]", torch.return_schema = "[1, {\22type\22: null, \22context\22: null, \22children_spec\22: []}]"} {
^
C:\V\SHARK\apps\shark_studio\web\shark_tmp\vae_encode.torch.tempfile:2987:13: error: failed to translate executables
%1262 = torch.prims.convert_element_type %1261, %int6_1340 : !torch.vtensor<[1,32,4,262144],f16>, !torch.int -> !torch.vtensor<[1,32,4,262144],f32>
^
C:\V\SHARK\apps\shark_studio\web\shark_tmp\vae_encode.torch.tempfile:250:3: note: called from
func.func @main(%arg0: tensor<1x4x64x64xf16>) -> tensor<1x3x512x512xf16> attributes {torch.args_schema = "[1, {\22type\22: \22builtins.tuple\22, \22context\22: \22null\22, \22children_spec\22: [{\22type\22: \22builtins.list\22, \22context\22: \22null\22, \22children_spec\22: [{\22type\22: null, \22context\22: null, \22children_spec\22: []}]}, {\22type\22: \22builtins.dict\22, \22context\22: \22[]\22, \22children_spec\22: []}]}]", torch.return_schema = "[1, {\22type\22: null, \22context\22: null, \22children_spec\22: []}]"} {
^
<unknown>:0: error: failed to legalize operation 'arith.constant'
C:\V\SHARK\apps\shark_studio\web\shark_tmp\vae_encode.torch.tempfile:3075:13: error: failed to run translation of source executable to target executable for backend #hal.executable.target<"vulkan", "vulkan-spirv-fb", {spirv.target_env = #spirv.target_env<#spirv.vce<v1.6, [Shader, Float64, Float16, Int64, Int16, Int8, StorageBuffer16BitAccess, StorageUniform16, StoragePushConstant16, StorageBuffer8BitAccess, UniformAndStorageBuffer8BitAccess, StoragePushConstant8, GroupNonUniform, GroupNonUniformVote, GroupNonUniformArithmetic, GroupNonUniformBallot, GroupNonUniformShuffle, GroupNonUniformShuffleRelative, GroupNonUniformClustered, GroupNonUniformQuad, VariablePointers, VariablePointersStorageBuffer, DotProduct, DotProductInputAll, DotProductInput4x8BitPacked, DotProductInput4x8Bit, CooperativeMatrixKHR], [SPV_KHR_16bit_storage, SPV_KHR_8bit_storage, SPV_KHR_storage_buffer_storage_class, SPV_KHR_variable_pointers, SPV_KHR_cooperative_matrix]>, api=Vulkan, AMD:DiscreteGPU, #spirv.resource_limits<max_compute_shared_memory_size = 65536, max_compute_workgroup_invocations = 1024, max_compute_workgroup_size = [1024, 1024, 1024], subgroup_size = 64, min_subgroup_size = 32, max_subgroup_size = 64, cooperative_matrix_properties_khr = [#spirv.coop_matrix_props_khr<m_size = 16, n_size = 16, k_size = 16, a_type = f16, b_type = f16, c_type = f16, result_type = f16, acc_sat = false, scope = <Subgroup>>]>>}>
%1304 = torch.prims.convert_element_type %1303, %int6_1383 : !torch.vtensor<[1,32,4,262144],f16>, !torch.int -> !torch.vtensor<[1,32,4,262144],f32>
^
C:\V\SHARK\apps\shark_studio\web\shark_tmp\vae_encode.torch.tempfile:250:3: note: called from
func.func @main(%arg0: tensor<1x4x64x64xf16>) -> tensor<1x3x512x512xf16> attributes {torch.args_schema = "[1, {\22type\22: \22builtins.tuple\22, \22context\22: \22null\22, \22children_spec\22: [{\22type\22: \22builtins.list\22, \22context\22: \22null\22, \22children_spec\22: [{\22type\22: null, \22context\22: null, \22children_spec\22: []}]}, {\22type\22: \22builtins.dict\22, \22context\22: \22[]\22, \22children_spec\22: []}]}]", torch.return_schema = "[1, {\22type\22: null, \22context\22: null, \22children_spec\22: []}]"} {
^
C:\V\SHARK\apps\shark_studio\web\shark_tmp\vae_encode.torch.tempfile:3075:13: error: failed to translate executables
%1304 = torch.prims.convert_element_type %1303, %int6_1383 : !torch.vtensor<[1,32,4,262144],f16>, !torch.int -> !torch.vtensor<[1,32,4,262144],f32>
Narrowed this down to the attached dispatch; it can be reproduced with:
iree-compile.exe dispatch_189.mlir --iree-input-type=torch --iree-vm-bytecode-module-output-format=flatbuffer-binary --iree-hal-target-backends=vulkan --mlir-print-debuginfo --mlir-print-op-on-diagnostic=false --mlir-pass-pipeline-crash-reproducer=./shark_tmp/core-reproducer.mlir --iree-llvmcpu-target-cpu-features=host --iree-stream-resource-max-allocation-size=3221225472 --iree-vulkan-target-env='#vk.target_env<v1.3, r(120), [VK_KHR_16bit_storage, VK_KHR_8bit_storage, VK_KHR_shader_float16_int8, VK_KHR_spirv_1_4, VK_KHR_storage_buffer_storage_class, VK_KHR_variable_pointers, VK_EXT_subgroup_size_control, VK_KHR_cooperative_matrix], AMD:DiscreteGPU, #vk.caps< maxComputeSharedMemorySize = 65536, maxComputeWorkGroupInvocations = 1024, maxComputeWorkGroupSize = dense<[1024, 1024, 1024]>: vector<3xi32>, subgroupSize = 64, subgroupFeatures = 255: i32, minSubgroupSize = 32, maxSubgroupSize = 64, shaderFloat16 = unit, shaderFloat64 = unit, shaderInt8 = unit, shaderInt16 = unit, shaderInt64 = unit, storageBuffer16BitAccess = unit, storagePushConstant16 = unit, uniformAndStorageBuffer16BitAccess = unit, storageBuffer8BitAccess = unit, storagePushConstant8 = unit, uniformAndStorageBuffer8BitAccess = unit, variablePointers = unit, variablePointersStorageBuffer = unit, cooperativeMatrixPropertiesKHR = [#vk.coop_matrix_props<mSize = 16, nSize = 16, kSize = 16, aType = f16, bType = f16, cType = f16, resultType = f16, accSat = false, scope = #vk.scope<Subgroup>>], shaderIntegerDotProduct = unit >>' --iree-vm-bytecode-module-strip-source-map=true --iree-util-zero-fill-elided-attrs --iree-opt-strip-assertions=true --verify=false --iree-opt-const-expr-hoisting=False --iree-codegen-linalg-max-constant-fold-elements=9223372036854775807 
--iree-preprocessing-pass-pipeline='builtin.module(func.func(iree-global-opt-detach-elementwise-from-named-ops,iree-preprocessing-convert-conv2d-to-img2col,iree-global-opt-convert-1x1-filter-conv2d-to-matmul,iree-preprocessing-pad-linalg-ops{pad-size=16}))' --iree-vulkan-target-triple=rdna3-7900-windows-msvc
hal.executable public @main_dispatch_189 {
hal.executable.variant public @vulkan_spirv_fb target(<"vulkan", "vulkan-spirv-fb", {spirv.target_env = #spirv.target_env<#spirv.vce<v1.6, [Shader, Float64, Float16, Int64, Int16, Int8, StorageBuffer16BitAccess, StorageUniform16, StoragePushConstant16, StorageBuffer8BitAccess, UniformAndStorageBuffer8BitAccess, StoragePushConstant8, GroupNonUniform, GroupNonUniformVote, GroupNonUniformArithmetic, GroupNonUniformBallot, GroupNonUniformShuffle, GroupNonUniformShuffleRelative, GroupNonUniformClustered, GroupNonUniformQuad, VariablePointers, VariablePointersStorageBuffer, DotProduct, DotProductInputAll, DotProductInput4x8BitPacked, DotProductInput4x8Bit, CooperativeMatrixKHR], [SPV_KHR_16bit_storage, SPV_KHR_8bit_storage, SPV_KHR_storage_buffer_storage_class, SPV_KHR_variable_pointers, SPV_KHR_cooperative_matrix]>, api=Vulkan, AMD:DiscreteGPU, #spirv.resource_limits<max_compute_shared_memory_size = 65536, max_compute_workgroup_invocations = 1024, max_compute_workgroup_size = [1024, 1024, 1024], subgroup_size = 64, min_subgroup_size = 32, max_subgroup_size = 64, cooperative_matrix_properties_khr = [#spirv.coop_matrix_props_khr<m_size = 16, n_size = 16, k_size = 16, a_type = f16, b_type = f16, c_type = f16, result_type = f16, acc_sat = false, scope = <Subgroup>>]>>}>) {
hal.executable.export public @main_dispatch_189_matmul_256x262144x2304_f16 ordinal(0) layout(#hal.pipeline.layout<push_constants = 0, sets = [<0, bindings = [<0, storage_buffer, ReadOnly>, <1, storage_buffer, ReadOnly>, <2, storage_buffer>, <3, storage_buffer>]>]>) {
^bb0(%arg0: !hal.device):
%x, %y, %z = flow.dispatch.workgroup_count_from_slice
hal.return %x, %y, %z : index, index, index
}
builtin.module {
func.func @main_dispatch_189_matmul_256x262144x2304_f16() {
%cst = arith.constant 0.000000e+00 : f16
%c95657600 = arith.constant 95657600 : index
%c96837248 = arith.constant 96837248 : index
%c403703808 = arith.constant 403703808 : index
%c0 = arith.constant 0 : index
%c134217728 = arith.constant 134217728 : index
%0 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c95657600) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<256x2304xf16>>
%1 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%c403703808) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<2304x262144xf16>>
%2 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c96837248) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<256xf16>>
%3 = hal.interface.binding.subspan set(0) binding(2) type(storage_buffer) alignment(64) offset(%c0) : !flow.dispatch.tensor<writeonly:tensor<256x262144xf16>>
%4 = hal.interface.binding.subspan set(0) binding(3) type(storage_buffer) alignment(64) offset(%c134217728) : !flow.dispatch.tensor<writeonly:tensor<256x262144xf32>>
%5 = flow.dispatch.tensor.load %0, offsets = [0, 0], sizes = [256, 2304], strides = [1, 1] : !flow.dispatch.tensor<readonly:tensor<256x2304xf16>> -> tensor<256x2304xf16>
%6 = flow.dispatch.tensor.load %1, offsets = [0, 0], sizes = [2304, 262144], strides = [1, 1] : !flow.dispatch.tensor<readonly:tensor<2304x262144xf16>> -> tensor<2304x262144xf16>
%7 = flow.dispatch.tensor.load %2, offsets = [0], sizes = [256], strides = [1] : !flow.dispatch.tensor<readonly:tensor<256xf16>> -> tensor<256xf16>
%8 = tensor.empty() : tensor<256x262144xf32>
%9 = tensor.empty() : tensor<256x262144xf16>
%10 = linalg.fill ins(%cst : f16) outs(%9 : tensor<256x262144xf16>) -> tensor<256x262144xf16>
%11 = linalg.matmul ins(%5, %6 : tensor<256x2304xf16>, tensor<2304x262144xf16>) outs(%10 : tensor<256x262144xf16>) -> tensor<256x262144xf16>
%12:2 = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0)>, affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0, d1)>], iterator_types = ["parallel", "parallel"]} ins(%11, %7 : tensor<256x262144xf16>, tensor<256xf16>) outs(%9, %8 : tensor<256x262144xf16>, tensor<256x262144xf32>) {
^bb0(%in: f16, %in_0: f16, %out: f16, %out_1: f32):
%13 = arith.addf %in, %in_0 : f16
%14 = arith.extf %13 : f16 to f32
linalg.yield %13, %14 : f16, f32
} -> (tensor<256x262144xf16>, tensor<256x262144xf32>)
flow.dispatch.tensor.store %12#0, %3, offsets = [0, 0], sizes = [256, 262144], strides = [1, 1] : tensor<256x262144xf16> -> !flow.dispatch.tensor<writeonly:tensor<256x262144xf16>>
flow.dispatch.tensor.store %12#1, %4, offsets = [0, 0], sizes = [256, 262144], strides = [1, 1] : tensor<256x262144xf32> -> !flow.dispatch.tensor<writeonly:tensor<256x262144xf32>>
return
}
}
}
hal.executable.variant public @vulkan_spirv_fb_0 target(<"vulkan", "vulkan-spirv-fb", {spirv.target_env = #spirv.target_env<#spirv.vce<v1.6, [Shader, Float64, Float16, Int64, Int16, Int8, StorageBuffer16BitAccess, StorageUniform16, StoragePushConstant16, StorageBuffer8BitAccess, UniformAndStorageBuffer8BitAccess, StoragePushConstant8, GroupNonUniform, GroupNonUniformVote, GroupNonUniformArithmetic, GroupNonUniformBallot, GroupNonUniformShuffle, GroupNonUniformShuffleRelative, GroupNonUniformClustered, GroupNonUniformQuad, PhysicalStorageBufferAddresses, VariablePointers, VariablePointersStorageBuffer, DotProduct, DotProductInputAll, DotProductInput4x8BitPacked, DotProductInput4x8Bit, CooperativeMatrixKHR], [SPV_KHR_16bit_storage, SPV_KHR_8bit_storage, SPV_KHR_integer_dot_product, SPV_KHR_storage_buffer_storage_class, SPV_KHR_physical_storage_buffer, SPV_KHR_variable_pointers, SPV_KHR_cooperative_matrix]>, api=Vulkan, AMD:DiscreteGPU, #spirv.resource_limits<max_compute_shared_memory_size = 65536, max_compute_workgroup_invocations = 1024, max_compute_workgroup_size = [1024, 1024, 1024], subgroup_size = 64, min_subgroup_size = 32, max_subgroup_size = 64, cooperative_matrix_properties_khr = [#spirv.coop_matrix_props_khr<m_size = 16, n_size = 16, k_size = 16, a_type = i8, b_type = i8, c_type = i32, result_type = i32, acc_sat = false, scope = <Subgroup>>, #spirv.coop_matrix_props_khr<m_size = 16, n_size = 16, k_size = 16, a_type = f16, b_type = f16, c_type = f16, result_type = f16, acc_sat = false, scope = <Subgroup>>, #spirv.coop_matrix_props_khr<m_size = 16, n_size = 16, k_size = 16, a_type = f16, b_type = f16, c_type = f32, result_type = f32, acc_sat = false, scope = <Subgroup>>]>>}>) {
hal.executable.export public @main_dispatch_189_matmul_256x262144x2304_f16 ordinal(0) layout(#hal.pipeline.layout<push_constants = 0, sets = [<0, bindings = [<0, storage_buffer, ReadOnly>, <1, storage_buffer, ReadOnly>, <2, storage_buffer>, <3, storage_buffer>]>]>) {
^bb0(%arg0: !hal.device):
%x, %y, %z = flow.dispatch.workgroup_count_from_slice
hal.return %x, %y, %z : index, index, index
}
builtin.module {
func.func @main_dispatch_189_matmul_256x262144x2304_f16() {
%cst = arith.constant 0.000000e+00 : f16
%c95657600 = arith.constant 95657600 : index
%c96837248 = arith.constant 96837248 : index
%c403703808 = arith.constant 403703808 : index
%c0 = arith.constant 0 : index
%c134217728 = arith.constant 134217728 : index
%0 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c95657600) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<256x2304xf16>>
%1 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%c403703808) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<2304x262144xf16>>
%2 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c96837248) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<256xf16>>
%3 = hal.interface.binding.subspan set(0) binding(2) type(storage_buffer) alignment(64) offset(%c0) : !flow.dispatch.tensor<writeonly:tensor<256x262144xf16>>
%4 = hal.interface.binding.subspan set(0) binding(3) type(storage_buffer) alignment(64) offset(%c134217728) : !flow.dispatch.tensor<writeonly:tensor<256x262144xf32>>
%5 = flow.dispatch.tensor.load %0, offsets = [0, 0], sizes = [256, 2304], strides = [1, 1] : !flow.dispatch.tensor<readonly:tensor<256x2304xf16>> -> tensor<256x2304xf16>
%6 = flow.dispatch.tensor.load %1, offsets = [0, 0], sizes = [2304, 262144], strides = [1, 1] : !flow.dispatch.tensor<readonly:tensor<2304x262144xf16>> -> tensor<2304x262144xf16>
%7 = flow.dispatch.tensor.load %2, offsets = [0], sizes = [256], strides = [1] : !flow.dispatch.tensor<readonly:tensor<256xf16>> -> tensor<256xf16>
%8 = tensor.empty() : tensor<256x262144xf32>
%9 = tensor.empty() : tensor<256x262144xf16>
%10 = linalg.fill ins(%cst : f16) outs(%9 : tensor<256x262144xf16>) -> tensor<256x262144xf16>
%11 = linalg.matmul ins(%5, %6 : tensor<256x2304xf16>, tensor<2304x262144xf16>) outs(%10 : tensor<256x262144xf16>) -> tensor<256x262144xf16>
%12:2 = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0)>, affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0, d1)>], iterator_types = ["parallel", "parallel"]} ins(%11, %7 : tensor<256x262144xf16>, tensor<256xf16>) outs(%9, %8 : tensor<256x262144xf16>, tensor<256x262144xf32>) {
^bb0(%in: f16, %in_0: f16, %out: f16, %out_1: f32):
%13 = arith.addf %in, %in_0 : f16
%14 = arith.extf %13 : f16 to f32
linalg.yield %13, %14 : f16, f32
} -> (tensor<256x262144xf16>, tensor<256x262144xf32>)
flow.dispatch.tensor.store %12#0, %3, offsets = [0, 0], sizes = [256, 262144], strides = [1, 1] : tensor<256x262144xf16> -> !flow.dispatch.tensor<writeonly:tensor<256x262144xf16>>
flow.dispatch.tensor.store %12#1, %4, offsets = [0, 0], sizes = [256, 262144], strides = [1, 1] : tensor<256x262144xf32> -> !flow.dispatch.tensor<writeonly:tensor<256x262144xf32>>
return
}
}
}
}
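For reference, the failing dispatch above is numerically just an f16 matmul followed by a fused bias add whose result is stored twice, once in f16 and once widened to f32 (the arith.addf / arith.extf pair in the linalg.generic). A minimal NumPy sketch of the same computation, with toy shapes standing in for the real 256x262144x2304:

```python
import numpy as np

# Toy shapes in place of the dispatch's 256x262144x2304.
M, N, K = 4, 8, 6

a = np.random.rand(M, K).astype(np.float16)   # lhs, f16
b = np.random.rand(K, N).astype(np.float16)   # rhs, f16
bias = np.random.rand(M).astype(np.float16)   # per-row bias, f16

# linalg.matmul with an f16 accumulator, then the fused linalg.generic:
acc = a @ b                                    # f16 matmul
out_f16 = acc + bias[:, None]                  # arith.addf : f16
out_f32 = out_f16.astype(np.float32)           # arith.extf : f16 to f32

# Both results are written out, matching the two writeonly bindings.
assert out_f16.dtype == np.float16 and out_f32.dtype == np.float32
```

Note the f32 output is an exact widening of the f16 result, not a higher-precision recomputation; that matches the IR, where the extf consumes the addf result.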
Update: I've been playing around with this for a bit, and I think I was stumbling into other issues above.
Here is a simpler approach with fewer CLI flags involved:
To get dispatches from the VAE e2e MLIR:
iree-compile.exe C:\V\SHARK\apps\shark_studio\web\shark_tmp\vae_encode.torch.tempfile --iree-vulkan-target-triple=rdna3-7900-windows --iree-input-type=torch --iree-vm-bytecode-module-output-format=flatbuffer-binary --iree-hal-target-backends=vulkan --iree-util-zero-fill-elided-attrs --iree-opt-const-expr-hoisting=False --iree-codegen-linalg-max-constant-fold-elements=9223372036854775807 --iree-preprocessing-pass-pipeline='builtin.module(func.func(iree-global-opt-detach-elementwise-from-named-ops,iree-preprocessing-convert-conv2d-to-img2col,iree-global-opt-convert-1x1-filter-conv2d-to-matmul,iree-preprocessing-pad-linalg-ops{pad-size=32}))' --compile-to=hal --iree-hal-dump-executable-configurations-to=./vae_dispatches --mlir-print-debuginfo=false --mlir-print-op-on-diagnostic=false 2> out_e2e.txt
Result:
out_e2e.txt
To attempt to compile one of the dispatches shown in the error from the above command:
iree-compile.exe C:\V\SHARK\vae_dispatches\configured_compiled_vae_main_dispatch_196.mlir --iree-input-type=torch --iree-vm-bytecode-module-output-format=flatbuffer-binary --iree-hal-target-backends=vulkan --iree-opt-const-expr-hoisting=False --iree-codegen-linalg-max-constant-fold-elements=9223372036854775807 --iree-vulkan-target-triple=rdna3-7900-windows-msvc --compile-from=executable-configurations 2> out.txt
We have an arith.constant dense_resource<torch_tensor_4_torch.float16> : tensor<4xf16>
op in the IR before converting to SPIR-V. Support for that op is not yet implemented in the SPIR-V conversion.
@MaheshRavishankar I haven't followed the development on dense resources closely, so two questions: is this op expected to appear in the kernel? And if so, shouldn't the resource it references also appear in the dispatch? It is currently missing.
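For readers unfamiliar with the mechanism: a dense_resource constant does not embed its data inline; it stores only a key into a named blob table carried alongside the module, so a dispatch that keeps the op but drops the table ends up with a dangling key. A loose Python analogy (not the MLIR implementation, just the shape of the problem):

```python
import struct

# Analogy: the module carries a named table of raw-byte blobs
# ("dialect resources"); the constant op stores only the key.
resources = {
    # four float16 values packed as little-endian half floats
    "torch_tensor_4_torch.float16": struct.pack("<4e", 1.0, 2.0, 3.0, 4.0),
}

# Materializing the constant means looking the key up and decoding the
# bytes. If the blob table was not copied into the dispatch, this lookup
# would raise KeyError, analogous to the missing resource noted above.
key = "torch_tensor_4_torch.float16"
values = struct.unpack("<4e", resources[key])
assert values == (1.0, 2.0, 3.0, 4.0)
```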
What happened?
Running iree-compile on the VAE torch IR for the SPIR-V backend results in the error shown in the diagnostics above (failed to legalize operation 'arith.constant').
This is believed to be an unhandled large-vector case in the convolution lowerings, but I don't have enough context here to corroborate that.
Steps to reproduce your issue
What component(s) does this issue relate to?
No response
Version information
20240212.799
Additional context
This was functional in mid-December; a known-good version to try would be from 2023/12/12. I can bisect further if needed.