nod-ai / SHARK

SHARK - High Performance Machine Learning Distribution

IREE Compile Error (VAE/LoRA on Vulkan/RX7900) #1925

Open njsharpe opened 8 months ago

njsharpe commented 8 months ago

With the most recent git pull (7963abb) and a clean environment, only some Model/VAE/LoRA combinations produce the compile error below. In previous versions this issue was not present (specifically, 0361db4 was the last commit I had installed). I have tried various combinations but have not been able to pin down a root cause. Any and all help is appreciated.

SystemExit: Error invoking IREE compiler tool iree-compile.exe
Error code: 1
Diagnostics:
<eval_with_key>.33 from D:\Stable\stable-diffusion-shark\shark.venv\Lib\site-packages\torch\fx\experimental\proxy_tensor.py:507 in wrapped:28:12: error: 'func.func' op failed to get lowering configuration
<eval_with_key>.33 from D:\Stable\stable-diffusion-shark\shark.venv\Lib\site-packages\torch\fx\experimental\proxy_tensor.py:507 in wrapped:28:12: error: failed to run translation of source executable to target executable for backend #hal.executable.target<"vulkan", "vulkan-spirv-fb", {spirv.target_env = #spirv.target_env<#spirv.vce<v1.6, [Shader, Float64, Float16, Int64, Int16, Int8, StorageBuffer16BitAccess, StorageUniform16, StoragePushConstant16, StorageBuffer8BitAccess, UniformAndStorageBuffer8BitAccess, StoragePushConstant8, GroupNonUniform, GroupNonUniformVote, GroupNonUniformArithmetic, GroupNonUniformBallot, GroupNonUniformShuffle, GroupNonUniformShuffleRelative, GroupNonUniformClustered, GroupNonUniformQuad, VariablePointers, VariablePointersStorageBuffer, DotProduct, DotProductInputAll, DotProductInput4x8BitPacked, DotProductInput4x8Bit, CooperativeMatrixKHR], [SPV_KHR_16bit_storage, SPV_KHR_8bit_storage, SPV_KHR_integer_dot_product, SPV_KHR_storage_buffer_storage_class, SPV_KHR_variable_pointers, SPV_KHR_cooperative_matrix]>, api=Vulkan, AMD:DiscreteGPU, #spirv.resource_limits<max_compute_shared_memory_size = 65536, max_compute_workgroup_invocations = 1024, max_compute_workgroup_size = [1024, 1024, 1024], subgroup_size = 64, min_subgroup_size = 32, max_subgroup_size = 64, cooperative_matrix_properties_khr = [#spirv.coop_matrix_props_khr<m_size = 16, n_size = 16, k_size = 16, a_type = i8, b_type = i8, c_type = i32, result_type = i32, acc_sat = false, scope = <Subgroup>>, #spirv.coop_matrix_props_khr<m_size = 16, n_size = 16, k_size = 16, a_type = f16, b_type = f16, c_type = f16, result_type = f16, acc_sat = false, scope = <Subgroup>>, #spirv.coop_matrix_props_khr<m_size = 16, n_size = 16, k_size = 16, a_type = f16, b_type = f16, c_type = f32, result_type = f32, acc_sat = false, scope = <Subgroup>>]>>}>
<eval_with_key>.33 from D:\Stable\stable-diffusion-shark\shark.venv\Lib\site-packages\torch\fx\experimental\proxy_tensor.py:507 in wrapped:28:12: error: failed to serialize executables
<eval_with_key>.33 from D:\Stable\stable-diffusion-shark\shark.venv\Lib\site-packages\torch\fx\experimental\proxy_tensor.py:507 in wrapped:33:14: error: 'func.func' op failed to get lowering configuration
<eval_with_key>.33 from D:\Stable\stable-diffusion-shark\shark.venv\Lib\site-packages\torch\fx\experimental\proxy_tensor.py:507 in wrapped:33:14: error: failed to run translation of source executable to target executable for backend #hal.executable.target<"vulkan", "vulkan-spirv-fb", {spirv.target_env = #spirv.target_env<#spirv.vce<v1.6, [Shader, Float64, Float16, Int64, Int16, Int8, StorageBuffer16BitAccess, StorageUniform16, StoragePushConstant16, StorageBuffer8BitAccess, UniformAndStorageBuffer8BitAccess, StoragePushConstant8, GroupNonUniform, GroupNonUniformVote, GroupNonUniformArithmetic, GroupNonUniformBallot, GroupNonUniformShuffle, GroupNonUniformShuffleRelative, GroupNonUniformClustered, GroupNonUniformQuad, VariablePointers, VariablePointersStorageBuffer, DotProduct, DotProductInputAll, DotProductInput4x8BitPacked, DotProductInput4x8Bit, CooperativeMatrixKHR], [SPV_KHR_16bit_storage, SPV_KHR_8bit_storage, SPV_KHR_integer_dot_product, SPV_KHR_storage_buffer_storage_class, SPV_KHR_variable_pointers, SPV_KHR_cooperative_matrix]>, api=Vulkan, AMD:DiscreteGPU, #spirv.resource_limits<max_compute_shared_memory_size = 65536, max_compute_workgroup_invocations = 1024, max_compute_workgroup_size = [1024, 1024, 1024], subgroup_size = 64, min_subgroup_size = 32, max_subgroup_size = 64, cooperative_matrix_properties_khr = [#spirv.coop_matrix_props_khr<m_size = 16, n_size = 16, k_size = 16, a_type = i8, b_type = i8, c_type = i32, result_type = i32, acc_sat = false, scope = <Subgroup>>, #spirv.coop_matrix_props_khr<m_size = 16, n_size = 16, k_size = 16, a_type = f16, b_type = f16, c_type = f16, result_type = f16, acc_sat = false, scope = <Subgroup>>, #spirv.coop_matrix_props_khr<m_size = 16, n_size = 16, k_size = 16, a_type = f16, b_type = f16, c_type = f32, result_type = f32, acc_sat = false, scope = <Subgroup>>]>>}>
<eval_with_key>.33 from D:\Stable\stable-diffusion-shark\shark.venv\Lib\site-packages\torch\fx\experimental\proxy_tensor.py:507 in wrapped:33:14: error: failed to serialize executables

Invoked with:
 iree-compile.exe D:\Stable\stable-diffusion-shark\shark.venv\Lib\site-packages\iree\compiler\tools\..\_mlir_libs\iree-compile.exe .\unet512_1_64_512_512_fp16_tuned_lilymix_v2_SecretSauce610-fp16-pruned_vulkan_torch_linalg.mlir --iree-input-type=tm_tensor --iree-vm-bytecode-module-output-format=flatbuffer-binary --iree-hal-target-backends=vulkan --mlir-print-debuginfo --mlir-print-op-on-diagnostic=false --iree-llvmcpu-target-cpu-features=host --iree-vulkan-target-env=#vk.target_env<v1.3, r(120), [VK_KHR_16bit_storage, VK_KHR_8bit_storage, VK_KHR_shader_float16_int8, VK_KHR_spirv_1_4, VK_KHR_storage_buffer_storage_class, VK_KHR_variable_pointers, VK_EXT_subgroup_size_control, VK_KHR_cooperative_matrix], AMD:DiscreteGPU, #vk.caps< maxComputeSharedMemorySize = 65536, maxComputeWorkGroupInvocations = 1024, maxComputeWorkGroupSize = dense<[1024, 1024, 1024]>: vector<3xi32>, subgroupSize = 64, subgroupFeatures = 255: i32, minSubgroupSize = 32, maxSubgroupSize = 64, shaderFloat16 = unit, shaderFloat64 = unit, shaderInt8 = unit, shaderInt16 = unit, shaderInt64 = unit, storageBuffer16BitAccess = unit, storagePushConstant16 = unit, uniformAndStorageBuffer16BitAccess = unit, storageBuffer8BitAccess = unit, storagePushConstant8 = unit, uniformAndStorageBuffer8BitAccess = unit, variablePointers = unit, variablePointersStorageBuffer = unit, cooperativeMatrixPropertiesKHR = [#vk.coop_matrix_props<mSize = 16, nSize = 16, kSize = 16, aType = f16, bType = f16, cType = f16, resultType = f16, accSat = false, scope = #vk.scope<Subgroup>>], shaderIntegerDotProduct = unit >> --iree-stream-resource-max-allocation-size=4294967295 --iree-vm-bytecode-module-strip-source-map=true --iree-util-zero-fill-elided-attrs --iree-opt-strip-assertions=true --verify=false -iree-vulkan-target-triple=rdna3-7900-windows

Need more information? Set IREE_SAVE_TEMPS=/some/dir in your environment to save all artifacts and reproducers.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:\Stable\stable-diffusion-shark\shark.venv\Lib\site-packages\starlette\routing.py", line 686, in lifespan
    await receive()
  File "D:\Stable\stable-diffusion-shark\shark.venv\Lib\site-packages\uvicorn\lifespan\on.py", line 137, in receive
    return await self.receive_queue.get()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.11_3.11.1776.0_x64__qbz5n2kfra8p0\Lib\asyncio\queues.py", line 158, in get
    await getter
asyncio.exceptions.CancelledError
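
In case it helps with triage, here is a minimal sketch of re-running the failing compile with IREE_SAVE_TEMPS set so the artifacts and reproducers are preserved. The flags shown are copied from the "Invoked with" log above (most are elided for brevity), and the temp directory is just an example path:

# Sketch: re-run the failing iree-compile invocation with IREE_SAVE_TEMPS set.
# Flags are taken from the "Invoked with" log above; D:\iree-temps is only an
# example directory.
import os
import subprocess

env = dict(os.environ, IREE_SAVE_TEMPS=r"D:\iree-temps")

subprocess.run(
    [
        "iree-compile.exe",
        r".\unet512_1_64_512_512_fp16_tuned_lilymix_v2_SecretSauce610-fp16-pruned_vulkan_torch_linalg.mlir",
        "--iree-input-type=tm_tensor",
        "--iree-hal-target-backends=vulkan",
        "-iree-vulkan-target-triple=rdna3-7900-windows",
        # ...plus the remaining flags exactly as listed in the invocation above...
    ],
    env=env,
    check=True,
)
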
stellaraccident commented 8 months ago

Thank you. Will get this looked at today.

(I also note that turbine isn't preserving source location properly and that can be fixed)

monorimet commented 8 months ago

Thanks for the heads up. I'm looking into this right now and trying to reproduce. If there are any patterns in which unet/vae/lora combinations trigger it, let me know.

njsharpe commented 8 months ago

@monorimet The specific combination used in this example was as follows:

Model: Lily Mix v2

VAE: kl-f8-animev2

LoRA: Secret Sauce - 610

Each of these links is a civitai download link directly to the versions used. That being said, many other combinations of Model, VAE and LoRA have caused this issue.

monorimet commented 8 months ago

Thanks, I'll find out what's confusing the pipeline.

monorimet commented 8 months ago

Something changes in the unet model when we use LoRAs. It seems to work fine without the tunings, but with tunings I reproduce your error.

I'll hypothesize that the bad combo is RDNA3 tuned + SD1.X UNet + LoRA weights since the custom VAE shouldn't impact UNet compilation. I'll take your word that only certain models/loras do this and try to figure out what's happening exactly.

njsharpe commented 8 months ago

We can operate under the assumption that the VAE does not affect anything, since I use a VAE in every generation and only some of them fail. The tuned vs. untuned theory also holds up: using the same setup after a --clear_all, if I remove the LoRA and re-generate with your prompt, the error does not occur. I have also verified that, on first generation (untuned), SHARK functions as expected.

EDIT: Removed dumb question. As a temporary workaround, I will use the --no_use_tuned flag to avoid the issue altogether.
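
For anyone else hitting this in the meantime, a rough sketch of that workaround is below. The web UI entry-point path is an assumption about how SHARK is normally launched; --clear_all and --no_use_tuned are the flags discussed in this thread:

# Hypothetical relaunch showing the workaround flags discussed above.
# "apps/stable_diffusion/web/index.py" is an assumed entry point; substitute
# whatever script you normally launch SHARK with.
import subprocess
import sys

subprocess.run(
    [
        sys.executable,
        "apps/stable_diffusion/web/index.py",  # assumption: usual web UI entry point
        "--clear_all",      # drop previously generated/tuned artifacts
        "--no_use_tuned",   # skip tuned configs, avoiding the failing lowering
    ],
    check=True,
)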

monorimet commented 8 months ago

Probably the tuned LoRA configuration; I don't see this without the LoRA or without tuning. We probably have to make some tuning config changes. I'll narrow this down, disable tunings for the failure modes, and file a separate issue for a gentle rework of the tunings to better handle some of these custom model configs.

monorimet commented 8 months ago

It looks like this happens without LoRAs now, too... I've got a minimal reproducer with the following IR:

https://storage.googleapis.com/shark-public/ean/unet_lc/module_forward_dispatch_5.mlir

%7 = linalg.matmul_transpose_b {compilation_info = #iree_codegen.compilation_info<lowering_config = <tile_sizes = [[32, 16], [16, 16], [0, 0, 64], [16, 16, 16]]>, translation_info = <SPIRVCooperativeMatrixVectorize pipeline_depth = 1 store_stage = 0>, workgroup_size = [32, 2, 1], subgroup_size = 32>} ins(%3, %4 : tensor<32x320xf16>, tensor<1280x320xf16>) outs(%6 : tensor<32x1280xf16>) -> tensor<32x1280xf16> loc("aten::addmm"("<eval_with_key>.12 from C:\\V\\SHARK\\shark.venv\\Lib\\site-packages\\torch\\fx\\experimental\\proxy_tensor.py:506 in wrapped":28:12))

The error (also on the CI now):

<eval_with_key>.12 from C:\actions-runner\w\SHARK\SHARK\shark.venv\Lib\site-packages\torch\fx\experimental\proxy_tensor.py:507 in wrapped:28:12: error: 'func.func' op failed to get lowering configuration

<eval_with_key>.12 from C:\actions-runner\w\SHARK\SHARK\shark.venv\Lib\site-packages\torch\fx\experimental\proxy_tensor.py:507 in wrapped:28:12: error: failed to run translation of source executable to target executable for backend #hal.executable.target<"vulkan", "vulkan-spirv-fb", {spirv.target_env = #spirv.target_env<#spirv.vce<v1.6, [Shader, Float64, Float16, Int64, Int16, Int8, StorageBuffer16BitAccess, StorageUniform16, StoragePushConstant16, StorageBuffer8BitAccess, UniformAndStorageBuffer8BitAccess, StoragePushConstant8, GroupNonUniform, GroupNonUniformVote, GroupNonUniformArithmetic, GroupNonUniformBallot, GroupNonUniformShuffle, GroupNonUniformShuffleRelative, GroupNonUniformClustered, GroupNonUniformQuad, VariablePointers, VariablePointersStorageBuffer, DotProduct, DotProductInputAll, DotProductInput4x8BitPacked, DotProductInput4x8Bit, CooperativeMatrixKHR], [SPV_KHR_16bit_storage, SPV_KHR_8bit_storage, SPV_KHR_integer_dot_product, SPV_KHR_storage_buffer_storage_class, SPV_KHR_variable_pointers, SPV_KHR_cooperative_matrix]>, api=Vulkan, AMD:DiscreteGPU, #spirv.resource_limits<max_compute_shared_memory_size = 65536, max_compute_workgroup_invocations = 1024, max_compute_workgroup_size = [1024, 1024, 1024], subgroup_size = 64, min_subgroup_size = 32, max_subgroup_size = 64, cooperative_matrix_properties_khr = [#spirv.coop_matrix_props_khr<m_size = 16, n_size = 16, k_size = 16, a_type = i8, b_type = i8, c_type = i32, result_type = i32, acc_sat = false, scope = <Subgroup>>, #spirv.coop_matrix_props_khr<m_size = 16, n_size = 16, k_size = 16, a_type = f16, b_type = f16, c_type = f16, result_type = f16, acc_sat = false, scope = <Subgroup>>, #spirv.coop_matrix_props_khr<m_size = 16, n_size = 16, k_size = 16, a_type = f16, b_type = f16, c_type = f32, result_type = f32, acc_sat = false, scope = <Subgroup>>]>>}>

Running the reproducer CLI input gives the dispatch numbers 5, 7, 22, [...] which are matmul ops:

https://storage.googleapis.com/shark-public/ean/unet_lc/module_forward_dispatch_7.mlir

https://storage.googleapis.com/shark-public/ean/unet_lc/module_forward_dispatch_22.mlir
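
In case it saves someone a step, here is a rough sketch of pulling one of those dispatches down and compiling it on its own against the same Vulkan target. Treating the file as an executable-source dump (and therefore using --compile-mode=hal-executable) is an assumption about how it was produced:

# Sketch: fetch the dispatch reproducer and try compiling it standalone.
# Assumes the .mlir is a dumped executable source; target flags mirror the
# failing invocation earlier in this issue.
import subprocess
import urllib.request

url = "https://storage.googleapis.com/shark-public/ean/unet_lc/module_forward_dispatch_5.mlir"
local = "module_forward_dispatch_5.mlir"
urllib.request.urlretrieve(url, local)

subprocess.run(
    [
        "iree-compile",
        local,
        "--compile-mode=hal-executable",   # assumption: file is an executable-source dump
        "--iree-hal-target-backends=vulkan",
        "-iree-vulkan-target-triple=rdna3-7900-windows",
        "-o", "module_forward_dispatch_5.bin",
    ],
    check=True,
)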

one-lithe-rune commented 8 months ago

I'm currently getting something that looks the same as this on anything tuned, though I think with a differing eval key .<number>. (I'm also getting the arith.maxf complaint on everything.)

Probably irrelevant, but I noticed that I also get this on untuned SD 2.1-base if I set the wrong --max_length, i.e. setting it to 77 rather than 64.

stellaraccident commented 8 months ago

@monorimet just checking in: is this making progress?