intel / mlir-extensions

Intel® Extension for MLIR. A staging ground for MLIR dialects and tools for Intel devices using the MLIR toolchain.

[Triton] To inline the VC intrinsic in the SIMT kernel. #658

Open chengjunlu opened 1 year ago

chengjunlu commented 1 year ago

Background

The Triton kernel is generated as a predominantly SIMT SPIR-V kernel, because some components can only be used in the SIMT paradigm; for example, the Intel math library is available only in a SIMT version. For a small portion of the kernel, however, we may need the SIMD paradigm for performance or functionality; for example, we may want to use VC (vector compute) intrinsics in the Triton kernel.

We are working on enabling the SIMT->SIMD calling convention for Triton kernels, following the invoke_simd extension: https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_oneapi_invoke_simd.asciidoc

With that in place, we can generate SIMD-paradigm code for parts of the kernel.
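As a rough illustration of what the SIMT->SIMD call means for one subgroup, here is a minimal Python sketch. It is only a model of the semantics, not the actual lowering: `invoke_simd` and `simd_callee` are hypothetical names, and the argument order (uniform pointer, per-work-item float, uniform offset) is an assumption taken from the function types in the DPCPP dump in this issue. Each of the 16 work-items contributes one scalar, the SIMD function runs once on the gathered vector, and element i of the result goes back to work-item i.

```python
from typing import Callable, List

SUBGROUP_SIZE = 16  # assumption: same subgroup size as the DPCPP example


def simd_callee(base: List[float], x: List[float], offset: int) -> List[float]:
    """Hypothetical SIMD function: executed once for the whole subgroup."""
    block = base[offset:offset + len(x)]      # one block load covering all lanes
    return [b + v for b, v in zip(block, x)]  # element-wise vector add


def invoke_simd(
    callee: Callable[[List[float], List[float], int], List[float]],
    base: List[float],
    lane_values: List[float],
    offset: int,
) -> List[float]:
    """Model of one SIMT->SIMD call.

    `base` and `offset` are uniform (identical for every work-item);
    `lane_values[i]` is the scalar passed by work-item i.
    """
    if len(lane_values) != SUBGROUP_SIZE:
        raise ValueError("expected one value per work-item in the subgroup")
    result = callee(base, list(lane_values), offset)  # a single SIMD call
    return list(result)  # result[i] is what work-item i receives


if __name__ == "__main__":
    memory = [float(i) for i in range(32)]
    per_lane = [1.0] * SUBGROUP_SIZE
    print(invoke_simd(simd_callee, memory, per_lane, 16))
```

The point of the model is the shape change at the call boundary: scalar per work-item on the SIMT side, a 16-wide vector on the SIMD side, with uniform arguments forwarded unchanged.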

The requirements

We used the SPIR-V that DPCPP generates for a mixed SIMT+SIMD kernel as a reference. We need to reference the SPIR-V kernel function directly through a function pointer, as described by the SPV_INTEL_function_pointers extension: https://github.com/intel/llvm/blob/sycl/sycl/doc/design/spirv-extensions/SPV_INTEL_function_pointers.asciidoc

We need the SPIRV dialect to support this.
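To make the requirement a bit more concrete, the following Python sketch scans a SPIR-V disassembly for the capabilities and extensions that the DPCPP-generated module declares for this feature. The required sets are an assumption read off the dump in this issue (they are not a definitive list of what the dialect must support), and `missing_features` is a hypothetical helper name.

```python
import re
from typing import Dict, List

# Assumption: names taken from the DPCPP example dump, not from a spec review.
REQUIRED_CAPABILITIES = {"IndirectReferencesINTEL", "VectorComputeINTEL"}
REQUIRED_EXTENSIONS = {"SPV_INTEL_function_pointers", "SPV_INTEL_vector_compute"}


def missing_features(disassembly: str) -> Dict[str, List[str]]:
    """Return the required capabilities/extensions not declared in the text."""
    caps = set(re.findall(r"OpCapability\s+(\w+)", disassembly))
    exts = set(re.findall(r'OpExtension\s+"([^"]+)"', disassembly))
    return {
        "capabilities": sorted(REQUIRED_CAPABILITIES - caps),
        "extensions": sorted(REQUIRED_EXTENSIONS - exts),
    }


if __name__ == "__main__":
    sample = """
        OpCapability Kernel
        OpCapability IndirectReferencesINTEL
        OpExtension "SPV_INTEL_function_pointers"
    """
    print(missing_features(sample))
```

Running a check like this on the output of the SPIR-V dialect serializer would show quickly whether the module declares what the reference module declares.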

chengjunlu commented 1 year ago

Disassembly of the SPIR-V IR from the DPCPP example:

; SPIR-V
; Version: 1.4
; Generator: Khronos LLVM/SPIR-V Translator; 14
; Bound: 229
; Schema: 0
               OpCapability Addresses ; 0x00000014
               OpCapability Linkage ; 0x0000001c
               OpCapability Kernel ; 0x00000024
               OpCapability Vector16 ; 0x0000002c
               OpCapability Int64 ; 0x00000034
               OpCapability GenericPointer ; 0x0000003c
               OpCapability Int8 ; 0x00000044
               OpCapability SubgroupDispatch ; 0x0000004c
               OpCapability IndirectReferencesINTEL ; 0x00000054
               OpCapability VectorComputeINTEL ; 0x0000005c
               OpCapability ExpectAssumeKHR ; 0x00000064
               OpCapability MemoryAccessAliasingINTEL ; 0x0000006c
               OpCapability OptNoneINTEL ; 0x00000074
               OpExtension "SPV_INTEL_function_pointers" ; 0x0000007c
               OpExtension "SPV_INTEL_memory_access_aliasing" ; 0x0000009c
               OpExtension "SPV_INTEL_optnone" ; 0x000000c4
               OpExtension "SPV_INTEL_vector_compute" ; 0x000000dc
               OpExtension "SPV_KHR_expect_assume" ; 0x000000fc
          %1 = OpExtInstImport "OpenCL.std" ; 0x00000118
               OpMemoryModel Physical64 OpenCL ; 0x0000012c
               OpEntryPoint Kernel %43 "_ZTSZZ4testILb1EEbvENKUlRN4sycl3_V17handlerEE_clES3_EUlNS1_7nd_itemILi1EEEE_" ; 0x00000138
               OpExecutionMode %43 ContractionOff ; 0x00000194
               OpExecutionMode %43 SubgroupSize 16 ; 0x000001a0
               OpSource Unknown 100000 ; 0x000001b0
               OpName %__spirv_BuiltInSubgroupId "__spirv_BuiltInSubgroupId" ; 0x000001bc
               OpName %__spirv_BuiltInSubgroupLocalInvocationId "__spirv_BuiltInSubgroupLocalInvocationId" ; 0x000001e0
               OpName %__spirv_BuiltInWorkgroupId "__spirv_BuiltInWorkgroupId" ; 0x00000214
               OpName %__spirv_BuiltInGlobalLinearId "__spirv_BuiltInGlobalLinearId" ; 0x00000238
               OpName %__spirv_BuiltInWorkgroupSize "__spirv_BuiltInWorkgroupSize" ; 0x00000260
               OpName %_Z33__regcall3____builtin_invoke_simdILb1EfPFNSt12experimental4simdIfNS0_10__simd_abiILNS0_12_StorageKindE2ELi16EEEEERFS5_PfS5_iES6_S5_jEJPS7_S6_fjEvET0_T1_DpT2__2 "_Z33__regcall3____builtin_invoke_simdILb1EfPFNSt12experimental4simdIfNS0_10__simd_abiILNS0_12_StorageKindE2ELi16EEEEERFS5_PfS5_iES6_S5_jEJPS7_S6_fjEvET0_T1_DpT2__2" ; 0x00000288
               OpName %_Z33__regcall3____builtin_invoke_simdILb1EfPFNSt12experimental4simdIfNS0_10__simd_abiILNS0_12_StorageKindE2ELi16EEEEERFS5_PfS5_iES6_S5_jEJPS7_S6_fjEvET0_T1_DpT2__4 "_Z33__regcall3____builtin_invoke_simdILb1EfPFNSt12experimental4simdIfNS0_10__simd_abiILNS0_12_StorageKindE2ELi16EEEEERFS5_PfS5_iES6_S5_jEJPS7_S6_fjEvET0_T1_DpT2__4" ; 0x00000334
               OpName %_Z33__regcall3____builtin_invoke_simdILb1EvPFvRFvPfNSt12experimental4simdIfNS1_10__simd_abiILNS1_12_StorageKindE2ELi16EEEEEiES0_S6_jEJPS7_S0_fjEvET0_T1_DpT2__6 "_Z33__regcall3____builtin_invoke_simdILb1EvPFvRFvPfNSt12experimental4simdIfNS1_10__simd_abiILNS1_12_StorageKindE2ELi16EEEEEiES0_S6_jEJPS7_S0_fjEvET0_T1_DpT2__6" ; 0x000003e0
               OpName %llvm_genx_svm_block_ld_unaligned_v16f32_i64 "llvm.genx.svm.block.ld.unaligned.v16f32.i64" ; 0x00000488
               OpName %__itt_offload_wi_start_wrapper "__itt_offload_wi_start_wrapper" ; 0x000004bc
               OpName %_ZN4sycl3_V13ext6oneapi12experimental6detail33__regcall3__simd_func_call_helperILi16ERFNSt12experimental4simdIfNS6_10__simd_abiILNS6_12_StorageKindE2ELi16EEEEEPfSB_iEJNS3_7uniformISC_EEfNSF_IjEEEEENSt13invoke_resultIT0_JDpNS4_9spmd2simdIT1_XT_EvE4typeEEE4typeESJ_SO__3 "_ZN4sycl3_V13ext6oneapi12experimental6detail33__regcall3__simd_func_call_helperILi16ERFNSt12experimental4simdIfNS6_10__simd_abiILNS6_12_StorageKindE2ELi16EEEEEPfSB_iEJNS3_7uniformISC_EEfNSF_IjEEEEENSt13invoke_resultIT0_JDpNS4_9spmd2simdIT1_XT_EvE4typeEEE4typeESJ_SO__3" ; 0x000004e4
               OpName %_ZN4sycl3_V13ext6oneapi12experimental6detail33__regcall3__simd_func_call_helperILi16ERFNSt12experimental4simdIfNS6_10__simd_abiILNS6_12_StorageKindE2ELi16EEEEEPfSB_iEJNS3_7uniformISC_EEfNSF_IjEEEEENSt13invoke_resultIT0_JDpNS4_9spmd2simdIT1_XT_EvE4typeEEE4typeESJ_SO__1 "_ZN4sycl3_V13ext6oneapi12experimental6detail33__regcall3__simd_func_call_helperILi16ERFNSt12experimental4simdIfNS6_10__simd_abiILNS6_12_StorageKindE2ELi16EEEEEPfSB_iEJNS3_7uniformISC_EEfNSF_IjEEEEENSt13invoke_resultIT0_JDpNS4_9spmd2simdIT1_XT_EvE4typeEEE4typeESJ_SO__1" ; 0x000005fc
               OpName %_ZN4sycl3_V13ext6oneapi12experimental6detail33__regcall3__simd_func_call_helperILi16ERFvPfNSt12experimental4simdIfNS7_10__simd_abiILNS7_12_StorageKindE2ELi16EEEEEiEJNS3_7uniformIS6_EEfNSF_IjEEEEENSt13invoke_resultIT0_JDpNS4_9spmd2simdIT1_XT_EvE4typeEEE4typeESJ_SO__5 "_ZN4sycl3_V13ext6oneapi12experimental6detail33__regcall3__simd_func_call_helperILi16ERFvPfNSt12experimental4simdIfNS7_10__simd_abiILNS7_12_StorageKindE2ELi16EEEEEiEJNS3_7uniformIS6_EEfNSF_IjEEEEENSt13invoke_resultIT0_JDpNS4_9spmd2simdIT1_XT_EvE4typeEEE4typeESJ_SO__5" ; 0x00000714
               OpName %__itt_offload_wi_finish_wrapper "__itt_offload_wi_finish_wrapper" ; 0x00000828
               OpName %__itt_offload_wi_start_stub "__itt_offload_wi_start_stub" ; 0x00000850
               OpName %__itt_offload_wi_finish_stub "__itt_offload_wi_finish_stub" ; 0x00000874
               OpName %_Z24__regcall3__SIMD_CALLEE1PfNSt12experimental4simdIfNS0_10__simd_abiILNS0_12_StorageKindE2ELi16EEEEEi "_Z24__regcall3__SIMD_CALLEE1PfNSt12experimental4simdIfNS0_10__simd_abiILNS0_12_StorageKindE2ELi16EEEEEi" ; 0x0000089c
               OpName %_esimd ".esimd" ; 0x0000090c
               OpName %_Z24__regcall3__SIMD_CALLEE2PfNSt12experimental4simdIfNS0_10__simd_abiILNS0_12_StorageKindE2ELi16EEEEEi "_Z24__regcall3__SIMD_CALLEE2PfNSt12experimental4simdIfNS0_10__simd_abiILNS0_12_StorageKindE2ELi16EEEEEi" ; 0x0000091c
               OpName %_esimd_0 ".esimd" ; 0x0000098c
               OpName %_Z28__regcall3__SIMD_CALLEE_VOIDPfNSt12experimental4simdIfNS0_10__simd_abiILNS0_12_StorageKindE2ELi16EEEEEi "_Z28__regcall3__SIMD_CALLEE_VOIDPfNSt12experimental4simdIfNS0_10__simd_abiILNS0_12_StorageKindE2ELi16EEEEEi" ; 0x0000099c
               OpName %_esimd_1 ".esimd" ; 0x00000a10
               OpName %_esimd_i ".esimd.i" ; 0x00000a20
               OpName %_esimd_i_0 ".esimd.i" ; 0x00000a34
               OpName %_esimd_i_1 ".esimd.i" ; 0x00000a48
         %52 = OpAliasDomainDeclINTEL ; 0x00000a5c
         %53 = OpAliasScopeDeclINTEL %52 ; 0x00000a64
         %54 = OpAliasDomainDeclINTEL ; 0x00000a70
         %55 = OpAliasScopeDeclINTEL %54 ; 0x00000a78
         %56 = OpAliasDomainDeclINTEL ; 0x00000a84
         %57 = OpAliasScopeDeclINTEL %56 ; 0x00000a8c
         %58 = OpAliasScopeListDeclINTEL %53 %55 %57 ; 0x00000a98
         %61 = OpAliasDomainDeclINTEL ; 0x00000aac
         %62 = OpAliasScopeDeclINTEL %61 ; 0x00000ab4
         %63 = OpAliasScopeListDeclINTEL %62 ; 0x00000ac0
         %71 = OpAliasDomainDeclINTEL ; 0x00000acc
         %72 = OpAliasScopeDeclINTEL %71 ; 0x00000ad4
         %73 = OpAliasScopeListDeclINTEL %72 ; 0x00000ae0
               OpDecorate %__spirv_BuiltInSubgroupId LinkageAttributes "__spirv_BuiltInSubgroupId" Import ; 0x00000aec
               OpDecorate %__spirv_BuiltInSubgroupId Constant ; 0x00000b18
               OpDecorate %__spirv_BuiltInSubgroupId BuiltIn SubgroupId ; 0x00000b24
               OpDecorate %__spirv_BuiltInSubgroupId Alignment 4 ; 0x00000b34
               OpDecorate %__spirv_BuiltInSubgroupLocalInvocationId LinkageAttributes "__spirv_BuiltInSubgroupLocalInvocationId" Import ; 0x00000b44
               OpDecorate %__spirv_BuiltInSubgroupLocalInvocationId Constant ; 0x00000b80
               OpDecorate %__spirv_BuiltInSubgroupLocalInvocationId BuiltIn SubgroupLocalInvocationId ; 0x00000b8c
               OpDecorate %__spirv_BuiltInSubgroupLocalInvocationId Alignment 4 ; 0x00000b9c
               OpDecorate %__spirv_BuiltInWorkgroupId LinkageAttributes "__spirv_BuiltInWorkgroupId" Import ; 0x00000bac
               OpDecorate %__spirv_BuiltInWorkgroupId Constant ; 0x00000bd8
               OpDecorate %__spirv_BuiltInWorkgroupId BuiltIn WorkgroupId ; 0x00000be4
               OpDecorate %__spirv_BuiltInWorkgroupId Alignment 32 ; 0x00000bf4
               OpDecorate %__spirv_BuiltInGlobalLinearId LinkageAttributes "__spirv_BuiltInGlobalLinearId" Import ; 0x00000c04
               OpDecorate %__spirv_BuiltInGlobalLinearId Constant ; 0x00000c34
               OpDecorate %__spirv_BuiltInGlobalLinearId BuiltIn GlobalLinearId ; 0x00000c40
               OpDecorate %__spirv_BuiltInGlobalLinearId Alignment 8 ; 0x00000c50
               OpDecorate %__spirv_BuiltInWorkgroupSize LinkageAttributes "__spirv_BuiltInWorkgroupSize" Import ; 0x00000c60
               OpDecorate %__spirv_BuiltInWorkgroupSize Constant ; 0x00000c90
               OpDecorate %__spirv_BuiltInWorkgroupSize BuiltIn WorkgroupSize ; 0x00000c9c
               OpDecorate %__spirv_BuiltInWorkgroupSize Alignment 32 ; 0x00000cac
               OpDecorate %_Z33__regcall3____builtin_invoke_simdILb1EfPFNSt12experimental4simdIfNS0_10__simd_abiILNS0_12_StorageKindE2ELi16EEEEERFS5_PfS5_iES6_S5_jEJPS7_S6_fjEvET0_T1_DpT2__2 LinkageAttributes "_Z33__regcall3____builtin_invoke_simdILb1EfPFNSt12experimental4simdIfNS0_10__simd_abiILNS0_12_StorageKindE2ELi16EEEEERFS5_PfS5_iES6_S5_jEJPS7_S6_fjEvET0_T1_DpT2__2" Import ; 0x00000cbc
               OpDecorate %_Z33__regcall3____builtin_invoke_simdILb1EfPFNSt12experimental4simdIfNS0_10__simd_abiILNS0_12_StorageKindE2ELi16EEEEERFS5_PfS5_iES6_S5_jEJPS7_S6_fjEvET0_T1_DpT2__4 LinkageAttributes "_Z33__regcall3____builtin_invoke_simdILb1EfPFNSt12experimental4simdIfNS0_10__simd_abiILNS0_12_StorageKindE2ELi16EEEEERFS5_PfS5_iES6_S5_jEJPS7_S6_fjEvET0_T1_DpT2__4" Import ; 0x00000d70
               OpDecorate %_Z33__regcall3____builtin_invoke_simdILb1EvPFvRFvPfNSt12experimental4simdIfNS1_10__simd_abiILNS1_12_StorageKindE2ELi16EEEEEiES0_S6_jEJPS7_S0_fjEvET0_T1_DpT2__6 LinkageAttributes "_Z33__regcall3____builtin_invoke_simdILb1EvPFvRFvPfNSt12experimental4simdIfNS1_10__simd_abiILNS1_12_StorageKindE2ELi16EEEEEiES0_S6_jEJPS7_S0_fjEvET0_T1_DpT2__6" Import ; 0x00000e24
               OpDecorate %llvm_genx_svm_block_ld_unaligned_v16f32_i64 LinkageAttributes "llvm.genx.svm.block.ld.unaligned.v16f32.i64" Import ; 0x00000ed4
               OpDecorate %llvm_genx_svm_block_ld_unaligned_v16f32_i64 VectorComputeFunctionINTEL ; 0x00000f10
               OpDecorate %44 Alignment 4 ; 0x00000f1c
               OpDecorate %45 Alignment 4 ; 0x00000f2c
               OpDecorate %46 Alignment 4 ; 0x00000f3c
               OpDecorate %__itt_offload_wi_start_wrapper LinkageAttributes "__itt_offload_wi_start_wrapper" Export ; 0x00000f4c
               OpDecorate %77 NoSignedWrap ; 0x00000f7c
               OpDecorate %77 NoUnsignedWrap ; 0x00000f88
               OpDecorate %_ZN4sycl3_V13ext6oneapi12experimental6detail33__regcall3__simd_func_call_helperILi16ERFNSt12experimental4simdIfNS6_10__simd_abiILNS6_12_StorageKindE2ELi16EEEEEPfSB_iEJNS3_7uniformISC_EEfNSF_IjEEEEENSt13invoke_resultIT0_JDpNS4_9spmd2simdIT1_XT_EvE4typeEEE4typeESJ_SO__3 LinkageAttributes "_ZN4sycl3_V13ext6oneapi12experimental6detail33__regcall3__simd_func_call_helperILi16ERFNSt12experimental4simdIfNS6_10__simd_abiILNS6_12_StorageKindE2ELi16EEEEEPfSB_iEJNS3_7uniformISC_EEfNSF_IjEEEEENSt13invoke_resultIT0_JDpNS4_9spmd2simdIT1_XT_EvE4typeEEE4typeESJ_SO__3" Export ; 0x00000f94
               OpDecorate %_ZN4sycl3_V13ext6oneapi12experimental6detail33__regcall3__simd_func_call_helperILi16ERFNSt12experimental4simdIfNS6_10__simd_abiILNS6_12_StorageKindE2ELi16EEEEEPfSB_iEJNS3_7uniformISC_EEfNSF_IjEEEEENSt13invoke_resultIT0_JDpNS4_9spmd2simdIT1_XT_EvE4typeEEE4typeESJ_SO__3 ReferencedIndirectlyINTEL ; 0x000010b4
               OpDecorate %_ZN4sycl3_V13ext6oneapi12experimental6detail33__regcall3__simd_func_call_helperILi16ERFNSt12experimental4simdIfNS6_10__simd_abiILNS6_12_StorageKindE2ELi16EEEEEPfSB_iEJNS3_7uniformISC_EEfNSF_IjEEEEENSt13invoke_resultIT0_JDpNS4_9spmd2simdIT1_XT_EvE4typeEEE4typeESJ_SO__3 StackCallINTEL ; 0x000010c0
               OpDecorate %_ZN4sycl3_V13ext6oneapi12experimental6detail33__regcall3__simd_func_call_helperILi16ERFNSt12experimental4simdIfNS6_10__simd_abiILNS6_12_StorageKindE2ELi16EEEEEPfSB_iEJNS3_7uniformISC_EEfNSF_IjEEEEENSt13invoke_resultIT0_JDpNS4_9spmd2simdIT1_XT_EvE4typeEEE4typeESJ_SO__3 VectorComputeFunctionINTEL ; 0x000010cc
               OpDecorate %_ZN4sycl3_V13ext6oneapi12experimental6detail33__regcall3__simd_func_call_helperILi16ERFNSt12experimental4simdIfNS6_10__simd_abiILNS6_12_StorageKindE2ELi16EEEEEPfSB_iEJNS3_7uniformISC_EEfNSF_IjEEEEENSt13invoke_resultIT0_JDpNS4_9spmd2simdIT1_XT_EvE4typeEEE4typeESJ_SO__1 LinkageAttributes "_ZN4sycl3_V13ext6oneapi12experimental6detail33__regcall3__simd_func_call_helperILi16ERFNSt12experimental4simdIfNS6_10__simd_abiILNS6_12_StorageKindE2ELi16EEEEEPfSB_iEJNS3_7uniformISC_EEfNSF_IjEEEEENSt13invoke_resultIT0_JDpNS4_9spmd2simdIT1_XT_EvE4typeEEE4typeESJ_SO__1" Export ; 0x000010d8
               OpDecorate %_ZN4sycl3_V13ext6oneapi12experimental6detail33__regcall3__simd_func_call_helperILi16ERFNSt12experimental4simdIfNS6_10__simd_abiILNS6_12_StorageKindE2ELi16EEEEEPfSB_iEJNS3_7uniformISC_EEfNSF_IjEEEEENSt13invoke_resultIT0_JDpNS4_9spmd2simdIT1_XT_EvE4typeEEE4typeESJ_SO__1 ReferencedIndirectlyINTEL ; 0x000011f8
               OpDecorate %_ZN4sycl3_V13ext6oneapi12experimental6detail33__regcall3__simd_func_call_helperILi16ERFNSt12experimental4simdIfNS6_10__simd_abiILNS6_12_StorageKindE2ELi16EEEEEPfSB_iEJNS3_7uniformISC_EEfNSF_IjEEEEENSt13invoke_resultIT0_JDpNS4_9spmd2simdIT1_XT_EvE4typeEEE4typeESJ_SO__1 StackCallINTEL ; 0x00001204
               OpDecorate %_ZN4sycl3_V13ext6oneapi12experimental6detail33__regcall3__simd_func_call_helperILi16ERFNSt12experimental4simdIfNS6_10__simd_abiILNS6_12_StorageKindE2ELi16EEEEEPfSB_iEJNS3_7uniformISC_EEfNSF_IjEEEEENSt13invoke_resultIT0_JDpNS4_9spmd2simdIT1_XT_EvE4typeEEE4typeESJ_SO__1 VectorComputeFunctionINTEL ; 0x00001210
               OpDecorate %94 FPFastMathMode Fast ; 0x0000121c
               OpDecorate %_ZN4sycl3_V13ext6oneapi12experimental6detail33__regcall3__simd_func_call_helperILi16ERFvPfNSt12experimental4simdIfNS7_10__simd_abiILNS7_12_StorageKindE2ELi16EEEEEiEJNS3_7uniformIS6_EEfNSF_IjEEEEENSt13invoke_resultIT0_JDpNS4_9spmd2simdIT1_XT_EvE4typeEEE4typeESJ_SO__5 LinkageAttributes "_ZN4sycl3_V13ext6oneapi12experimental6detail33__regcall3__simd_func_call_helperILi16ERFvPfNSt12experimental4simdIfNS7_10__simd_abiILNS7_12_StorageKindE2ELi16EEEEEiEJNS3_7uniformIS6_EEfNSF_IjEEEEENSt13invoke_resultIT0_JDpNS4_9spmd2simdIT1_XT_EvE4typeEEE4typeESJ_SO__5" Export ; 0x0000122c
               OpDecorate %_ZN4sycl3_V13ext6oneapi12experimental6detail33__regcall3__simd_func_call_helperILi16ERFvPfNSt12experimental4simdIfNS7_10__simd_abiILNS7_12_StorageKindE2ELi16EEEEEiEJNS3_7uniformIS6_EEfNSF_IjEEEEENSt13invoke_resultIT0_JDpNS4_9spmd2simdIT1_XT_EvE4typeEEE4typeESJ_SO__5 ReferencedIndirectlyINTEL ; 0x00001348
               OpDecorate %_ZN4sycl3_V13ext6oneapi12experimental6detail33__regcall3__simd_func_call_helperILi16ERFvPfNSt12experimental4simdIfNS7_10__simd_abiILNS7_12_StorageKindE2ELi16EEEEEiEJNS3_7uniformIS6_EEfNSF_IjEEEEENSt13invoke_resultIT0_JDpNS4_9spmd2simdIT1_XT_EvE4typeEEE4typeESJ_SO__5 StackCallINTEL ; 0x00001354
               OpDecorate %_ZN4sycl3_V13ext6oneapi12experimental6detail33__regcall3__simd_func_call_helperILi16ERFvPfNSt12experimental4simdIfNS7_10__simd_abiILNS7_12_StorageKindE2ELi16EEEEEiEJNS3_7uniformIS6_EEfNSF_IjEEEEENSt13invoke_resultIT0_JDpNS4_9spmd2simdIT1_XT_EvE4typeEEE4typeESJ_SO__5 VectorComputeFunctionINTEL ; 0x00001360
               OpDecorate %__itt_offload_wi_finish_wrapper LinkageAttributes "__itt_offload_wi_finish_wrapper" Export ; 0x0000136c
               OpDecorate %110 Alignment 8 ; 0x0000139c
               OpDecorate %112 SpecId 4285822057 ; 0x000013ac
               OpDecorate %__itt_offload_wi_start_stub LinkageAttributes "__itt_offload_wi_start_stub" Export ; 0x000013bc
               OpDecorate %147 Alignment 8 ; 0x000013e8
               OpDecorate %148 SpecId 4285822057 ; 0x000013f8
               OpDecorate %__itt_offload_wi_finish_stub LinkageAttributes "__itt_offload_wi_finish_stub" Export ; 0x00001408
               OpDecorate %167 Alignment 8 ; 0x00001438
               OpDecorate %168 Alignment 8 ; 0x00001448
               OpDecorate %170 Alignment 4 ; 0x00001458
               OpDecorate %177 Alignment 8 ; 0x00001468
               OpDecorate %178 Alignment 8 ; 0x00001478
               OpDecorate %_Z24__regcall3__SIMD_CALLEE1PfNSt12experimental4simdIfNS0_10__simd_abiILNS0_12_StorageKindE2ELi16EEEEEi LinkageAttributes "_Z24__regcall3__SIMD_CALLEE1PfNSt12experimental4simdIfNS0_10__simd_abiILNS0_12_StorageKindE2ELi16EEEEEi" Export ; 0x00001488
               OpDecorate %_Z24__regcall3__SIMD_CALLEE1PfNSt12experimental4simdIfNS0_10__simd_abiILNS0_12_StorageKindE2ELi16EEEEEi ReferencedIndirectlyINTEL ; 0x00001500
               OpDecorate %_Z24__regcall3__SIMD_CALLEE1PfNSt12experimental4simdIfNS0_10__simd_abiILNS0_12_StorageKindE2ELi16EEEEEi VectorComputeFunctionINTEL ; 0x0000150c
               OpDecorate %190 FPFastMathMode Fast ; 0x00001518
               OpDecorate %_Z24__regcall3__SIMD_CALLEE2PfNSt12experimental4simdIfNS0_10__simd_abiILNS0_12_StorageKindE2ELi16EEEEEi LinkageAttributes "_Z24__regcall3__SIMD_CALLEE2PfNSt12experimental4simdIfNS0_10__simd_abiILNS0_12_StorageKindE2ELi16EEEEEi" Export ; 0x00001528
               OpDecorate %_Z24__regcall3__SIMD_CALLEE2PfNSt12experimental4simdIfNS0_10__simd_abiILNS0_12_StorageKindE2ELi16EEEEEi ReferencedIndirectlyINTEL ; 0x000015a0
               OpDecorate %_Z24__regcall3__SIMD_CALLEE2PfNSt12experimental4simdIfNS0_10__simd_abiILNS0_12_StorageKindE2ELi16EEEEEi VectorComputeFunctionINTEL ; 0x000015ac
               OpDecorate %200 FPFastMathMode Fast ; 0x000015b8
               OpDecorate %201 FPFastMathMode Fast ; 0x000015c8
               OpDecorate %_Z28__regcall3__SIMD_CALLEE_VOIDPfNSt12experimental4simdIfNS0_10__simd_abiILNS0_12_StorageKindE2ELi16EEEEEi LinkageAttributes "_Z28__regcall3__SIMD_CALLEE_VOIDPfNSt12experimental4simdIfNS0_10__simd_abiILNS0_12_StorageKindE2ELi16EEEEEi" Export ; 0x000015d8
               OpDecorate %_Z28__regcall3__SIMD_CALLEE_VOIDPfNSt12experimental4simdIfNS0_10__simd_abiILNS0_12_StorageKindE2ELi16EEEEEi ReferencedIndirectlyINTEL ; 0x00001654
               OpDecorate %_Z28__regcall3__SIMD_CALLEE_VOIDPfNSt12experimental4simdIfNS0_10__simd_abiILNS0_12_StorageKindE2ELi16EEEEEi VectorComputeFunctionINTEL ; 0x00001660
               OpDecorate %216 FPFastMathMode Fast ; 0x0000166c
               OpDecorate %217 FPFastMathMode Fast ; 0x0000167c
               OpDecorate %223 FPFastMathMode Fast ; 0x0000168c
       %uint = OpTypeInt 32 0 ; 0x0000169c
      %ulong = OpTypeInt 64 0 ; 0x000016ac
      %uchar = OpTypeInt 8 0 ; 0x000016bc
     %uint_4 = OpConstant %uint 4 ; 0x000016cc
     %uint_6 = OpConstant %uint 6 ; 0x000016dc
%ulong_2147483648 = OpConstant %ulong 2147483648 ; 0x000016ec
    %ulong_3 = OpConstant %ulong 3 ; 0x00001700
        %112 = OpSpecConstant %uchar 0 ; 0x00001714
    %uchar_0 = OpConstant %uchar 0 ; 0x00001724
    %ulong_0 = OpConstant %ulong 0 ; 0x00001734
    %ulong_1 = OpConstant %ulong 1 ; 0x00001748
    %ulong_2 = OpConstant %ulong 2 ; 0x0000175c
        %148 = OpSpecConstant %uchar 0 ; 0x00001770
%_ptr_CrossWorkgroup_uint = OpTypePointer CrossWorkgroup %uint ; 0x00001780
    %v3ulong = OpTypeVector %ulong 3 ; 0x00001790
%_ptr_CrossWorkgroup_v3ulong = OpTypePointer CrossWorkgroup %v3ulong ; 0x000017a0
%_ptr_CrossWorkgroup_ulong = OpTypePointer CrossWorkgroup %ulong ; 0x000017b0
      %float = OpTypeFloat 32 ; 0x000017c0
   %v16float = OpTypeVector %float 16 ; 0x000017cc
%_ptr_Generic_float = OpTypePointer Generic %float ; 0x000017dc
         %16 = OpTypeFunction %v16float %_ptr_Generic_float %v16float %uint ; 0x000017ec
%_ptr_Function_16 = OpTypePointer Function %16 ; 0x00001804
         %18 = OpTypeFunction %float %_ptr_Function_16 %_ptr_Generic_float %float %uint ; 0x00001814
       %void = OpTypeVoid ; 0x00001830
         %30 = OpTypeFunction %void %_ptr_Generic_float %v16float %uint ; 0x00001838
%_ptr_Function_30 = OpTypePointer Function %30 ; 0x00001850
         %32 = OpTypeFunction %void %_ptr_Function_30 %_ptr_Generic_float %float %uint ; 0x00001860
         %38 = OpTypeFunction %v16float %ulong ; 0x0000187c
%_ptr_CrossWorkgroup_float = OpTypePointer CrossWorkgroup %float ; 0x0000188c
         %42 = OpTypeFunction %void %_ptr_CrossWorkgroup_float %_ptr_CrossWorkgroup_float %_ptr_CrossWorkgroup_float ; 0x0000189c
         %48 = OpTypeFunction %void ; 0x000018b4
       %bool = OpTypeBool ; 0x000018c0
%_arr_ulong_ulong_3 = OpTypeArray %ulong %ulong_3 ; 0x000018c8
%_ptr_Function__arr_ulong_ulong_3 = OpTypePointer Function %_arr_ulong_ulong_3 ; 0x000018d8
%_ptr_Function_uchar = OpTypePointer Function %uchar ; 0x000018e8
%_ptr_Function_ulong = OpTypePointer Function %ulong ; 0x000018f8
%_ptr_Generic_ulong = OpTypePointer Generic %ulong ; 0x00001908
        %138 = OpTypeFunction %void %_ptr_Generic_ulong %ulong %uint ; 0x00001918
        %160 = OpTypeFunction %void %_ptr_Generic_ulong %ulong ; 0x00001930
%_ptr_Function__ptr_Generic_ulong = OpTypePointer Function %_ptr_Generic_ulong ; 0x00001944
%_ptr_Function_uint = OpTypePointer Function %uint ; 0x00001954
%_ptr_Generic__ptr_Generic_ulong = OpTypePointer Generic %_ptr_Generic_ulong ; 0x00001964
%_ptr_Generic_uint = OpTypePointer Generic %uint ; 0x00001974
%__spirv_BuiltInSubgroupId = OpVariable %_ptr_CrossWorkgroup_uint CrossWorkgroup ; 0x00001984
%__spirv_BuiltInSubgroupLocalInvocationId = OpVariable %_ptr_CrossWorkgroup_uint CrossWorkgroup ; 0x00001994
%__spirv_BuiltInWorkgroupId = OpVariable %_ptr_CrossWorkgroup_v3ulong CrossWorkgroup ; 0x000019a4
%__spirv_BuiltInGlobalLinearId = OpVariable %_ptr_CrossWorkgroup_ulong CrossWorkgroup ; 0x000019b4
%__spirv_BuiltInWorkgroupSize = OpVariable %_ptr_CrossWorkgroup_v3ulong CrossWorkgroup ; 0x000019c4
%_Z33__regcall3____builtin_invoke_simdILb1EfPFNSt12experimental4simdIfNS0_10__simd_abiILNS0_12_StorageKindE2ELi16EEEEERFS5_PfS5_iES6_S5_jEJPS7_S6_fjEvET0_T1_DpT2__2 = OpFunction %float None %18 ; 0x000019d4
         %20 = OpFunctionParameter %_ptr_Function_16 ; 0x000019e8
         %21 = OpFunctionParameter %_ptr_Generic_float ; 0x000019f4
         %22 = OpFunctionParameter %float ; 0x00001a00
         %23 = OpFunctionParameter %uint ; 0x00001a0c
               OpFunctionEnd ; 0x00001a18
%_Z33__regcall3____builtin_invoke_simdILb1EfPFNSt12experimental4simdIfNS0_10__simd_abiILNS0_12_StorageKindE2ELi16EEEEERFS5_PfS5_iES6_S5_jEJPS7_S6_fjEvET0_T1_DpT2__4 = OpFunction %float None %18 ; 0x00001a1c
         %25 = OpFunctionParameter %_ptr_Function_16 ; 0x00001a30
         %26 = OpFunctionParameter %_ptr_Generic_float ; 0x00001a3c
         %27 = OpFunctionParameter %float ; 0x00001a48
         %28 = OpFunctionParameter %uint ; 0x00001a54
               OpFunctionEnd ; 0x00001a60
%_Z33__regcall3____builtin_invoke_simdILb1EvPFvRFvPfNSt12experimental4simdIfNS1_10__simd_abiILNS1_12_StorageKindE2ELi16EEEEEiES0_S6_jEJPS7_S0_fjEvET0_T1_DpT2__6 = OpFunction %void None %32 ; 0x00001a64
         %34 = OpFunctionParameter %_ptr_Function_30 ; 0x00001a78
         %35 = OpFunctionParameter %_ptr_Generic_float ; 0x00001a84
         %36 = OpFunctionParameter %float ; 0x00001a90
         %37 = OpFunctionParameter %uint ; 0x00001a9c
               OpFunctionEnd ; 0x00001aa8
%llvm_genx_svm_block_ld_unaligned_v16f32_i64 = OpFunction %v16float Inline %38 ; 0x00001aac
         %40 = OpFunctionParameter %ulong ; 0x00001ac0
               OpFunctionEnd ; 0x00001acc
         %43 = OpFunction %void None %42 ; 0x00001ad0
         %44 = OpFunctionParameter %_ptr_CrossWorkgroup_float ; 0x00001ae4
         %45 = OpFunctionParameter %_ptr_CrossWorkgroup_float ; 0x00001af0
         %46 = OpFunctionParameter %_ptr_CrossWorkgroup_float ; 0x00001afc
         %47 = OpLabel ; 0x00001b08
         %50 = OpFunctionCall %void %__itt_offload_wi_start_wrapper ; 0x00001b10
         %51 = OpPtrCastToGeneric %_ptr_Generic_float %44 ; 0x00001b20
         %59 = OpLoad %v3ulong %__spirv_BuiltInWorkgroupId Aligned|NoAliasINTELMask 32 %58 ; 0x00001b30
         %60 = OpCompositeExtract %ulong %59 0 ; 0x00001b4c
         %64 = OpLoad %uint %__spirv_BuiltInSubgroupId Aligned|NoAliasINTELMask 4 %63 ; 0x00001b60
         %66 = OpShiftLeftLogical %uint %64 %uint_4 ; 0x00001b7c
         %67 = OpUConvert %uint %60 ; 0x00001b90
         %69 = OpShiftLeftLogical %uint %67 %uint_6 ; 0x00001ba0
         %70 = OpIAdd %uint %69 %66 ; 0x00001bb4
         %74 = OpLoad %uint %__spirv_BuiltInSubgroupLocalInvocationId Aligned|NoAliasINTELMask 4 %73 ; 0x00001bc8
         %75 = OpUConvert %ulong %74 ; 0x00001be4
         %76 = OpUConvert %ulong %70 ; 0x00001bf4
         %77 = OpIAdd %ulong %75 %76 ; 0x00001c04
         %80 = OpULessThan %bool %77 %ulong_2147483648 ; 0x00001c18
               OpAssumeTrueKHR %80 ; 0x00001c2c
         %81 = OpInBoundsPtrAccessChain %_ptr_CrossWorkgroup_float %45 %77 ; 0x00001c34
         %82 = OpLoad %float %81 Aligned 4 ; 0x00001c48
         %87 = OpFunctionCall %float %_Z33__regcall3____builtin_invoke_simdILb1EfPFNSt12experimental4simdIfNS0_10__simd_abiILNS0_12_StorageKindE2ELi16EEEEERFS5_PfS5_iES6_S5_jEJPS7_S6_fjEvET0_T1_DpT2__4 %_ZN4sycl3_V13ext6oneapi12experimental6detail33__regcall3__simd_func_call_helperILi16ERFNSt12experimental4simdIfNS6_10__simd_abiILNS6_12_StorageKindE2ELi16EEEEEPfSB_iEJNS3_7uniformISC_EEfNSF_IjEEEEENSt13invoke_resultIT0_JDpNS4_9spmd2simdIT1_XT_EvE4typeEEE4typeESJ_SO__3 %51 %82 %70 ; 0x00001c60
         %88 = OpLoad %float %81 Aligned 4 ; 0x00001c80
         %93 = OpFunctionCall %float %_Z33__regcall3____builtin_invoke_simdILb1EfPFNSt12experimental4simdIfNS0_10__simd_abiILNS0_12_StorageKindE2ELi16EEEEERFS5_PfS5_iES6_S5_jEJPS7_S6_fjEvET0_T1_DpT2__2 %_ZN4sycl3_V13ext6oneapi12experimental6detail33__regcall3__simd_func_call_helperILi16ERFNSt12experimental4simdIfNS6_10__simd_abiILNS6_12_StorageKindE2ELi16EEEEEPfSB_iEJNS3_7uniformISC_EEfNSF_IjEEEEENSt13invoke_resultIT0_JDpNS4_9spmd2simdIT1_XT_EvE4typeEEE4typeESJ_SO__1 %51 %88 %70 ; 0x00001c98
         %94 = OpFAdd %float %87 %93 ; 0x00001cb8
         %95 = OpLoad %float %81 Aligned 4 ; 0x00001ccc
        %100 = OpFunctionCall %void %_Z33__regcall3____builtin_invoke_simdILb1EvPFvRFvPfNSt12experimental4simdIfNS1_10__simd_abiILNS1_12_StorageKindE2ELi16EEEEEiES0_S6_jEJPS7_S0_fjEvET0_T1_DpT2__6 %_ZN4sycl3_V13ext6oneapi12experimental6detail33__regcall3__simd_func_call_helperILi16ERFvPfNSt12experimental4simdIfNS7_10__simd_abiILNS7_12_StorageKindE2ELi16EEEEEiEJNS3_7uniformIS6_EEfNSF_IjEEEEENSt13invoke_resultIT0_JDpNS4_9spmd2simdIT1_XT_EvE4typeEEE4typeESJ_SO__5 %51 %95 %70 ; 0x00001ce4
        %101 = OpInBoundsPtrAccessChain %_ptr_CrossWorkgroup_float %46 %77 ; 0x00001d04
               OpStore %101 %94 Aligned 4 ; 0x00001d18
        %103 = OpFunctionCall %void %__itt_offload_wi_finish_wrapper ; 0x00001d2c
               OpReturn ; 0x00001d3c
               OpFunctionEnd ; 0x00001d40
%__itt_offload_wi_start_wrapper = OpFunction %void Inline %48 ; 0x00001d44
        %104 = OpLabel ; 0x00001d58
        %110 = OpVariable %_ptr_Function__arr_ulong_ulong_3 Function ; 0x00001d60
        %114 = OpIEqual %bool %112 %uchar_0 ; 0x00001d70
               OpBranchConditional %114 %106 %105 ; 0x00001d84
        %105 = OpLabel ; 0x00001d94
        %116 = OpBitcast %_ptr_Function_uchar %110 ; 0x00001d9c
               OpLifetimeStart %116 24 ; 0x00001dac
        %119 = OpInBoundsPtrAccessChain %_ptr_Function_ulong %110 %ulong_0 %ulong_0 ; 0x00001db8
        %121 = OpPtrCastToGeneric %_ptr_Generic_ulong %119 ; 0x00001dd0
        %122 = OpLoad %v3ulong %__spirv_BuiltInWorkgroupId Aligned 32 ; 0x00001de0
        %123 = OpCompositeExtract %ulong %122 0 ; 0x00001df8
               OpStore %119 %123 Aligned 8 ; 0x00001e0c
        %125 = OpInBoundsPtrAccessChain %_ptr_Function_ulong %110 %ulong_0 %ulong_1 ; 0x00001e20
        %126 = OpCompositeExtract %ulong %122 1 ; 0x00001e38
               OpStore %125 %126 Aligned 8 ; 0x00001e4c
        %128 = OpInBoundsPtrAccessChain %_ptr_Function_ulong %110 %ulong_0 %ulong_2 ; 0x00001e60
        %129 = OpCompositeExtract %ulong %122 2 ; 0x00001e78
               OpStore %128 %129 Aligned 8 ; 0x00001e8c
        %130 = OpLoad %ulong %__spirv_BuiltInGlobalLinearId Aligned 8 ; 0x00001ea0
        %131 = OpLoad %v3ulong %__spirv_BuiltInWorkgroupSize Aligned 32 ; 0x00001eb8
        %132 = OpCompositeExtract %ulong %131 0 ; 0x00001ed0
        %133 = OpCompositeExtract %ulong %131 1 ; 0x00001ee4
        %134 = OpIMul %ulong %132 %133 ; 0x00001ef8
        %135 = OpCompositeExtract %ulong %131 2 ; 0x00001f0c
        %136 = OpIMul %ulong %134 %135 ; 0x00001f20
        %137 = OpUConvert %uint %136 ; 0x00001f34
        %143 = OpFunctionCall %void %__itt_offload_wi_start_stub %121 %130 %137 ; 0x00001f44
               OpLifetimeStop %116 24 ; 0x00001f60
               OpBranch %106 ; 0x00001f6c
        %106 = OpLabel ; 0x00001f74
               OpReturn ; 0x00001f7c
               OpFunctionEnd ; 0x00001f80
%_ZN4sycl3_V13ext6oneapi12experimental6detail33__regcall3__simd_func_call_helperILi16ERFNSt12experimental4simdIfNS6_10__simd_abiILNS6_12_StorageKindE2ELi16EEEEEPfSB_iEJNS3_7uniformISC_EEfNSF_IjEEEEENSt13invoke_resultIT0_JDpNS4_9spmd2simdIT1_XT_EvE4typeEEE4typeESJ_SO__3 = OpFunction %v16float None %16 ; 0x00001f84
         %84 = OpFunctionParameter %_ptr_Generic_float ; 0x00001f98
         %85 = OpFunctionParameter %v16float ; 0x00001fa4
         %86 = OpFunctionParameter %uint ; 0x00001fb0
        %218 = OpLabel ; 0x00001fbc
        %219 = OpSConvert %ulong %86 ; 0x00001fc4
        %220 = OpInBoundsPtrAccessChain %_ptr_Generic_float %84 %219 ; 0x00001fd4
        %221 = OpConvertPtrToU %ulong %220 ; 0x00001fe8
 %_esimd_i_0 = OpFunctionCall %v16float %llvm_genx_svm_block_ld_unaligned_v16f32_i64 %221 ; 0x00001ff8
        %223 = OpFAdd %v16float %_esimd_i_0 %85 ; 0x0000200c
               OpReturnValue %223 ; 0x00002020
               OpFunctionEnd ; 0x00002028
%_ZN4sycl3_V13ext6oneapi12experimental6detail33__regcall3__simd_func_call_helperILi16ERFNSt12experimental4simdIfNS6_10__simd_abiILNS6_12_StorageKindE2ELi16EEEEEPfSB_iEJNS3_7uniformISC_EEfNSF_IjEEEEENSt13invoke_resultIT0_JDpNS4_9spmd2simdIT1_XT_EvE4typeEEE4typeESJ_SO__1 = OpFunction %v16float None %16 ; 0x0000202c
         %90 = OpFunctionParameter %_ptr_Generic_float ; 0x00002040
         %91 = OpFunctionParameter %v16float ; 0x0000204c
         %92 = OpFunctionParameter %uint ; 0x00002058
        %211 = OpLabel ; 0x00002064
        %212 = OpSConvert %ulong %92 ; 0x0000206c
        %213 = OpInBoundsPtrAccessChain %_ptr_Generic_float %90 %212 ; 0x0000207c
        %214 = OpConvertPtrToU %ulong %213 ; 0x00002090
   %_esimd_i = OpFunctionCall %v16float %llvm_genx_svm_block_ld_unaligned_v16f32_i64 %214 ; 0x000020a0
        %216 = OpFAdd %v16float %_esimd_i %91 ; 0x000020b4
        %217 = OpFAdd %v16float %216 %216 ; 0x000020c8
               OpReturnValue %217 ; 0x000020dc
               OpFunctionEnd ; 0x000020e4
%_ZN4sycl3_V13ext6oneapi12experimental6detail33__regcall3__simd_func_call_helperILi16ERFvPfNSt12experimental4simdIfNS7_10__simd_abiILNS7_12_StorageKindE2ELi16EEEEEiEJNS3_7uniformIS6_EEfNSF_IjEEEEENSt13invoke_resultIT0_JDpNS4_9spmd2simdIT1_XT_EvE4typeEEE4typeESJ_SO__5 = OpFunction %void None %30 ; 0x000020e8
         %97 = OpFunctionParameter %_ptr_Generic_float ; 0x000020fc
         %98 = OpFunctionParameter %v16float ; 0x00002108
         %99 = OpFunctionParameter %uint ; 0x00002114
        %224 = OpLabel ; 0x00002120
        %225 = OpSConvert %ulong %99 ; 0x00002128
        %226 = OpInBoundsPtrAccessChain %_ptr_Generic_float %97 %225 ; 0x00002138
        %227 = OpConvertPtrToU %ulong %226 ; 0x0000214c
 %_esimd_i_1 = OpFunctionCall %v16float %llvm_genx_svm_block_ld_unaligned_v16f32_i64 %227 ; 0x0000215c
               OpReturn ; 0x00002170
               OpFunctionEnd ; 0x00002174
%__itt_offload_wi_finish_wrapper = OpFunction %void Inline %48 ; 0x00002178
        %144 = OpLabel ; 0x0000218c
        %147 = OpVariable %_ptr_Function__arr_ulong_ulong_3 Function ; 0x00002194
        %149 = OpIEqual %bool %148 %uchar_0 ; 0x000021a4
               OpBranchConditional %149 %146 %145 ; 0x000021b8
        %145 = OpLabel ; 0x000021c8
        %150 = OpBitcast %_ptr_Function_uchar %147 ; 0x000021d0
               OpLifetimeStart %150 24 ; 0x000021e0
        %151 = OpInBoundsPtrAccessChain %_ptr_Function_ulong %147 %ulong_0 %ulong_0 ; 0x000021ec
        %152 = OpPtrCastToGeneric %_ptr_Generic_ulong %151 ; 0x00002204
        %153 = OpLoad %v3ulong %__spirv_BuiltInWorkgroupId Aligned 32 ; 0x00002214
        %154 = OpCompositeExtract %ulong %153 0 ; 0x0000222c
               OpStore %151 %154 Aligned 8 ; 0x00002240
        %155 = OpInBoundsPtrAccessChain %_ptr_Function_ulong %147 %ulong_0 %ulong_1 ; 0x00002254
        %156 = OpCompositeExtract %ulong %153 1 ; 0x0000226c
               OpStore %155 %156 Aligned 8 ; 0x00002280
        %157 = OpInBoundsPtrAccessChain %_ptr_Function_ulong %147 %ulong_0 %ulong_2 ; 0x00002294
        %158 = OpCompositeExtract %ulong %153 2 ; 0x000022ac
               OpStore %157 %158 Aligned 8 ; 0x000022c0
        %159 = OpLoad %ulong %__spirv_BuiltInGlobalLinearId Aligned 8 ; 0x000022d4
        %164 = OpFunctionCall %void %__itt_offload_wi_finish_stub %152 %159 ; 0x000022ec
               OpLifetimeStop %150 24 ; 0x00002304
               OpBranch %146 ; 0x00002310
        %146 = OpLabel ; 0x00002318
               OpReturn ; 0x00002320
               OpFunctionEnd ; 0x00002324
%__itt_offload_wi_start_stub = OpFunction %void DontInline|OptNoneINTEL %138 ; 0x00002328
        %140 = OpFunctionParameter %_ptr_Generic_ulong ; 0x0000233c
        %141 = OpFunctionParameter %ulong ; 0x00002348
        %142 = OpFunctionParameter %uint ; 0x00002354
        %165 = OpLabel ; 0x00002360
        %167 = OpVariable %_ptr_Function__ptr_Generic_ulong Function ; 0x00002368
        %168 = OpVariable %_ptr_Function_ulong Function ; 0x00002378
        %170 = OpVariable %_ptr_Function_uint Function ; 0x00002388
        %172 = OpPtrCastToGeneric %_ptr_Generic__ptr_Generic_ulong %167 ; 0x00002398
        %173 = OpPtrCastToGeneric %_ptr_Generic_ulong %168 ; 0x000023a8
        %175 = OpPtrCastToGeneric %_ptr_Generic_uint %170 ; 0x000023b8
               OpStore %172 %140 Aligned 8 ; 0x000023c8
               OpStore %173 %141 Aligned 8 ; 0x000023dc
               OpStore %175 %142 Aligned 4 ; 0x000023f0
               OpReturn ; 0x00002404
               OpFunctionEnd ; 0x00002408
%__itt_offload_wi_finish_stub = OpFunction %void DontInline|OptNoneINTEL %160 ; 0x0000240c
        %162 = OpFunctionParameter %_ptr_Generic_ulong ; 0x00002420
        %163 = OpFunctionParameter %ulong ; 0x0000242c
        %176 = OpLabel ; 0x00002438
        %177 = OpVariable %_ptr_Function__ptr_Generic_ulong Function ; 0x00002440
        %178 = OpVariable %_ptr_Function_ulong Function ; 0x00002450
        %179 = OpPtrCastToGeneric %_ptr_Generic__ptr_Generic_ulong %177 ; 0x00002460
        %180 = OpPtrCastToGeneric %_ptr_Generic_ulong %178 ; 0x00002470
               OpStore %179 %162 Aligned 8 ; 0x00002480
               OpStore %180 %163 Aligned 8 ; 0x00002494
               OpReturn ; 0x000024a8
               OpFunctionEnd ; 0x000024ac
%_Z24__regcall3__SIMD_CALLEE1PfNSt12experimental4simdIfNS0_10__simd_abiILNS0_12_StorageKindE2ELi16EEEEEi = OpFunction %v16float Inline %16 ; 0x000024b0
        %182 = OpFunctionParameter %_ptr_Generic_float ; 0x000024c4
        %183 = OpFunctionParameter %v16float ; 0x000024d0
        %184 = OpFunctionParameter %uint ; 0x000024dc
        %185 = OpLabel ; 0x000024e8
        %186 = OpSConvert %ulong %184 ; 0x000024f0
        %187 = OpInBoundsPtrAccessChain %_ptr_Generic_float %182 %186 ; 0x00002500
        %188 = OpConvertPtrToU %ulong %187 ; 0x00002514
     %_esimd = OpFunctionCall %v16float %llvm_genx_svm_block_ld_unaligned_v16f32_i64 %188 ; 0x00002524
        %190 = OpFAdd %v16float %_esimd %183 ; 0x00002538
               OpReturnValue %190 ; 0x0000254c
               OpFunctionEnd ; 0x00002554
%_Z24__regcall3__SIMD_CALLEE2PfNSt12experimental4simdIfNS0_10__simd_abiILNS0_12_StorageKindE2ELi16EEEEEi = OpFunction %v16float Inline %16 ; 0x00002558
        %192 = OpFunctionParameter %_ptr_Generic_float ; 0x0000256c
        %193 = OpFunctionParameter %v16float ; 0x00002578
        %194 = OpFunctionParameter %uint ; 0x00002584
        %195 = OpLabel ; 0x00002590
        %196 = OpSConvert %ulong %194 ; 0x00002598
        %197 = OpInBoundsPtrAccessChain %_ptr_Generic_float %192 %196 ; 0x000025a8
        %198 = OpConvertPtrToU %ulong %197 ; 0x000025bc
   %_esimd_0 = OpFunctionCall %v16float %llvm_genx_svm_block_ld_unaligned_v16f32_i64 %198 ; 0x000025cc
        %200 = OpFAdd %v16float %_esimd_0 %193 ; 0x000025e0
        %201 = OpFAdd %v16float %200 %200 ; 0x000025f4
               OpReturnValue %201 ; 0x00002608
               OpFunctionEnd ; 0x00002610
%_Z28__regcall3__SIMD_CALLEE_VOIDPfNSt12experimental4simdIfNS0_10__simd_abiILNS0_12_StorageKindE2ELi16EEEEEi = OpFunction %void Inline %30 ; 0x00002614
        %203 = OpFunctionParameter %_ptr_Generic_float ; 0x00002628
        %204 = OpFunctionParameter %v16float ; 0x00002634
        %205 = OpFunctionParameter %uint ; 0x00002640
        %206 = OpLabel ; 0x0000264c
        %207 = OpSConvert %ulong %205 ; 0x00002654
        %208 = OpInBoundsPtrAccessChain %_ptr_Generic_float %203 %207 ; 0x00002664
        %209 = OpConvertPtrToU %ulong %208 ; 0x00002678
   %_esimd_1 = OpFunctionCall %v16float %llvm_genx_svm_block_ld_unaligned_v16f32_i64 %209 ; 0x00002688
               OpReturn ; 0x0000269c
               OpFunctionEnd ; 0x000026a0
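
For readers skimming the disassembly, here is a rough Python model of what the three `SIMD_CALLEE` functions above appear to compute. It is only a reading of the dump, not the original DPC++ source. The helper names (`block_load`, `simd_callee1`, `simd_callee2`, `simd_callee_void`) are hypothetical, and device memory is modelled as a plain Python list indexed in units of `float`.

```python
from typing import List, Optional

LANES = 16  # width of %v16float in the disassembly


def block_load(mem: List[float], offset: int) -> List[float]:
    """Hypothetical stand-in for llvm_genx_svm_block_ld_unaligned_v16f32_i64:
    read 16 contiguous floats starting at element `offset`."""
    return mem[offset:offset + LANES]


def simd_callee1(mem: List[float], vec: List[float], offset: int) -> List[float]:
    # OpInBoundsPtrAccessChain + block load, then one OpFAdd with the vector argument.
    loaded = block_load(mem, offset)
    return [a + b for a, b in zip(loaded, vec)]


def simd_callee2(mem: List[float], vec: List[float], offset: int) -> List[float]:
    # Same as callee 1, followed by OpFAdd of the sum with itself (doubling).
    summed = simd_callee1(mem, vec, offset)
    return [x + x for x in summed]


def simd_callee_void(mem: List[float], vec: List[float], offset: int) -> Optional[List[float]]:
    # The load result is unused in the dump and the function returns void.
    block_load(mem, offset)
    return None
```

The `__simd_func_call_helper` functions in the dump have the same bodies as these callees, which suggests the callees were inlined into the helpers that are referenced through the function pointer.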
chengjunlu commented 1 year ago

The SPIR-V dialect IR we are working on:

// -----// IR Dump After CSE (cse) //----- //
module attributes {spirv.target_env = #spirv.target_env<#spirv.vce<v1.4, [Addresses, Float16Buffer, Int64, Int16, Int8, Kernel, Linkage, Vector16, GenericPointer, Groups, Float16, Float64, AtomicFloat32AddEXT, ExpectAssumeKHR], [SPV_EXT_shader_atomic_float_add, SPV_KHR_expect_assume]>, api=OpenCL, #spirv.resource_limits<>>, "triton_gpu.num-warps" = 1 : i32, triton_gpu.shared = 2048 : i32, "triton_gpu.threads-per-warp" = 32 : i32} {
  spirv.GlobalVariable @__builtin_var_LocalInvocationId__ built_in("LocalInvocationId") : !spirv.ptr<vector<3xi64>, Input>
  spirv.GlobalVariable @__builtin_var_WorkgroupId__ built_in("WorkgroupId") : !spirv.ptr<vector<3xi64>, Input>
  spirv.func @SIMDwrapper(%arg0: vector<16xi32>) -> vector<16xi32> "None" attributes {referenced_indirectly_i_n_t_e_l, stack_call_i_n_t_e_l, vector_compute_function_i_n_t_e_l} {
    %0 = spirv.Undef : vector<16xi32>
    %cst0_i32 = spirv.Constant 0 : i32
    %1 = spirv.VectorInsertDynamic %cst0_i32, %0[%cst0_i32] : vector<16xi32>, i32
    %cst1_i32 = spirv.Constant 1 : i32
    %2 = spirv.VectorInsertDynamic %cst0_i32, %1[%cst1_i32] : vector<16xi32>, i32
    %cst2_i32 = spirv.Constant 2 : i32
    %3 = spirv.VectorInsertDynamic %cst0_i32, %2[%cst2_i32] : vector<16xi32>, i32
    %cst3_i32 = spirv.Constant 3 : i32
    %4 = spirv.VectorInsertDynamic %cst0_i32, %3[%cst3_i32] : vector<16xi32>, i32
    %cst4_i32 = spirv.Constant 4 : i32
    %5 = spirv.VectorInsertDynamic %cst0_i32, %4[%cst4_i32] : vector<16xi32>, i32
    %cst5_i32 = spirv.Constant 5 : i32
    %6 = spirv.VectorInsertDynamic %cst0_i32, %5[%cst5_i32] : vector<16xi32>, i32
    %cst6_i32 = spirv.Constant 6 : i32
    %7 = spirv.VectorInsertDynamic %cst0_i32, %6[%cst6_i32] : vector<16xi32>, i32
    %cst7_i32 = spirv.Constant 7 : i32
    %8 = spirv.VectorInsertDynamic %cst0_i32, %7[%cst7_i32] : vector<16xi32>, i32
    %cst8_i32 = spirv.Constant 8 : i32
    %9 = spirv.VectorInsertDynamic %cst0_i32, %8[%cst8_i32] : vector<16xi32>, i32
    %cst9_i32 = spirv.Constant 9 : i32
    %10 = spirv.VectorInsertDynamic %cst0_i32, %9[%cst9_i32] : vector<16xi32>, i32
    %cst10_i32 = spirv.Constant 10 : i32
    %11 = spirv.VectorInsertDynamic %cst0_i32, %10[%cst10_i32] : vector<16xi32>, i32
    %cst11_i32 = spirv.Constant 11 : i32
    %12 = spirv.VectorInsertDynamic %cst0_i32, %11[%cst11_i32] : vector<16xi32>, i32
    %cst12_i32 = spirv.Constant 12 : i32
    %13 = spirv.VectorInsertDynamic %cst0_i32, %12[%cst12_i32] : vector<16xi32>, i32
    %cst13_i32 = spirv.Constant 13 : i32
    %14 = spirv.VectorInsertDynamic %cst0_i32, %13[%cst13_i32] : vector<16xi32>, i32
    %cst14_i32 = spirv.Constant 14 : i32
    %15 = spirv.VectorInsertDynamic %cst0_i32, %14[%cst14_i32] : vector<16xi32>, i32
    %cst15_i32 = spirv.Constant 15 : i32
    %16 = spirv.VectorInsertDynamic %cst0_i32, %15[%cst15_i32] : vector<16xi32>, i32
    spirv.ReturnValue %16 : vector<16xi32>
  }
  spirv.func @_Z33__regcall3____builtin_invoke_simdSIMDwrapper(!spirv.ptr<(vector<16xi32>) -> vector<16xi32>, CodeSectionINTEL>, i32) -> i32 "Inline" attributes {libname = "libdevice", libpath = "", linkage_attributes = ["_Z33__regcall3____builtin_invoke_simdSIMDwrapper", "Import"]}
  spirv.func @_kernel_0d1d2d3d4d5d6d7c8d9c10d11c(%arg0: !spirv.ptr<f16, CrossWorkgroup> {tt.divisibility = 16 : i32}, %arg1: !spirv.ptr<f16, CrossWorkgroup> {tt.divisibility = 16 : i32}, %arg2: !spirv.ptr<f16, CrossWorkgroup> {tt.divisibility = 16 : i32}, %arg3: i32 {tt.divisibility = 16 : i32}, %arg4: i32 {tt.divisibility = 16 : i32}, %arg5: i32 {tt.divisibility = 16 : i32}, %arg6: i32 {tt.divisibility = 16 : i32}, %arg7: i32 {tt.divisibility = 16 : i32}, %arg8: i32 {tt.divisibility = 16 : i32}, %arg9: !spirv.ptr<i8, Workgroup>) "None" attributes {noinline = false, spirv.entry_point_abi = #spirv.entry_point_abi<>, sym_visibility = "public"} {
    %__builtin_var_LocalInvocationId___addr = spirv.mlir.addressof @__builtin_var_LocalInvocationId__ : !spirv.ptr<vector<3xi64>, Input>
    %0 = spirv.Load "Input" %__builtin_var_LocalInvocationId___addr : vector<3xi64>
    %1 = spirv.CompositeExtract %0[0 : i32] : vector<3xi64>
    %2 = spirv.SConvert %1 : i64 to i32
    %cst32_i32 = spirv.Constant 32 : i32
    %3 = spirv.UMod %2, %cst32_i32 : i32
    %4 = spirv.UDiv %2, %cst32_i32 : i32
    %cst1_i32 = spirv.Constant 1 : i32
    %5 = spirv.UMod %4, %cst1_i32 : i32
    %6 = spirv.UDiv %4, %cst1_i32 : i32
    %7 = spirv.UMod %6, %cst1_i32 : i32
    %cst2_i32 = spirv.Constant 2 : i32
    %8 = spirv.UMod %3, %cst2_i32 : i32
    %9 = spirv.UDiv %3, %cst2_i32 : i32
    %cst16_i32 = spirv.Constant 16 : i32
    %10 = spirv.UMod %9, %cst16_i32 : i32
    %11 = spirv.UMod %7, %cst1_i32 : i32
    %12 = spirv.UMod %10, %cst1_i32 : i32
    %13 = spirv.IMul %11, %cst16_i32 : i32
    %14 = spirv.IAdd %12, %13 : i32
    %15 = spirv.UMod %5, %cst1_i32 : i32
    %16 = spirv.UMod %8, %cst2_i32 : i32
    %cst8_i32 = spirv.Constant 8 : i32
    %17 = spirv.IMul %15, %cst2_i32 : i32
    %18 = spirv.IAdd %16, %17 : i32
    %19 = spirv.IMul %cst8_i32, %18 : i32
    %20 = spirv.IAdd %19, %cst1_i32 : i32
    %21 = spirv.IAdd %19, %cst2_i32 : i32
    %cst3_i32 = spirv.Constant 3 : i32
    %22 = spirv.IAdd %19, %cst3_i32 : i32
    %cst4_i32 = spirv.Constant 4 : i32
    %23 = spirv.IAdd %19, %cst4_i32 : i32
    %cst5_i32 = spirv.Constant 5 : i32
    %24 = spirv.IAdd %19, %cst5_i32 : i32
    %cst6_i32 = spirv.Constant 6 : i32
    %25 = spirv.IAdd %19, %cst6_i32 : i32
    %cst7_i32 = spirv.Constant 7 : i32
    %26 = spirv.IAdd %19, %cst7_i32 : i32
    %27 = spirv.Load "Input" %__builtin_var_LocalInvocationId___addr : vector<3xi64>
    %28 = spirv.CompositeExtract %27[0 : i32] : vector<3xi64>
    %29 = spirv.SConvert %28 : i64 to i32
    %30 = spirv.UMod %29, %cst32_i32 : i32
    %31 = spirv.UDiv %29, %cst32_i32 : i32
    %32 = spirv.UMod %31, %cst1_i32 : i32
    %33 = spirv.UDiv %31, %cst1_i32 : i32
    %34 = spirv.UMod %33, %cst1_i32 : i32
    %35 = spirv.UMod %30, %cst2_i32 : i32
    %36 = spirv.UDiv %30, %cst2_i32 : i32
    %37 = spirv.UMod %36, %cst16_i32 : i32
    %38 = spirv.UMod %34, %cst1_i32 : i32
    %39 = spirv.UMod %37, %cst16_i32 : i32
    %40 = spirv.IMul %38, %cst16_i32 : i32
    %41 = spirv.IAdd %39, %40 : i32
    %42 = spirv.IMul %cst1_i32, %41 : i32
    %43 = spirv.UMod %32, %cst1_i32 : i32
    %44 = spirv.UMod %35, %cst1_i32 : i32
    %45 = spirv.IMul %43, %cst2_i32 : i32
    %46 = spirv.IAdd %44, %45 : i32
    %47 = spirv.Load "Input" %__builtin_var_LocalInvocationId___addr : vector<3xi64>
    %48 = spirv.CompositeExtract %47[0 : i32] : vector<3xi64>
    %49 = spirv.SConvert %48 : i64 to i32
    %50 = spirv.UMod %49, %cst32_i32 : i32
    %51 = spirv.UDiv %49, %cst32_i32 : i32
    %52 = spirv.UMod %51, %cst1_i32 : i32
    %53 = spirv.UDiv %51, %cst1_i32 : i32
    %54 = spirv.UMod %53, %cst1_i32 : i32
    %55 = spirv.UMod %50, %cst2_i32 : i32
    %56 = spirv.UDiv %50, %cst2_i32 : i32
    %57 = spirv.UMod %56, %cst16_i32 : i32
    %58 = spirv.UMod %54, %cst1_i32 : i32
    %59 = spirv.UMod %57, %cst16_i32 : i32
    %60 = spirv.IMul %58, %cst16_i32 : i32
    %61 = spirv.IAdd %59, %60 : i32
    %62 = spirv.IMul %cst1_i32, %61 : i32
    %63 = spirv.UMod %52, %cst1_i32 : i32
    %64 = spirv.UMod %55, %cst2_i32 : i32
    %65 = spirv.IMul %63, %cst2_i32 : i32
    %66 = spirv.IAdd %64, %65 : i32
    %67 = spirv.IMul %cst8_i32, %66 : i32
    %68 = spirv.Load "Input" %__builtin_var_LocalInvocationId___addr : vector<3xi64>
    %69 = spirv.CompositeExtract %68[0 : i32] : vector<3xi64>
    %70 = spirv.SConvert %69 : i64 to i32
    %71 = spirv.UMod %70, %cst32_i32 : i32
    %72 = spirv.UDiv %70, %cst32_i32 : i32
    %73 = spirv.UMod %72, %cst1_i32 : i32
    %74 = spirv.UDiv %72, %cst1_i32 : i32
    %75 = spirv.UMod %74, %cst1_i32 : i32
    %76 = spirv.UMod %71, %cst2_i32 : i32
    %77 = spirv.UDiv %71, %cst2_i32 : i32
    %78 = spirv.UMod %77, %cst16_i32 : i32
    %79 = spirv.UMod %75, %cst1_i32 : i32
    %80 = spirv.UMod %78, %cst16_i32 : i32
    %81 = spirv.IMul %79, %cst16_i32 : i32
    %82 = spirv.IAdd %80, %81 : i32
    %83 = spirv.IMul %cst1_i32, %82 : i32
    %84 = spirv.UMod %73, %cst1_i32 : i32
    %85 = spirv.UMod %76, %cst2_i32 : i32
    %86 = spirv.IMul %84, %cst2_i32 : i32
    %87 = spirv.IAdd %85, %86 : i32
    %88 = spirv.IMul %cst8_i32, %87 : i32
    %cst_f32 = spirv.Constant 0.000000e+00 : f32
    %89 = spirv.Undef : !spirv.struct<(f32, f32, f32, f32, f32, f32, f32, f32)>
    %90 = spirv.CompositeInsert %cst_f32, %89[0 : i32] : f32 into !spirv.struct<(f32, f32, f32, f32, f32, f32, f32, f32)>
    %91 = spirv.CompositeInsert %cst_f32, %90[1 : i32] : f32 into !spirv.struct<(f32, f32, f32, f32, f32, f32, f32, f32)>
    %92 = spirv.CompositeInsert %cst_f32, %91[2 : i32] : f32 into !spirv.struct<(f32, f32, f32, f32, f32, f32, f32, f32)>
    %93 = spirv.CompositeInsert %cst_f32, %92[3 : i32] : f32 into !spirv.struct<(f32, f32, f32, f32, f32, f32, f32, f32)>
    %94 = spirv.CompositeInsert %cst_f32, %93[4 : i32] : f32 into !spirv.struct<(f32, f32, f32, f32, f32, f32, f32, f32)>
    %95 = spirv.CompositeInsert %cst_f32, %94[5 : i32] : f32 into !spirv.struct<(f32, f32, f32, f32, f32, f32, f32, f32)>
    %96 = spirv.CompositeInsert %cst_f32, %95[6 : i32] : f32 into !spirv.struct<(f32, f32, f32, f32, f32, f32, f32, f32)>
    %97 = spirv.CompositeInsert %cst_f32, %96[7 : i32] : f32 into !spirv.struct<(f32, f32, f32, f32, f32, f32, f32, f32)>
    %98 = spirv.Undef : !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %99 = spirv.CompositeInsert %cst16_i32, %98[0 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %100 = spirv.CompositeInsert %cst16_i32, %99[1 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %101 = spirv.CompositeInsert %cst16_i32, %100[2 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %102 = spirv.CompositeInsert %cst16_i32, %101[3 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %103 = spirv.CompositeInsert %cst16_i32, %102[4 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %104 = spirv.CompositeInsert %cst16_i32, %103[5 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %105 = spirv.CompositeInsert %cst16_i32, %104[6 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %cst15_i32 = spirv.Constant 15 : i32
    %cst0_i32 = spirv.Constant 0 : i32
    %__builtin_var_WorkgroupId___addr = spirv.mlir.addressof @__builtin_var_WorkgroupId__ : !spirv.ptr<vector<3xi64>, Input>
    %106 = spirv.Load "Input" %__builtin_var_WorkgroupId___addr : vector<3xi64>
    %107 = spirv.CompositeExtract %106[0 : i32] : vector<3xi64>
    %108 = spirv.SConvert %107 : i64 to i32
    %109 = spirv.Load "Input" %__builtin_var_WorkgroupId___addr : vector<3xi64>
    %110 = spirv.CompositeExtract %109[1 : i32] : vector<3xi64>
    %111 = spirv.SConvert %110 : i64 to i32
    %112 = spirv.IAdd %arg3, %cst15_i32 : i32
    %113 = spirv.SDiv %112, %cst16_i32 : i32
    %114 = spirv.IAdd %arg4, %cst15_i32 : i32
    %115 = spirv.SDiv %114, %cst16_i32 : i32
    %116 = spirv.IMul %115, %cst8_i32 : i32
    %117 = spirv.SDiv %108, %116 : i32
    %118 = spirv.IMul %117, %cst8_i32 : i32
    %119 = spirv.ISub %113, %118 : i32
    %120 = spirv.SLessThan %119, %cst8_i32 : i32
    %121 = spirv.Select %120, %119, %cst8_i32 : i1, i32
    %122 = spirv.SRem %108, %121 : i32
    %123 = spirv.IAdd %118, %122 : i32
    %124 = spirv.SRem %108, %116 : i32
    %125 = spirv.SDiv %124, %121 : i32
    %126 = spirv.IMul %123, %cst16_i32 : i32
    %127 = spirv.CompositeInsert %19, %98[0 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %128 = spirv.CompositeInsert %20, %127[1 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %129 = spirv.CompositeInsert %21, %128[2 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %130 = spirv.CompositeInsert %22, %129[3 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %131 = spirv.CompositeInsert %23, %130[4 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %132 = spirv.CompositeInsert %24, %131[5 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %133 = spirv.CompositeInsert %25, %132[6 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %134 = spirv.Undef : !spirv.struct<(i32)>
    %135 = spirv.IAdd %126, %42 : i32
    %136 = spirv.IMul %125, %cst16_i32 : i32
    %137 = spirv.CompositeInsert %136, %98[0 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %138 = spirv.CompositeInsert %136, %137[1 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %139 = spirv.CompositeInsert %136, %138[2 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %140 = spirv.CompositeInsert %136, %139[3 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %141 = spirv.CompositeInsert %136, %140[4 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %142 = spirv.CompositeInsert %136, %141[5 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %143 = spirv.CompositeInsert %136, %142[6 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %144 = spirv.IAdd %136, %19 : i32
    %145 = spirv.IAdd %136, %20 : i32
    %146 = spirv.IAdd %136, %21 : i32
    %147 = spirv.IAdd %136, %22 : i32
    %148 = spirv.IAdd %136, %23 : i32
    %149 = spirv.IAdd %136, %24 : i32
    %150 = spirv.IAdd %136, %25 : i32
    %151 = spirv.IAdd %136, %26 : i32
    %152 = spirv.CompositeInsert %144, %98[0 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %153 = spirv.CompositeInsert %145, %152[1 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %154 = spirv.CompositeInsert %146, %153[2 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %155 = spirv.CompositeInsert %147, %154[3 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %156 = spirv.CompositeInsert %148, %155[4 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %157 = spirv.CompositeInsert %149, %156[5 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %158 = spirv.CompositeInsert %150, %157[6 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %159 = spirv.SRem %135, %arg3 {tt.contiguity = dense<16> : tensor<1xi32>, tt.divisibility = dense<16> : tensor<1xi32>} : i32
    %160 = spirv.CompositeInsert %arg4, %98[0 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %161 = spirv.CompositeInsert %arg4, %160[1 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %162 = spirv.CompositeInsert %arg4, %161[2 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %163 = spirv.CompositeInsert %arg4, %162[3 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %164 = spirv.CompositeInsert %arg4, %163[4 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %165 = spirv.CompositeInsert %arg4, %164[5 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %166 = spirv.CompositeInsert %arg4, %165[6 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %167 = spirv.SRem %144, %arg4 {tt.contiguity = dense<16> : tensor<1xi32>, tt.divisibility = dense<16> : tensor<1xi32>} : i32
    %168 = spirv.SRem %145, %arg4 {tt.contiguity = dense<16> : tensor<1xi32>, tt.divisibility = dense<16> : tensor<1xi32>} : i32
    %169 = spirv.SRem %146, %arg4 {tt.contiguity = dense<16> : tensor<1xi32>, tt.divisibility = dense<16> : tensor<1xi32>} : i32
    %170 = spirv.SRem %147, %arg4 {tt.contiguity = dense<16> : tensor<1xi32>, tt.divisibility = dense<16> : tensor<1xi32>} : i32
    %171 = spirv.SRem %148, %arg4 {tt.contiguity = dense<16> : tensor<1xi32>, tt.divisibility = dense<16> : tensor<1xi32>} : i32
    %172 = spirv.SRem %149, %arg4 {tt.contiguity = dense<16> : tensor<1xi32>, tt.divisibility = dense<16> : tensor<1xi32>} : i32
    %173 = spirv.SRem %150, %arg4 {tt.contiguity = dense<16> : tensor<1xi32>, tt.divisibility = dense<16> : tensor<1xi32>} : i32
    %174 = spirv.SRem %151, %arg4 {tt.contiguity = dense<16> : tensor<1xi32>, tt.divisibility = dense<16> : tensor<1xi32>} : i32
    %175 = spirv.CompositeInsert %167, %98[0 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %176 = spirv.CompositeInsert %168, %175[1 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %177 = spirv.CompositeInsert %169, %176[2 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %178 = spirv.CompositeInsert %170, %177[3 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %179 = spirv.CompositeInsert %171, %178[4 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %180 = spirv.CompositeInsert %172, %179[5 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %181 = spirv.CompositeInsert %173, %180[6 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %182 = spirv.IMul %111, %cst16_i32 : i32
    %183 = spirv.CompositeInsert %182, %98[0 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %184 = spirv.CompositeInsert %182, %183[1 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %185 = spirv.CompositeInsert %182, %184[2 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %186 = spirv.CompositeInsert %182, %185[3 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %187 = spirv.CompositeInsert %182, %186[4 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %188 = spirv.CompositeInsert %182, %187[5 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %189 = spirv.CompositeInsert %182, %188[6 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %190 = spirv.IAdd %182, %19 : i32
    %191 = spirv.IAdd %182, %20 : i32
    %192 = spirv.IAdd %182, %21 : i32
    %193 = spirv.IAdd %182, %22 : i32
    %194 = spirv.IAdd %182, %23 : i32
    %195 = spirv.IAdd %182, %24 : i32
    %196 = spirv.IAdd %182, %25 : i32
    %197 = spirv.IAdd %182, %26 : i32
    %198 = spirv.CompositeInsert %190, %98[0 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %199 = spirv.CompositeInsert %191, %198[1 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %200 = spirv.CompositeInsert %192, %199[2 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %201 = spirv.CompositeInsert %193, %200[3 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %202 = spirv.CompositeInsert %194, %201[4 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %203 = spirv.CompositeInsert %195, %202[5 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %204 = spirv.CompositeInsert %196, %203[6 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %205 = spirv.IAdd %182, %42 : i32
    %206 = spirv.CompositeInsert %159, %98[0 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %207 = spirv.CompositeInsert %159, %206[1 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %208 = spirv.CompositeInsert %159, %207[2 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %209 = spirv.CompositeInsert %159, %208[3 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %210 = spirv.CompositeInsert %159, %209[4 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %211 = spirv.CompositeInsert %159, %210[5 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %212 = spirv.CompositeInsert %159, %211[6 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %213 = spirv.CompositeInsert %arg6, %98[0 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %214 = spirv.CompositeInsert %arg6, %213[1 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %215 = spirv.CompositeInsert %arg6, %214[2 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %216 = spirv.CompositeInsert %arg6, %215[3 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %217 = spirv.CompositeInsert %arg6, %216[4 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %218 = spirv.CompositeInsert %arg6, %217[5 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %219 = spirv.CompositeInsert %arg6, %218[6 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %220 = spirv.IMul %159, %arg6 : i32
    %221 = spirv.CompositeInsert %220, %98[0 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %222 = spirv.CompositeInsert %220, %221[1 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %223 = spirv.CompositeInsert %220, %222[2 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %224 = spirv.CompositeInsert %220, %223[3 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %225 = spirv.CompositeInsert %220, %224[4 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %226 = spirv.CompositeInsert %220, %225[5 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %227 = spirv.CompositeInsert %220, %226[6 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %228 = spirv.IAdd %220, %190 : i32
    %229 = spirv.IAdd %220, %191 : i32
    %230 = spirv.IAdd %220, %192 : i32
    %231 = spirv.IAdd %220, %193 : i32
    %232 = spirv.IAdd %220, %194 : i32
    %233 = spirv.IAdd %220, %195 : i32
    %234 = spirv.IAdd %220, %196 : i32
    %235 = spirv.IAdd %220, %197 : i32
    %236 = spirv.CompositeInsert %228, %98[0 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %237 = spirv.CompositeInsert %229, %236[1 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %238 = spirv.CompositeInsert %230, %237[2 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %239 = spirv.CompositeInsert %231, %238[3 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %240 = spirv.CompositeInsert %232, %239[4 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %241 = spirv.CompositeInsert %233, %240[5 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %242 = spirv.CompositeInsert %234, %241[6 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %243 = spirv.Undef : !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %244 = spirv.CompositeInsert %arg0, %243[0 : i32] : !spirv.ptr<f16, CrossWorkgroup> into !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %245 = spirv.CompositeInsert %arg0, %244[1 : i32] : !spirv.ptr<f16, CrossWorkgroup> into !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %246 = spirv.CompositeInsert %arg0, %245[2 : i32] : !spirv.ptr<f16, CrossWorkgroup> into !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %247 = spirv.CompositeInsert %arg0, %246[3 : i32] : !spirv.ptr<f16, CrossWorkgroup> into !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %248 = spirv.CompositeInsert %arg0, %247[4 : i32] : !spirv.ptr<f16, CrossWorkgroup> into !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %249 = spirv.CompositeInsert %arg0, %248[5 : i32] : !spirv.ptr<f16, CrossWorkgroup> into !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %250 = spirv.CompositeInsert %arg0, %249[6 : i32] : !spirv.ptr<f16, CrossWorkgroup> into !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %251 = spirv.PtrAccessChain %arg0[%228] : !spirv.ptr<f16, CrossWorkgroup>, i32
    %252 = spirv.PtrAccessChain %arg0[%229] : !spirv.ptr<f16, CrossWorkgroup>, i32
    %253 = spirv.PtrAccessChain %arg0[%230] : !spirv.ptr<f16, CrossWorkgroup>, i32
    %254 = spirv.PtrAccessChain %arg0[%231] : !spirv.ptr<f16, CrossWorkgroup>, i32
    %255 = spirv.PtrAccessChain %arg0[%232] : !spirv.ptr<f16, CrossWorkgroup>, i32
    %256 = spirv.PtrAccessChain %arg0[%233] : !spirv.ptr<f16, CrossWorkgroup>, i32
    %257 = spirv.PtrAccessChain %arg0[%234] : !spirv.ptr<f16, CrossWorkgroup>, i32
    %258 = spirv.PtrAccessChain %arg0[%235] : !spirv.ptr<f16, CrossWorkgroup>, i32
    %259 = spirv.CompositeInsert %251, %243[0 : i32] : !spirv.ptr<f16, CrossWorkgroup> into !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %260 = spirv.CompositeInsert %252, %259[1 : i32] : !spirv.ptr<f16, CrossWorkgroup> into !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %261 = spirv.CompositeInsert %253, %260[2 : i32] : !spirv.ptr<f16, CrossWorkgroup> into !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %262 = spirv.CompositeInsert %254, %261[3 : i32] : !spirv.ptr<f16, CrossWorkgroup> into !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %263 = spirv.CompositeInsert %255, %262[4 : i32] : !spirv.ptr<f16, CrossWorkgroup> into !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %264 = spirv.CompositeInsert %256, %263[5 : i32] : !spirv.ptr<f16, CrossWorkgroup> into !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %265 = spirv.CompositeInsert %257, %264[6 : i32] : !spirv.ptr<f16, CrossWorkgroup> into !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %266 = spirv.CompositeInsert %258, %265[7 : i32] : !spirv.ptr<f16, CrossWorkgroup> into !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %267 = spirv.CompositeInsert %205, %98[0 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %268 = spirv.CompositeInsert %205, %267[1 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %269 = spirv.CompositeInsert %205, %268[2 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %270 = spirv.CompositeInsert %205, %269[3 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %271 = spirv.CompositeInsert %205, %270[4 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %272 = spirv.CompositeInsert %205, %271[5 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %273 = spirv.CompositeInsert %205, %272[6 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %274 = spirv.CompositeInsert %arg7, %98[0 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %275 = spirv.CompositeInsert %arg7, %274[1 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %276 = spirv.CompositeInsert %arg7, %275[2 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %277 = spirv.CompositeInsert %arg7, %276[3 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %278 = spirv.CompositeInsert %arg7, %277[4 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %279 = spirv.CompositeInsert %arg7, %278[5 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %280 = spirv.CompositeInsert %arg7, %279[6 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %281 = spirv.IMul %205, %arg7 : i32
    %282 = spirv.CompositeInsert %281, %98[0 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %283 = spirv.CompositeInsert %281, %282[1 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %284 = spirv.CompositeInsert %281, %283[2 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %285 = spirv.CompositeInsert %281, %284[3 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %286 = spirv.CompositeInsert %281, %285[4 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %287 = spirv.CompositeInsert %281, %286[5 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %288 = spirv.CompositeInsert %281, %287[6 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %289 = spirv.IAdd %281, %167 : i32
    %290 = spirv.IAdd %281, %168 : i32
    %291 = spirv.IAdd %281, %169 : i32
    %292 = spirv.IAdd %281, %170 : i32
    %293 = spirv.IAdd %281, %171 : i32
    %294 = spirv.IAdd %281, %172 : i32
    %295 = spirv.IAdd %281, %173 : i32
    %296 = spirv.IAdd %281, %174 : i32
    %297 = spirv.CompositeInsert %289, %98[0 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %298 = spirv.CompositeInsert %290, %297[1 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %299 = spirv.CompositeInsert %291, %298[2 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %300 = spirv.CompositeInsert %292, %299[3 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %301 = spirv.CompositeInsert %293, %300[4 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %302 = spirv.CompositeInsert %294, %301[5 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %303 = spirv.CompositeInsert %295, %302[6 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %304 = spirv.CompositeInsert %arg1, %243[0 : i32] : !spirv.ptr<f16, CrossWorkgroup> into !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %305 = spirv.CompositeInsert %arg1, %304[1 : i32] : !spirv.ptr<f16, CrossWorkgroup> into !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %306 = spirv.CompositeInsert %arg1, %305[2 : i32] : !spirv.ptr<f16, CrossWorkgroup> into !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %307 = spirv.CompositeInsert %arg1, %306[3 : i32] : !spirv.ptr<f16, CrossWorkgroup> into !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %308 = spirv.CompositeInsert %arg1, %307[4 : i32] : !spirv.ptr<f16, CrossWorkgroup> into !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %309 = spirv.CompositeInsert %arg1, %308[5 : i32] : !spirv.ptr<f16, CrossWorkgroup> into !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %310 = spirv.CompositeInsert %arg1, %309[6 : i32] : !spirv.ptr<f16, CrossWorkgroup> into !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %311 = spirv.PtrAccessChain %arg1[%289] : !spirv.ptr<f16, CrossWorkgroup>, i32
    %312 = spirv.PtrAccessChain %arg1[%290] : !spirv.ptr<f16, CrossWorkgroup>, i32
    %313 = spirv.PtrAccessChain %arg1[%291] : !spirv.ptr<f16, CrossWorkgroup>, i32
    %314 = spirv.PtrAccessChain %arg1[%292] : !spirv.ptr<f16, CrossWorkgroup>, i32
    %315 = spirv.PtrAccessChain %arg1[%293] : !spirv.ptr<f16, CrossWorkgroup>, i32
    %316 = spirv.PtrAccessChain %arg1[%294] : !spirv.ptr<f16, CrossWorkgroup>, i32
    %317 = spirv.PtrAccessChain %arg1[%295] : !spirv.ptr<f16, CrossWorkgroup>, i32
    %318 = spirv.PtrAccessChain %arg1[%296] : !spirv.ptr<f16, CrossWorkgroup>, i32
    %319 = spirv.CompositeInsert %311, %243[0 : i32] : !spirv.ptr<f16, CrossWorkgroup> into !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %320 = spirv.CompositeInsert %312, %319[1 : i32] : !spirv.ptr<f16, CrossWorkgroup> into !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %321 = spirv.CompositeInsert %313, %320[2 : i32] : !spirv.ptr<f16, CrossWorkgroup> into !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %322 = spirv.CompositeInsert %314, %321[3 : i32] : !spirv.ptr<f16, CrossWorkgroup> into !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %323 = spirv.CompositeInsert %315, %322[4 : i32] : !spirv.ptr<f16, CrossWorkgroup> into !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %324 = spirv.CompositeInsert %316, %323[5 : i32] : !spirv.ptr<f16, CrossWorkgroup> into !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %325 = spirv.CompositeInsert %317, %324[6 : i32] : !spirv.ptr<f16, CrossWorkgroup> into !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %326 = spirv.CompositeInsert %318, %325[7 : i32] : !spirv.ptr<f16, CrossWorkgroup> into !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %327 = spirv.IAdd %arg5, %cst15_i32 : i32
    %328 = spirv.SDiv %327, %cst16_i32 : i32
    %329 = spirv.IMul %arg7, %cst16_i32 : i32
    %330 = spirv.CompositeInsert %329, %98[0 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %331 = spirv.CompositeInsert %329, %330[1 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %332 = spirv.CompositeInsert %329, %331[2 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %333 = spirv.CompositeInsert %329, %332[3 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %334 = spirv.CompositeInsert %329, %333[4 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %335 = spirv.CompositeInsert %329, %334[5 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %336 = spirv.CompositeInsert %329, %335[6 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %337 = spirv.SGreaterThan %328, %cst0_i32 : i32
    %338 = spirv.PtrAccessChain %arg9[%cst0_i32] : !spirv.ptr<i8, Workgroup>, i32
    %339 = spirv.Bitcast %338 : !spirv.ptr<i8, Workgroup> to !spirv.ptr<f16, Workgroup>
    %cst256_i32 = spirv.Constant 256 : i32
    %340 = spirv.Undef : !spirv.struct<(!spirv.ptr<f16, Workgroup>, i32, i32, i32, i32, i32, i32)>
    %341 = spirv.CompositeInsert %339, %340[0 : i32] : !spirv.ptr<f16, Workgroup> into !spirv.struct<(!spirv.ptr<f16, Workgroup>, i32, i32, i32, i32, i32, i32)>
    %342 = spirv.CompositeInsert %cst256_i32, %341[1 : i32] : i32 into !spirv.struct<(!spirv.ptr<f16, Workgroup>, i32, i32, i32, i32, i32, i32)>
    %343 = spirv.CompositeInsert %cst16_i32, %342[2 : i32] : i32 into !spirv.struct<(!spirv.ptr<f16, Workgroup>, i32, i32, i32, i32, i32, i32)>
    %344 = spirv.CompositeInsert %cst1_i32, %343[3 : i32] : i32 into !spirv.struct<(!spirv.ptr<f16, Workgroup>, i32, i32, i32, i32, i32, i32)>
    %345 = spirv.CompositeInsert %cst0_i32, %344[4 : i32] : i32 into !spirv.struct<(!spirv.ptr<f16, Workgroup>, i32, i32, i32, i32, i32, i32)>
    %346 = spirv.CompositeInsert %cst0_i32, %345[5 : i32] : i32 into !spirv.struct<(!spirv.ptr<f16, Workgroup>, i32, i32, i32, i32, i32, i32)>
    %347 = spirv.CompositeInsert %cst0_i32, %346[6 : i32] : i32 into !spirv.struct<(!spirv.ptr<f16, Workgroup>, i32, i32, i32, i32, i32, i32)>
    %348 = spirv.Undef : !spirv.struct<(i1, i1, i1, i1, i1, i1, i1, i1)>
    %349 = spirv.CompositeInsert %337, %348[0 : i32] : i1 into !spirv.struct<(i1, i1, i1, i1, i1, i1, i1, i1)>
    %350 = spirv.CompositeInsert %337, %349[1 : i32] : i1 into !spirv.struct<(i1, i1, i1, i1, i1, i1, i1, i1)>
    %351 = spirv.CompositeInsert %337, %350[2 : i32] : i1 into !spirv.struct<(i1, i1, i1, i1, i1, i1, i1, i1)>
    %352 = spirv.CompositeInsert %337, %351[3 : i32] : i1 into !spirv.struct<(i1, i1, i1, i1, i1, i1, i1, i1)>
    %353 = spirv.CompositeInsert %337, %352[4 : i32] : i1 into !spirv.struct<(i1, i1, i1, i1, i1, i1, i1, i1)>
    %354 = spirv.CompositeInsert %337, %353[5 : i32] : i1 into !spirv.struct<(i1, i1, i1, i1, i1, i1, i1, i1)>
    %355 = spirv.CompositeInsert %337, %354[6 : i32] : i1 into !spirv.struct<(i1, i1, i1, i1, i1, i1, i1, i1)>
    %356 = spirv.IMul %cst0_i32, %cst256_i32 : i32
    %357 = spirv.IAdd %cst0_i32, %356 : i32
    %358 = spirv.IMul %cst0_i32, %cst16_i32 : i32
    %359 = spirv.IAdd %357, %358 : i32
    %360 = spirv.IMul %cst0_i32, %cst1_i32 : i32
    %361 = spirv.IAdd %359, %360 : i32
    %362 = spirv.PtrAccessChain %339[%361] : !spirv.ptr<f16, Workgroup>, i32
    %363 = spirv.UDiv %62, %cst4_i32 : i32
    %364 = spirv.UMod %363, %cst2_i32 : i32
    %365 = spirv.IMul %62, %cst16_i32 : i32
    %366 = spirv.UDiv %67, %cst8_i32 : i32
    %367 = spirv.BitwiseXor %366, %364 : i32
    %368 = spirv.IMul %367, %cst8_i32 : i32
    %369 = spirv.UMod %67, %cst8_i32 : i32
    %370 = spirv.UDiv %369, %cst8_i32 : i32
    %371 = spirv.IMul %370, %cst8_i32 : i32
    %372 = spirv.IAdd %368, %371 : i32
    %373 = spirv.IMul %372, %cst1_i32 : i32
    %374 = spirv.IAdd %365, %373 : i32
    %375 = spirv.PtrAccessChain %362[%374] : !spirv.ptr<f16, Workgroup>, i32
    %376 = spirv.PtrAccessChain %375[%358] : !spirv.ptr<f16, Workgroup>, i32
    %377 = spirv.Bitcast %376 : !spirv.ptr<f16, Workgroup> to !spirv.ptr<vector<4xi32>, Workgroup>
    %378 = spirv.PtrAccessChain %377[%cst0_i32] : !spirv.ptr<vector<4xi32>, Workgroup>, i32
    %379 = spirv.Bitcast %251 : !spirv.ptr<f16, CrossWorkgroup> to !spirv.ptr<vector<4xi32>, CrossWorkgroup>
    %380 = spirv.Undef : vector<4xi32>
    %381 = spirv.CompositeInsert %cst0_i32, %380[0 : i32] : i32 into vector<4xi32>
    %382 = spirv.CompositeInsert %cst0_i32, %381[1 : i32] : i32 into vector<4xi32>
    %383 = spirv.CompositeInsert %cst0_i32, %382[2 : i32] : i32 into vector<4xi32>
    %384 = spirv.CompositeInsert %cst0_i32, %383[3 : i32] : i32 into vector<4xi32>
    spirv.BranchConditional %337, ^bb1, ^bb2(%384 : vector<4xi32>)
  ^bb1:  // pred: ^bb0
    %385 = spirv.Load "CrossWorkgroup" %379 : vector<4xi32>
    spirv.Branch ^bb2(%385 : vector<4xi32>)
  ^bb2(%386: vector<4xi32>):  // 2 preds: ^bb0, ^bb1
    spirv.Store "Workgroup" %378, %386 : vector<4xi32>
    %cst1024_i32 = spirv.Constant 1024 : i32
    %387 = spirv.PtrAccessChain %arg9[%cst1024_i32] : !spirv.ptr<i8, Workgroup>, i32
    %388 = spirv.Bitcast %387 : !spirv.ptr<i8, Workgroup> to !spirv.ptr<f16, Workgroup>
    %389 = spirv.CompositeInsert %388, %340[0 : i32] : !spirv.ptr<f16, Workgroup> into !spirv.struct<(!spirv.ptr<f16, Workgroup>, i32, i32, i32, i32, i32, i32)>
    %390 = spirv.CompositeInsert %cst256_i32, %389[1 : i32] : i32 into !spirv.struct<(!spirv.ptr<f16, Workgroup>, i32, i32, i32, i32, i32, i32)>
    %391 = spirv.CompositeInsert %cst16_i32, %390[2 : i32] : i32 into !spirv.struct<(!spirv.ptr<f16, Workgroup>, i32, i32, i32, i32, i32, i32)>
    %392 = spirv.CompositeInsert %cst1_i32, %391[3 : i32] : i32 into !spirv.struct<(!spirv.ptr<f16, Workgroup>, i32, i32, i32, i32, i32, i32)>
    %393 = spirv.CompositeInsert %cst0_i32, %392[4 : i32] : i32 into !spirv.struct<(!spirv.ptr<f16, Workgroup>, i32, i32, i32, i32, i32, i32)>
    %394 = spirv.CompositeInsert %cst0_i32, %393[5 : i32] : i32 into !spirv.struct<(!spirv.ptr<f16, Workgroup>, i32, i32, i32, i32, i32, i32)>
    %395 = spirv.CompositeInsert %cst0_i32, %394[6 : i32] : i32 into !spirv.struct<(!spirv.ptr<f16, Workgroup>, i32, i32, i32, i32, i32, i32)>
    %396 = spirv.PtrAccessChain %388[%361] : !spirv.ptr<f16, Workgroup>, i32
    %397 = spirv.PtrAccessChain %396[%374] : !spirv.ptr<f16, Workgroup>, i32
    %398 = spirv.PtrAccessChain %397[%358] : !spirv.ptr<f16, Workgroup>, i32
    %399 = spirv.Bitcast %398 : !spirv.ptr<f16, Workgroup> to !spirv.ptr<vector<4xi32>, Workgroup>
    %400 = spirv.PtrAccessChain %399[%cst0_i32] : !spirv.ptr<vector<4xi32>, Workgroup>, i32
    %401 = spirv.Bitcast %311 : !spirv.ptr<f16, CrossWorkgroup> to !spirv.ptr<vector<4xi32>, CrossWorkgroup>
    spirv.BranchConditional %337, ^bb3, ^bb4(%384 : vector<4xi32>)
  ^bb3:  // pred: ^bb2
    %402 = spirv.Load "CrossWorkgroup" %401 : vector<4xi32>
    spirv.Branch ^bb4(%402 : vector<4xi32>)
  ^bb4(%403: vector<4xi32>):  // 2 preds: ^bb2, ^bb3
    spirv.Store "Workgroup" %400, %403 : vector<4xi32>
    spirv.ControlBarrier <Workgroup>, <Workgroup>, <AcquireRelease|WorkgroupMemory>
    %404 = spirv.Undef : !spirv.struct<(!spirv.ptr<f16, Workgroup>, i32, i32, i32, i32)>
    %405 = spirv.CompositeInsert %362, %404[0 : i32] : !spirv.ptr<f16, Workgroup> into !spirv.struct<(!spirv.ptr<f16, Workgroup>, i32, i32, i32, i32)>
    %406 = spirv.CompositeInsert %cst16_i32, %405[1 : i32] : i32 into !spirv.struct<(!spirv.ptr<f16, Workgroup>, i32, i32, i32, i32)>
    %407 = spirv.CompositeInsert %cst1_i32, %406[2 : i32] : i32 into !spirv.struct<(!spirv.ptr<f16, Workgroup>, i32, i32, i32, i32)>
    %408 = spirv.CompositeInsert %cst0_i32, %407[3 : i32] : i32 into !spirv.struct<(!spirv.ptr<f16, Workgroup>, i32, i32, i32, i32)>
    %409 = spirv.CompositeInsert %cst0_i32, %408[4 : i32] : i32 into !spirv.struct<(!spirv.ptr<f16, Workgroup>, i32, i32, i32, i32)>
    %410 = spirv.CompositeInsert %396, %404[0 : i32] : !spirv.ptr<f16, Workgroup> into !spirv.struct<(!spirv.ptr<f16, Workgroup>, i32, i32, i32, i32)>
    %411 = spirv.CompositeInsert %cst16_i32, %410[1 : i32] : i32 into !spirv.struct<(!spirv.ptr<f16, Workgroup>, i32, i32, i32, i32)>
    %412 = spirv.CompositeInsert %cst1_i32, %411[2 : i32] : i32 into !spirv.struct<(!spirv.ptr<f16, Workgroup>, i32, i32, i32, i32)>
    %413 = spirv.CompositeInsert %cst0_i32, %412[3 : i32] : i32 into !spirv.struct<(!spirv.ptr<f16, Workgroup>, i32, i32, i32, i32)>
    %414 = spirv.CompositeInsert %cst0_i32, %413[4 : i32] : i32 into !spirv.struct<(!spirv.ptr<f16, Workgroup>, i32, i32, i32, i32)>
    spirv.Branch ^bb5(%cst0_i32, %97, %266, %326, %347, %395, %409, %414, %266, %326, %cst0_i32, %cst1_i32, %cst1_i32 : i32, !spirv.struct<(f32, f32, f32, f32, f32, f32, f32, f32)>, !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>, !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>, !spirv.struct<(!spirv.ptr<f16, Workgroup>, i32, i32, i32, i32, i32, i32)>, !spirv.struct<(!spirv.ptr<f16, Workgroup>, i32, i32, i32, i32, i32, i32)>, !spirv.struct<(!spirv.ptr<f16, Workgroup>, i32, i32, i32, i32)>, !spirv.struct<(!spirv.ptr<f16, Workgroup>, i32, i32, i32, i32)>, !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>, !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>, i32, i32, i32)
  ^bb5(%415: i32, %416: !spirv.struct<(f32, f32, f32, f32, f32, f32, f32, f32)>, %417: !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>, %418: !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>, %419: !spirv.struct<(!spirv.ptr<f16, Workgroup>, i32, i32, i32, i32, i32, i32)>, %420: !spirv.struct<(!spirv.ptr<f16, Workgroup>, i32, i32, i32, i32, i32, i32)>, %421: !spirv.struct<(!spirv.ptr<f16, Workgroup>, i32, i32, i32, i32)>, %422: !spirv.struct<(!spirv.ptr<f16, Workgroup>, i32, i32, i32, i32)>, %423: !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>, %424: !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>, %425: i32, %426: i32, %427: i32):  // 2 preds: ^bb4, ^bb10
    %428 = spirv.SLessThan %415, %328 : i32
    spirv.BranchConditional %428, ^bb6, ^bb11
  ^bb6:  // pred: ^bb5
    %429 = spirv.CompositeExtract %421[0 : i32] : !spirv.struct<(!spirv.ptr<f16, Workgroup>, i32, i32, i32, i32)>
    %430 = spirv.CompositeExtract %421[1 : i32] : !spirv.struct<(!spirv.ptr<f16, Workgroup>, i32, i32, i32, i32)>
    %431 = spirv.CompositeExtract %421[4 : i32] : !spirv.struct<(!spirv.ptr<f16, Workgroup>, i32, i32, i32, i32)>
    %432 = spirv.Load "Input" %__builtin_var_LocalInvocationId___addr : vector<3xi64>
    %433 = spirv.CompositeExtract %432[0 : i32] : vector<3xi64>
    %434 = spirv.SConvert %433 : i64 to i32
    %435 = spirv.UDiv %434, %cst32_i32 : i32
    %436 = spirv.UMod %434, %cst32_i32 : i32
    %437 = spirv.UMod %435, %cst1_i32 : i32
    %438 = spirv.UDiv %435, %cst1_i32 : i32
    %439 = spirv.UMod %438, %cst1_i32 : i32
    %440 = spirv.UMod %439, %cst1_i32 : i32
    %441 = spirv.UMod %436, %cst8_i32 : i32
    %442 = spirv.UDiv %436, %cst8_i32 : i32
    %443 = spirv.UMod %442, %cst2_i32 : i32
    %444 = spirv.UDiv %442, %cst2_i32 : i32
    %445 = spirv.IMul %440, %cst2_i32 : i32
    %446 = spirv.IAdd %445, %443 : i32
    %447 = spirv.UDiv %431, %cst8_i32 : i32
    %448 = spirv.UDiv %441, %cst4_i32 : i32
    %449 = spirv.UMod %448, %cst2_i32 : i32
    %450 = spirv.IMul %446, %cst8_i32 : i32
    %451 = spirv.IAdd %441, %450 : i32
    %452 = spirv.UMod %451, %cst16_i32 : i32
    %453 = spirv.IAdd %444, %447 : i32
    %454 = spirv.BitwiseXor %453, %449 : i32
    %455 = spirv.IMul %452, %430 : i32
    %456 = spirv.IMul %454, %cst8_i32 : i32
    %457 = spirv.IAdd %456, %455 : i32
    %458 = spirv.IAdd %444, %cst2_i32 : i32
    %459 = spirv.IAdd %458, %447 : i32
    %460 = spirv.BitwiseXor %459, %449 : i32
    %461 = spirv.IMul %460, %cst8_i32 : i32
    %462 = spirv.IAdd %461, %455 : i32
    %463 = spirv.ISub %cst0_i32, %431 : i32
    %464 = spirv.PtrAccessChain %429[%463] : !spirv.ptr<f16, Workgroup>, i32
    %465 = spirv.PtrAccessChain %464[%457] : !spirv.ptr<f16, Workgroup>, i32
    %466 = spirv.IMul %cst0_i32, %430 : i32
    %467 = spirv.PtrAccessChain %465[%466] : !spirv.ptr<f16, Workgroup>, i32
    %468 = spirv.Bitcast %467 : !spirv.ptr<f16, Workgroup> to !spirv.ptr<vector<4xi32>, Workgroup>
    %469 = spirv.Load "Workgroup" %468 : vector<4xi32>
    %470 = spirv.CompositeExtract %469[0 : i32] : vector<4xi32>
    %471 = spirv.CompositeExtract %469[1 : i32] : vector<4xi32>
    %472 = spirv.CompositeExtract %469[2 : i32] : vector<4xi32>
    %473 = spirv.CompositeExtract %469[3 : i32] : vector<4xi32>
    %474 = spirv.Undef : !spirv.struct<(i32, i32, i32, i32)>
    %475 = spirv.CompositeInsert %470, %474[0 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32)>
    %476 = spirv.CompositeInsert %472, %475[1 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32)>
    %477 = spirv.CompositeInsert %471, %476[2 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32)>
    %478 = spirv.CompositeExtract %422[0 : i32] : !spirv.struct<(!spirv.ptr<f16, Workgroup>, i32, i32, i32, i32)>
    %479 = spirv.CompositeExtract %422[1 : i32] : !spirv.struct<(!spirv.ptr<f16, Workgroup>, i32, i32, i32, i32)>
    %480 = spirv.CompositeExtract %422[4 : i32] : !spirv.struct<(!spirv.ptr<f16, Workgroup>, i32, i32, i32, i32)>
    %481 = spirv.Load "Input" %__builtin_var_LocalInvocationId___addr : vector<3xi64>
    %482 = spirv.CompositeExtract %481[0 : i32] : vector<3xi64>
    %483 = spirv.SConvert %482 : i64 to i32
    %484 = spirv.UDiv %483, %cst32_i32 : i32
    %485 = spirv.UMod %483, %cst32_i32 : i32
    %486 = spirv.UMod %484, %cst1_i32 : i32
    %487 = spirv.UDiv %484, %cst1_i32 : i32
    %488 = spirv.UMod %487, %cst1_i32 : i32
    %489 = spirv.UMod %486, %cst2_i32 : i32
    %490 = spirv.UMod %485, %cst8_i32 : i32
    %491 = spirv.UDiv %485, %cst8_i32 : i32
    %492 = spirv.UMod %491, %cst2_i32 : i32
    %493 = spirv.UDiv %491, %cst2_i32 : i32
    %494 = spirv.IAdd %489, %493 : i32
    %495 = spirv.UDiv %480, %cst8_i32 : i32
    %496 = spirv.UDiv %490, %cst4_i32 : i32
    %497 = spirv.UMod %496, %cst2_i32 : i32
    %498 = spirv.IMul %492, %cst8_i32 : i32
    %499 = spirv.IAdd %490, %498 : i32
    %500 = spirv.UMod %499, %cst16_i32 : i32
    %501 = spirv.IAdd %494, %495 : i32
    %502 = spirv.BitwiseXor %501, %497 : i32
    %503 = spirv.IMul %500, %479 : i32
    %504 = spirv.IMul %502, %cst8_i32 : i32
    %505 = spirv.IAdd %504, %503 : i32
    %506 = spirv.IAdd %494, %cst1_i32 : i32
    %507 = spirv.IAdd %506, %495 : i32
    %508 = spirv.BitwiseXor %507, %497 : i32
    %509 = spirv.IMul %508, %cst8_i32 : i32
    %510 = spirv.IAdd %509, %503 : i32
    %511 = spirv.ISub %cst0_i32, %480 : i32
    %512 = spirv.PtrAccessChain %478[%511] : !spirv.ptr<f16, Workgroup>, i32
    %513 = spirv.PtrAccessChain %512[%505] : !spirv.ptr<f16, Workgroup>, i32
    %514 = spirv.IMul %cst0_i32, %479 : i32
    %515 = spirv.PtrAccessChain %513[%514] : !spirv.ptr<f16, Workgroup>, i32
    %516 = spirv.Bitcast %515 : !spirv.ptr<f16, Workgroup> to !spirv.ptr<vector<4xi32>, Workgroup>
    %517 = spirv.Load "Workgroup" %516 : vector<4xi32>
    %518 = spirv.CompositeExtract %517[0 : i32] : vector<4xi32>
    %519 = spirv.CompositeExtract %517[1 : i32] : vector<4xi32>
    %520 = spirv.CompositeExtract %517[2 : i32] : vector<4xi32>
    %521 = spirv.CompositeExtract %517[3 : i32] : vector<4xi32>
    %522 = spirv.CompositeInsert %518, %474[0 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32)>
    %523 = spirv.CompositeInsert %519, %522[1 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32)>
    %524 = spirv.CompositeInsert %520, %523[2 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32)>
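    // Take the address of @SIMDwrapper as a function pointer (CodeSectionINTEL storage class,
    // as used by SPV_INTEL_function_pointers) and pass it to the invoke_simd builtin below.
    %SIMDwrapper_addr = spirv.mlir.addressof @SIMDwrapper : !spirv.ptr<(vector<16xi32>) -> vector<16xi32>, CodeSectionINTEL>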
    %525 = spirv.FunctionCall @_Z33__regcall3____builtin_invoke_simdSIMDwrapper(%SIMDwrapper_addr, %cst0_i32) : (!spirv.ptr<(vector<16xi32>) -> vector<16xi32>, CodeSectionINTEL>, i32) -> i32
    %526 = spirv.Undef : f32
    %527 = spirv.FunctionCall @_Z33__regcall3____builtin_invoke_simdSIMDwrapper(%SIMDwrapper_addr, %cst0_i32) : (!spirv.ptr<(vector<16xi32>) -> vector<16xi32>, CodeSectionINTEL>, i32) -> i32
    %528 = spirv.CompositeInsert %526, %89[0 : i32] : f32 into !spirv.struct<(f32, f32, f32, f32, f32, f32, f32, f32)>
    %529 = spirv.CompositeInsert %526, %528[1 : i32] : f32 into !spirv.struct<(f32, f32, f32, f32, f32, f32, f32, f32)>
    %530 = spirv.CompositeInsert %526, %529[2 : i32] : f32 into !spirv.struct<(f32, f32, f32, f32, f32, f32, f32, f32)>
    %531 = spirv.CompositeInsert %526, %530[3 : i32] : f32 into !spirv.struct<(f32, f32, f32, f32, f32, f32, f32, f32)>
    %532 = spirv.CompositeInsert %526, %531[4 : i32] : f32 into !spirv.struct<(f32, f32, f32, f32, f32, f32, f32, f32)>
    %533 = spirv.CompositeInsert %526, %532[5 : i32] : f32 into !spirv.struct<(f32, f32, f32, f32, f32, f32, f32, f32)>
    %534 = spirv.CompositeInsert %526, %533[6 : i32] : f32 into !spirv.struct<(f32, f32, f32, f32, f32, f32, f32, f32)>
    %535 = spirv.CompositeInsert %526, %534[7 : i32] : f32 into !spirv.struct<(f32, f32, f32, f32, f32, f32, f32, f32)>
    %536 = spirv.CompositeExtract %417[0 : i32] : !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %537 = spirv.CompositeExtract %417[1 : i32] : !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %538 = spirv.CompositeExtract %417[2 : i32] : !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %539 = spirv.CompositeExtract %417[3 : i32] : !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %540 = spirv.CompositeExtract %417[4 : i32] : !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %541 = spirv.CompositeExtract %417[5 : i32] : !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %542 = spirv.CompositeExtract %417[6 : i32] : !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %543 = spirv.CompositeExtract %417[7 : i32] : !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %544 = spirv.PtrAccessChain %536[%cst16_i32] : !spirv.ptr<f16, CrossWorkgroup>, i32
    %545 = spirv.PtrAccessChain %537[%cst16_i32] : !spirv.ptr<f16, CrossWorkgroup>, i32
    %546 = spirv.PtrAccessChain %538[%cst16_i32] : !spirv.ptr<f16, CrossWorkgroup>, i32
    %547 = spirv.PtrAccessChain %539[%cst16_i32] : !spirv.ptr<f16, CrossWorkgroup>, i32
    %548 = spirv.PtrAccessChain %540[%cst16_i32] : !spirv.ptr<f16, CrossWorkgroup>, i32
    %549 = spirv.PtrAccessChain %541[%cst16_i32] : !spirv.ptr<f16, CrossWorkgroup>, i32
    %550 = spirv.PtrAccessChain %542[%cst16_i32] : !spirv.ptr<f16, CrossWorkgroup>, i32
    %551 = spirv.PtrAccessChain %543[%cst16_i32] : !spirv.ptr<f16, CrossWorkgroup>, i32
    %552 = spirv.CompositeInsert %544, %243[0 : i32] : !spirv.ptr<f16, CrossWorkgroup> into !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %553 = spirv.CompositeInsert %545, %552[1 : i32] : !spirv.ptr<f16, CrossWorkgroup> into !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %554 = spirv.CompositeInsert %546, %553[2 : i32] : !spirv.ptr<f16, CrossWorkgroup> into !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %555 = spirv.CompositeInsert %547, %554[3 : i32] : !spirv.ptr<f16, CrossWorkgroup> into !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %556 = spirv.CompositeInsert %548, %555[4 : i32] : !spirv.ptr<f16, CrossWorkgroup> into !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %557 = spirv.CompositeInsert %549, %556[5 : i32] : !spirv.ptr<f16, CrossWorkgroup> into !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %558 = spirv.CompositeInsert %550, %557[6 : i32] : !spirv.ptr<f16, CrossWorkgroup> into !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %559 = spirv.CompositeInsert %551, %558[7 : i32] : !spirv.ptr<f16, CrossWorkgroup> into !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %560 = spirv.CompositeExtract %418[0 : i32] : !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %561 = spirv.CompositeExtract %418[1 : i32] : !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %562 = spirv.CompositeExtract %418[2 : i32] : !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %563 = spirv.CompositeExtract %418[3 : i32] : !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %564 = spirv.CompositeExtract %418[4 : i32] : !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %565 = spirv.CompositeExtract %418[5 : i32] : !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %566 = spirv.CompositeExtract %418[6 : i32] : !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %567 = spirv.CompositeExtract %418[7 : i32] : !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %568 = spirv.PtrAccessChain %560[%329] : !spirv.ptr<f16, CrossWorkgroup>, i32
    %569 = spirv.PtrAccessChain %561[%329] : !spirv.ptr<f16, CrossWorkgroup>, i32
    %570 = spirv.PtrAccessChain %562[%329] : !spirv.ptr<f16, CrossWorkgroup>, i32
    %571 = spirv.PtrAccessChain %563[%329] : !spirv.ptr<f16, CrossWorkgroup>, i32
    %572 = spirv.PtrAccessChain %564[%329] : !spirv.ptr<f16, CrossWorkgroup>, i32
    %573 = spirv.PtrAccessChain %565[%329] : !spirv.ptr<f16, CrossWorkgroup>, i32
    %574 = spirv.PtrAccessChain %566[%329] : !spirv.ptr<f16, CrossWorkgroup>, i32
    %575 = spirv.PtrAccessChain %567[%329] : !spirv.ptr<f16, CrossWorkgroup>, i32
    %576 = spirv.CompositeInsert %568, %243[0 : i32] : !spirv.ptr<f16, CrossWorkgroup> into !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %577 = spirv.CompositeInsert %569, %576[1 : i32] : !spirv.ptr<f16, CrossWorkgroup> into !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %578 = spirv.CompositeInsert %570, %577[2 : i32] : !spirv.ptr<f16, CrossWorkgroup> into !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %579 = spirv.CompositeInsert %571, %578[3 : i32] : !spirv.ptr<f16, CrossWorkgroup> into !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %580 = spirv.CompositeInsert %572, %579[4 : i32] : !spirv.ptr<f16, CrossWorkgroup> into !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %581 = spirv.CompositeInsert %573, %580[5 : i32] : !spirv.ptr<f16, CrossWorkgroup> into !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %582 = spirv.CompositeInsert %574, %581[6 : i32] : !spirv.ptr<f16, CrossWorkgroup> into !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %583 = spirv.CompositeInsert %575, %582[7 : i32] : !spirv.ptr<f16, CrossWorkgroup> into !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %584 = spirv.IAdd %425, %cst1_i32 : i32
    %585 = spirv.SLessThan %584, %328 : i32
    %586 = spirv.SRem %426, %cst2_i32 : i32
    %587 = spirv.SRem %427, %cst2_i32 : i32
    %588 = spirv.CompositeExtract %423[0 : i32] : !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %589 = spirv.CompositeExtract %423[1 : i32] : !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %590 = spirv.CompositeExtract %423[2 : i32] : !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %591 = spirv.CompositeExtract %423[3 : i32] : !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %592 = spirv.CompositeExtract %423[4 : i32] : !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %593 = spirv.CompositeExtract %423[5 : i32] : !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %594 = spirv.CompositeExtract %423[6 : i32] : !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %595 = spirv.CompositeExtract %423[7 : i32] : !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %596 = spirv.PtrAccessChain %588[%cst16_i32] : !spirv.ptr<f16, CrossWorkgroup>, i32
    %597 = spirv.PtrAccessChain %589[%cst16_i32] : !spirv.ptr<f16, CrossWorkgroup>, i32
    %598 = spirv.PtrAccessChain %590[%cst16_i32] : !spirv.ptr<f16, CrossWorkgroup>, i32
    %599 = spirv.PtrAccessChain %591[%cst16_i32] : !spirv.ptr<f16, CrossWorkgroup>, i32
    %600 = spirv.PtrAccessChain %592[%cst16_i32] : !spirv.ptr<f16, CrossWorkgroup>, i32
    %601 = spirv.PtrAccessChain %593[%cst16_i32] : !spirv.ptr<f16, CrossWorkgroup>, i32
    %602 = spirv.PtrAccessChain %594[%cst16_i32] : !spirv.ptr<f16, CrossWorkgroup>, i32
    %603 = spirv.PtrAccessChain %595[%cst16_i32] : !spirv.ptr<f16, CrossWorkgroup>, i32
    %604 = spirv.CompositeInsert %596, %243[0 : i32] : !spirv.ptr<f16, CrossWorkgroup> into !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %605 = spirv.CompositeInsert %597, %604[1 : i32] : !spirv.ptr<f16, CrossWorkgroup> into !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %606 = spirv.CompositeInsert %598, %605[2 : i32] : !spirv.ptr<f16, CrossWorkgroup> into !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %607 = spirv.CompositeInsert %599, %606[3 : i32] : !spirv.ptr<f16, CrossWorkgroup> into !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %608 = spirv.CompositeInsert %600, %607[4 : i32] : !spirv.ptr<f16, CrossWorkgroup> into !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %609 = spirv.CompositeInsert %601, %608[5 : i32] : !spirv.ptr<f16, CrossWorkgroup> into !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %610 = spirv.CompositeInsert %602, %609[6 : i32] : !spirv.ptr<f16, CrossWorkgroup> into !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %611 = spirv.CompositeInsert %603, %610[7 : i32] : !spirv.ptr<f16, CrossWorkgroup> into !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %612 = spirv.CompositeExtract %424[0 : i32] : !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %613 = spirv.CompositeExtract %424[1 : i32] : !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %614 = spirv.CompositeExtract %424[2 : i32] : !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %615 = spirv.CompositeExtract %424[3 : i32] : !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %616 = spirv.CompositeExtract %424[4 : i32] : !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %617 = spirv.CompositeExtract %424[5 : i32] : !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %618 = spirv.CompositeExtract %424[6 : i32] : !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %619 = spirv.CompositeExtract %424[7 : i32] : !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %620 = spirv.PtrAccessChain %612[%329] : !spirv.ptr<f16, CrossWorkgroup>, i32
    %621 = spirv.PtrAccessChain %613[%329] : !spirv.ptr<f16, CrossWorkgroup>, i32
    %622 = spirv.PtrAccessChain %614[%329] : !spirv.ptr<f16, CrossWorkgroup>, i32
    %623 = spirv.PtrAccessChain %615[%329] : !spirv.ptr<f16, CrossWorkgroup>, i32
    %624 = spirv.PtrAccessChain %616[%329] : !spirv.ptr<f16, CrossWorkgroup>, i32
    %625 = spirv.PtrAccessChain %617[%329] : !spirv.ptr<f16, CrossWorkgroup>, i32
    %626 = spirv.PtrAccessChain %618[%329] : !spirv.ptr<f16, CrossWorkgroup>, i32
    %627 = spirv.PtrAccessChain %619[%329] : !spirv.ptr<f16, CrossWorkgroup>, i32
    %628 = spirv.CompositeInsert %620, %243[0 : i32] : !spirv.ptr<f16, CrossWorkgroup> into !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %629 = spirv.CompositeInsert %621, %628[1 : i32] : !spirv.ptr<f16, CrossWorkgroup> into !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %630 = spirv.CompositeInsert %622, %629[2 : i32] : !spirv.ptr<f16, CrossWorkgroup> into !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %631 = spirv.CompositeInsert %623, %630[3 : i32] : !spirv.ptr<f16, CrossWorkgroup> into !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %632 = spirv.CompositeInsert %624, %631[4 : i32] : !spirv.ptr<f16, CrossWorkgroup> into !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %633 = spirv.CompositeInsert %625, %632[5 : i32] : !spirv.ptr<f16, CrossWorkgroup> into !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %634 = spirv.CompositeInsert %626, %633[6 : i32] : !spirv.ptr<f16, CrossWorkgroup> into !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %635 = spirv.CompositeInsert %627, %634[7 : i32] : !spirv.ptr<f16, CrossWorkgroup> into !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %636 = spirv.CompositeInsert %585, %348[0 : i32] : i1 into !spirv.struct<(i1, i1, i1, i1, i1, i1, i1, i1)>
    %637 = spirv.CompositeInsert %585, %636[1 : i32] : i1 into !spirv.struct<(i1, i1, i1, i1, i1, i1, i1, i1)>
    %638 = spirv.CompositeInsert %585, %637[2 : i32] : i1 into !spirv.struct<(i1, i1, i1, i1, i1, i1, i1, i1)>
    %639 = spirv.CompositeInsert %585, %638[3 : i32] : i1 into !spirv.struct<(i1, i1, i1, i1, i1, i1, i1, i1)>
    %640 = spirv.CompositeInsert %585, %639[4 : i32] : i1 into !spirv.struct<(i1, i1, i1, i1, i1, i1, i1, i1)>
    %641 = spirv.CompositeInsert %585, %640[5 : i32] : i1 into !spirv.struct<(i1, i1, i1, i1, i1, i1, i1, i1)>
    %642 = spirv.CompositeInsert %585, %641[6 : i32] : i1 into !spirv.struct<(i1, i1, i1, i1, i1, i1, i1, i1)>
    spirv.ControlBarrier <Workgroup>, <Workgroup>, <AcquireRelease|WorkgroupMemory>
    %643 = spirv.CompositeExtract %419[0 : i32] : !spirv.struct<(!spirv.ptr<f16, Workgroup>, i32, i32, i32, i32, i32, i32)>
    %644 = spirv.CompositeExtract %419[1 : i32] : !spirv.struct<(!spirv.ptr<f16, Workgroup>, i32, i32, i32, i32, i32, i32)>
    %645 = spirv.CompositeExtract %419[2 : i32] : !spirv.struct<(!spirv.ptr<f16, Workgroup>, i32, i32, i32, i32, i32, i32)>
    %646 = spirv.CompositeExtract %419[3 : i32] : !spirv.struct<(!spirv.ptr<f16, Workgroup>, i32, i32, i32, i32, i32, i32)>
    %647 = spirv.IMul %586, %644 : i32
    %648 = spirv.IAdd %cst0_i32, %647 : i32
    %649 = spirv.IMul %cst0_i32, %645 : i32
    %650 = spirv.IAdd %648, %649 : i32
    %651 = spirv.IMul %cst0_i32, %646 : i32
    %652 = spirv.IAdd %650, %651 : i32
    %653 = spirv.PtrAccessChain %643[%652] : !spirv.ptr<f16, Workgroup>, i32
    %654 = spirv.IMul %62, %645 : i32
    %655 = spirv.IMul %372, %646 : i32
    %656 = spirv.IAdd %654, %655 : i32
    %657 = spirv.PtrAccessChain %653[%656] : !spirv.ptr<f16, Workgroup>, i32
    %658 = spirv.PtrAccessChain %657[%649] : !spirv.ptr<f16, Workgroup>, i32
    %659 = spirv.Bitcast %658 : !spirv.ptr<f16, Workgroup> to !spirv.ptr<vector<4xi32>, Workgroup>
    %660 = spirv.PtrAccessChain %659[%cst0_i32] : !spirv.ptr<vector<4xi32>, Workgroup>, i32
    %661 = spirv.Bitcast %596 : !spirv.ptr<f16, CrossWorkgroup> to !spirv.ptr<vector<4xi32>, CrossWorkgroup>
    spirv.BranchConditional %585, ^bb7, ^bb8(%384 : vector<4xi32>)
  ^bb7:  // pred: ^bb6
    %662 = spirv.Load "CrossWorkgroup" %661 : vector<4xi32>
    spirv.Branch ^bb8(%662 : vector<4xi32>)
  ^bb8(%663: vector<4xi32>):  // 2 preds: ^bb6, ^bb7
    spirv.Store "Workgroup" %660, %663 : vector<4xi32>
    %664 = spirv.CompositeExtract %420[0 : i32] : !spirv.struct<(!spirv.ptr<f16, Workgroup>, i32, i32, i32, i32, i32, i32)>
    %665 = spirv.CompositeExtract %420[1 : i32] : !spirv.struct<(!spirv.ptr<f16, Workgroup>, i32, i32, i32, i32, i32, i32)>
    %666 = spirv.CompositeExtract %420[2 : i32] : !spirv.struct<(!spirv.ptr<f16, Workgroup>, i32, i32, i32, i32, i32, i32)>
    %667 = spirv.CompositeExtract %420[3 : i32] : !spirv.struct<(!spirv.ptr<f16, Workgroup>, i32, i32, i32, i32, i32, i32)>
    %668 = spirv.IMul %586, %665 : i32
    %669 = spirv.IAdd %cst0_i32, %668 : i32
    %670 = spirv.IMul %cst0_i32, %666 : i32
    %671 = spirv.IAdd %669, %670 : i32
    %672 = spirv.IMul %cst0_i32, %667 : i32
    %673 = spirv.IAdd %671, %672 : i32
    %674 = spirv.PtrAccessChain %664[%673] : !spirv.ptr<f16, Workgroup>, i32
    %675 = spirv.IMul %62, %666 : i32
    %676 = spirv.IMul %372, %667 : i32
    %677 = spirv.IAdd %675, %676 : i32
    %678 = spirv.PtrAccessChain %674[%677] : !spirv.ptr<f16, Workgroup>, i32
    %679 = spirv.PtrAccessChain %678[%670] : !spirv.ptr<f16, Workgroup>, i32
    %680 = spirv.Bitcast %679 : !spirv.ptr<f16, Workgroup> to !spirv.ptr<vector<4xi32>, Workgroup>
    %681 = spirv.PtrAccessChain %680[%cst0_i32] : !spirv.ptr<vector<4xi32>, Workgroup>, i32
    %682 = spirv.Bitcast %620 : !spirv.ptr<f16, CrossWorkgroup> to !spirv.ptr<vector<4xi32>, CrossWorkgroup>
    spirv.BranchConditional %585, ^bb9, ^bb10(%384 : vector<4xi32>)
  ^bb9:  // pred: ^bb8
    %683 = spirv.Load "CrossWorkgroup" %682 : vector<4xi32>
    spirv.Branch ^bb10(%683 : vector<4xi32>)
  ^bb10(%684: vector<4xi32>):  // 2 preds: ^bb8, ^bb9
    spirv.Store "Workgroup" %681, %684 : vector<4xi32>
    spirv.ControlBarrier <Workgroup>, <Workgroup>, <AcquireRelease|WorkgroupMemory>
    %685 = spirv.CompositeExtract %419[4 : i32] : !spirv.struct<(!spirv.ptr<f16, Workgroup>, i32, i32, i32, i32, i32, i32)>
    %686 = spirv.CompositeExtract %419[5 : i32] : !spirv.struct<(!spirv.ptr<f16, Workgroup>, i32, i32, i32, i32, i32, i32)>
    %687 = spirv.CompositeExtract %419[6 : i32] : !spirv.struct<(!spirv.ptr<f16, Workgroup>, i32, i32, i32, i32, i32, i32)>
    %688 = spirv.IMul %587, %644 : i32
    %689 = spirv.IAdd %cst0_i32, %688 : i32
    %690 = spirv.IAdd %689, %649 : i32
    %691 = spirv.IAdd %690, %651 : i32
    %692 = spirv.PtrAccessChain %643[%691] : !spirv.ptr<f16, Workgroup>, i32
    %693 = spirv.CompositeInsert %692, %404[0 : i32] : !spirv.ptr<f16, Workgroup> into !spirv.struct<(!spirv.ptr<f16, Workgroup>, i32, i32, i32, i32)>
    %694 = spirv.CompositeInsert %645, %693[1 : i32] : i32 into !spirv.struct<(!spirv.ptr<f16, Workgroup>, i32, i32, i32, i32)>
    %695 = spirv.CompositeInsert %646, %694[2 : i32] : i32 into !spirv.struct<(!spirv.ptr<f16, Workgroup>, i32, i32, i32, i32)>
    %696 = spirv.CompositeInsert %686, %695[3 : i32] : i32 into !spirv.struct<(!spirv.ptr<f16, Workgroup>, i32, i32, i32, i32)>
    %697 = spirv.CompositeInsert %687, %696[4 : i32] : i32 into !spirv.struct<(!spirv.ptr<f16, Workgroup>, i32, i32, i32, i32)>
    %698 = spirv.CompositeExtract %420[4 : i32] : !spirv.struct<(!spirv.ptr<f16, Workgroup>, i32, i32, i32, i32, i32, i32)>
    %699 = spirv.CompositeExtract %420[5 : i32] : !spirv.struct<(!spirv.ptr<f16, Workgroup>, i32, i32, i32, i32, i32, i32)>
    %700 = spirv.CompositeExtract %420[6 : i32] : !spirv.struct<(!spirv.ptr<f16, Workgroup>, i32, i32, i32, i32, i32, i32)>
    %701 = spirv.IMul %587, %665 : i32
    %702 = spirv.IAdd %cst0_i32, %701 : i32
    %703 = spirv.IAdd %702, %670 : i32
    %704 = spirv.IAdd %703, %672 : i32
    %705 = spirv.PtrAccessChain %664[%704] : !spirv.ptr<f16, Workgroup>, i32
    %706 = spirv.CompositeInsert %705, %404[0 : i32] : !spirv.ptr<f16, Workgroup> into !spirv.struct<(!spirv.ptr<f16, Workgroup>, i32, i32, i32, i32)>
    %707 = spirv.CompositeInsert %666, %706[1 : i32] : i32 into !spirv.struct<(!spirv.ptr<f16, Workgroup>, i32, i32, i32, i32)>
    %708 = spirv.CompositeInsert %667, %707[2 : i32] : i32 into !spirv.struct<(!spirv.ptr<f16, Workgroup>, i32, i32, i32, i32)>
    %709 = spirv.CompositeInsert %699, %708[3 : i32] : i32 into !spirv.struct<(!spirv.ptr<f16, Workgroup>, i32, i32, i32, i32)>
    %710 = spirv.CompositeInsert %700, %709[4 : i32] : i32 into !spirv.struct<(!spirv.ptr<f16, Workgroup>, i32, i32, i32, i32)>
    %711 = spirv.IAdd %426, %cst1_i32 : i32
    %712 = spirv.IAdd %427, %cst1_i32 : i32
    %713 = spirv.IAdd %415, %cst1_i32 : i32
    spirv.Branch ^bb5(%713, %535, %559, %583, %419, %420, %697, %710, %611, %635, %584, %711, %712 : i32, !spirv.struct<(f32, f32, f32, f32, f32, f32, f32, f32)>, !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>, !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>, !spirv.struct<(!spirv.ptr<f16, Workgroup>, i32, i32, i32, i32, i32, i32)>, !spirv.struct<(!spirv.ptr<f16, Workgroup>, i32, i32, i32, i32, i32, i32)>, !spirv.struct<(!spirv.ptr<f16, Workgroup>, i32, i32, i32, i32)>, !spirv.struct<(!spirv.ptr<f16, Workgroup>, i32, i32, i32, i32)>, !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>, !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>, i32, i32, i32)
  ^bb11:  // pred: ^bb5
    spirv.ControlBarrier <Workgroup>, <Workgroup>, <AcquireRelease|WorkgroupMemory>
    %714 = spirv.CompositeExtract %416[0 : i32] : !spirv.struct<(f32, f32, f32, f32, f32, f32, f32, f32)>
    %715 = spirv.CompositeExtract %416[1 : i32] : !spirv.struct<(f32, f32, f32, f32, f32, f32, f32, f32)>
    %716 = spirv.CompositeExtract %416[2 : i32] : !spirv.struct<(f32, f32, f32, f32, f32, f32, f32, f32)>
    %717 = spirv.CompositeExtract %416[3 : i32] : !spirv.struct<(f32, f32, f32, f32, f32, f32, f32, f32)>
    %718 = spirv.CompositeExtract %416[4 : i32] : !spirv.struct<(f32, f32, f32, f32, f32, f32, f32, f32)>
    %719 = spirv.CompositeExtract %416[5 : i32] : !spirv.struct<(f32, f32, f32, f32, f32, f32, f32, f32)>
    %720 = spirv.CompositeExtract %416[6 : i32] : !spirv.struct<(f32, f32, f32, f32, f32, f32, f32, f32)>
    %721 = spirv.CompositeExtract %416[7 : i32] : !spirv.struct<(f32, f32, f32, f32, f32, f32, f32, f32)>
    %722 = spirv.FConvert %714 : f32 to f16
    %723 = spirv.FConvert %715 : f32 to f16
    %724 = spirv.FConvert %716 : f32 to f16
    %725 = spirv.FConvert %717 : f32 to f16
    %726 = spirv.FConvert %718 : f32 to f16
    %727 = spirv.FConvert %719 : f32 to f16
    %728 = spirv.FConvert %720 : f32 to f16
    %729 = spirv.FConvert %721 : f32 to f16
    %730 = spirv.Undef : !spirv.struct<(f16, f16, f16, f16, f16, f16, f16, f16)>
    %731 = spirv.CompositeInsert %722, %730[0 : i32] : f16 into !spirv.struct<(f16, f16, f16, f16, f16, f16, f16, f16)>
    %732 = spirv.CompositeInsert %723, %731[1 : i32] : f16 into !spirv.struct<(f16, f16, f16, f16, f16, f16, f16, f16)>
    %733 = spirv.CompositeInsert %724, %732[2 : i32] : f16 into !spirv.struct<(f16, f16, f16, f16, f16, f16, f16, f16)>
    %734 = spirv.CompositeInsert %725, %733[3 : i32] : f16 into !spirv.struct<(f16, f16, f16, f16, f16, f16, f16, f16)>
    %735 = spirv.CompositeInsert %726, %734[4 : i32] : f16 into !spirv.struct<(f16, f16, f16, f16, f16, f16, f16, f16)>
    %736 = spirv.CompositeInsert %727, %735[5 : i32] : f16 into !spirv.struct<(f16, f16, f16, f16, f16, f16, f16, f16)>
    %737 = spirv.CompositeInsert %728, %736[6 : i32] : f16 into !spirv.struct<(f16, f16, f16, f16, f16, f16, f16, f16)>
    %738 = spirv.CompositeInsert %135, %98[0 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %739 = spirv.CompositeInsert %135, %738[1 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %740 = spirv.CompositeInsert %135, %739[2 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %741 = spirv.CompositeInsert %135, %740[3 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %742 = spirv.CompositeInsert %135, %741[4 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %743 = spirv.CompositeInsert %135, %742[5 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %744 = spirv.CompositeInsert %135, %743[6 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %745 = spirv.CompositeInsert %arg8, %98[0 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %746 = spirv.CompositeInsert %arg8, %745[1 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %747 = spirv.CompositeInsert %arg8, %746[2 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %748 = spirv.CompositeInsert %arg8, %747[3 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %749 = spirv.CompositeInsert %arg8, %748[4 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %750 = spirv.CompositeInsert %arg8, %749[5 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %751 = spirv.CompositeInsert %arg8, %750[6 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %752 = spirv.IMul %135, %arg8 : i32
    %753 = spirv.CompositeInsert %752, %98[0 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %754 = spirv.CompositeInsert %752, %753[1 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %755 = spirv.CompositeInsert %752, %754[2 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %756 = spirv.CompositeInsert %752, %755[3 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %757 = spirv.CompositeInsert %752, %756[4 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %758 = spirv.CompositeInsert %752, %757[5 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %759 = spirv.CompositeInsert %752, %758[6 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %760 = spirv.IAdd %752, %144 : i32
    %761 = spirv.IAdd %752, %145 : i32
    %762 = spirv.IAdd %752, %146 : i32
    %763 = spirv.IAdd %752, %147 : i32
    %764 = spirv.IAdd %752, %148 : i32
    %765 = spirv.IAdd %752, %149 : i32
    %766 = spirv.IAdd %752, %150 : i32
    %767 = spirv.IAdd %752, %151 : i32
    %768 = spirv.CompositeInsert %760, %98[0 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %769 = spirv.CompositeInsert %761, %768[1 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %770 = spirv.CompositeInsert %762, %769[2 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %771 = spirv.CompositeInsert %763, %770[3 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %772 = spirv.CompositeInsert %764, %771[4 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %773 = spirv.CompositeInsert %765, %772[5 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %774 = spirv.CompositeInsert %766, %773[6 : i32] : i32 into !spirv.struct<(i32, i32, i32, i32, i32, i32, i32, i32)>
    %775 = spirv.CompositeInsert %arg2, %243[0 : i32] : !spirv.ptr<f16, CrossWorkgroup> into !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %776 = spirv.CompositeInsert %arg2, %775[1 : i32] : !spirv.ptr<f16, CrossWorkgroup> into !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %777 = spirv.CompositeInsert %arg2, %776[2 : i32] : !spirv.ptr<f16, CrossWorkgroup> into !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %778 = spirv.CompositeInsert %arg2, %777[3 : i32] : !spirv.ptr<f16, CrossWorkgroup> into !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %779 = spirv.CompositeInsert %arg2, %778[4 : i32] : !spirv.ptr<f16, CrossWorkgroup> into !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %780 = spirv.CompositeInsert %arg2, %779[5 : i32] : !spirv.ptr<f16, CrossWorkgroup> into !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %781 = spirv.CompositeInsert %arg2, %780[6 : i32] : !spirv.ptr<f16, CrossWorkgroup> into !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %782 = spirv.PtrAccessChain %arg2[%760] : !spirv.ptr<f16, CrossWorkgroup>, i32
    %783 = spirv.PtrAccessChain %arg2[%761] : !spirv.ptr<f16, CrossWorkgroup>, i32
    %784 = spirv.PtrAccessChain %arg2[%762] : !spirv.ptr<f16, CrossWorkgroup>, i32
    %785 = spirv.PtrAccessChain %arg2[%763] : !spirv.ptr<f16, CrossWorkgroup>, i32
    %786 = spirv.PtrAccessChain %arg2[%764] : !spirv.ptr<f16, CrossWorkgroup>, i32
    %787 = spirv.PtrAccessChain %arg2[%765] : !spirv.ptr<f16, CrossWorkgroup>, i32
    %788 = spirv.PtrAccessChain %arg2[%766] : !spirv.ptr<f16, CrossWorkgroup>, i32
    %789 = spirv.PtrAccessChain %arg2[%767] : !spirv.ptr<f16, CrossWorkgroup>, i32
    %790 = spirv.CompositeInsert %782, %243[0 : i32] : !spirv.ptr<f16, CrossWorkgroup> into !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %791 = spirv.CompositeInsert %783, %790[1 : i32] : !spirv.ptr<f16, CrossWorkgroup> into !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %792 = spirv.CompositeInsert %784, %791[2 : i32] : !spirv.ptr<f16, CrossWorkgroup> into !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %793 = spirv.CompositeInsert %785, %792[3 : i32] : !spirv.ptr<f16, CrossWorkgroup> into !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %794 = spirv.CompositeInsert %786, %793[4 : i32] : !spirv.ptr<f16, CrossWorkgroup> into !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %795 = spirv.CompositeInsert %787, %794[5 : i32] : !spirv.ptr<f16, CrossWorkgroup> into !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %796 = spirv.CompositeInsert %788, %795[6 : i32] : !spirv.ptr<f16, CrossWorkgroup> into !spirv.struct<(!spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>, !spirv.ptr<f16, CrossWorkgroup>)>
    %797 = spirv.SLessThan %135, %arg3 : i32
    %798 = spirv.Undef : !spirv.struct<(i1)>
    %799 = spirv.CompositeInsert %797, %348[0 : i32] : i1 into !spirv.struct<(i1, i1, i1, i1, i1, i1, i1, i1)>
    %800 = spirv.CompositeInsert %797, %799[1 : i32] : i1 into !spirv.struct<(i1, i1, i1, i1, i1, i1, i1, i1)>
    %801 = spirv.CompositeInsert %797, %800[2 : i32] : i1 into !spirv.struct<(i1, i1, i1, i1, i1, i1, i1, i1)>
    %802 = spirv.CompositeInsert %797, %801[3 : i32] : i1 into !spirv.struct<(i1, i1, i1, i1, i1, i1, i1, i1)>
    %803 = spirv.CompositeInsert %797, %802[4 : i32] : i1 into !spirv.struct<(i1, i1, i1, i1, i1, i1, i1, i1)>
    %804 = spirv.CompositeInsert %797, %803[5 : i32] : i1 into !spirv.struct<(i1, i1, i1, i1, i1, i1, i1, i1)>
    %805 = spirv.CompositeInsert %797, %804[6 : i32] : i1 into !spirv.struct<(i1, i1, i1, i1, i1, i1, i1, i1)>
    %806 = spirv.SLessThan %144, %arg4 : i32
    %807 = spirv.SLessThan %145, %arg4 : i32
    %808 = spirv.SLessThan %146, %arg4 : i32
    %809 = spirv.SLessThan %147, %arg4 : i32
    %810 = spirv.SLessThan %148, %arg4 : i32
    %811 = spirv.SLessThan %149, %arg4 : i32
    %812 = spirv.SLessThan %150, %arg4 : i32
    %813 = spirv.SLessThan %151, %arg4 : i32
    %814 = spirv.CompositeInsert %806, %348[0 : i32] : i1 into !spirv.struct<(i1, i1, i1, i1, i1, i1, i1, i1)>
    %815 = spirv.CompositeInsert %807, %814[1 : i32] : i1 into !spirv.struct<(i1, i1, i1, i1, i1, i1, i1, i1)>
    %816 = spirv.CompositeInsert %808, %815[2 : i32] : i1 into !spirv.struct<(i1, i1, i1, i1, i1, i1, i1, i1)>
    %817 = spirv.CompositeInsert %809, %816[3 : i32] : i1 into !spirv.struct<(i1, i1, i1, i1, i1, i1, i1, i1)>
    %818 = spirv.CompositeInsert %810, %817[4 : i32] : i1 into !spirv.struct<(i1, i1, i1, i1, i1, i1, i1, i1)>
    %819 = spirv.CompositeInsert %811, %818[5 : i32] : i1 into !spirv.struct<(i1, i1, i1, i1, i1, i1, i1, i1)>
    %820 = spirv.CompositeInsert %812, %819[6 : i32] : i1 into !spirv.struct<(i1, i1, i1, i1, i1, i1, i1, i1)>
    %821 = spirv.LogicalAnd %797, %806 : i1
    %822 = spirv.LogicalAnd %797, %807 : i1
    %823 = spirv.LogicalAnd %797, %808 : i1
    %824 = spirv.LogicalAnd %797, %809 : i1
    %825 = spirv.LogicalAnd %797, %810 : i1
    %826 = spirv.LogicalAnd %797, %811 : i1
    %827 = spirv.LogicalAnd %797, %812 : i1
    %828 = spirv.LogicalAnd %797, %813 : i1
    %829 = spirv.CompositeInsert %821, %348[0 : i32] : i1 into !spirv.struct<(i1, i1, i1, i1, i1, i1, i1, i1)>
    %830 = spirv.CompositeInsert %822, %829[1 : i32] : i1 into !spirv.struct<(i1, i1, i1, i1, i1, i1, i1, i1)>
    %831 = spirv.CompositeInsert %823, %830[2 : i32] : i1 into !spirv.struct<(i1, i1, i1, i1, i1, i1, i1, i1)>
    %832 = spirv.CompositeInsert %824, %831[3 : i32] : i1 into !spirv.struct<(i1, i1, i1, i1, i1, i1, i1, i1)>
    %833 = spirv.CompositeInsert %825, %832[4 : i32] : i1 into !spirv.struct<(i1, i1, i1, i1, i1, i1, i1, i1)>
    %834 = spirv.CompositeInsert %826, %833[5 : i32] : i1 into !spirv.struct<(i1, i1, i1, i1, i1, i1, i1, i1)>
    %835 = spirv.CompositeInsert %827, %834[6 : i32] : i1 into !spirv.struct<(i1, i1, i1, i1, i1, i1, i1, i1)>
    %836 = spirv.Load "Input" %__builtin_var_LocalInvocationId___addr : vector<3xi64>
    %837 = spirv.CompositeExtract %836[0 : i32] : vector<3xi64>
    %838 = spirv.SConvert %837 : i64 to i32
    %839 = spirv.UMod %838, %cst32_i32 : i32
    %840 = spirv.UDiv %838, %cst32_i32 : i32
    %841 = spirv.UMod %840, %cst1_i32 : i32
    %842 = spirv.UDiv %840, %cst1_i32 : i32
    %843 = spirv.UMod %842, %cst1_i32 : i32
    %844 = spirv.UMod %843, %cst1_i32 : i32
    %845 = spirv.UMod %841, %cst2_i32 : i32
    %846 = spirv.UDiv %839, %cst4_i32 : i32
    %847 = spirv.IAdd %846, %cst8_i32 : i32
    %848 = spirv.UMod %839, %cst4_i32 : i32
    %849 = spirv.IMul %848, %cst2_i32 : i32
    %850 = spirv.IAdd %849, %cst1_i32 : i32
    %851 = spirv.IMul %844, %cst16_i32 : i32
    %852 = spirv.IAdd %846, %851 : i32
    %853 = spirv.IMul %845, %cst8_i32 : i32
    %854 = spirv.IAdd %849, %853 : i32
    %cst24_i32 = spirv.Constant 24 : i32
    %855 = spirv.IMul %852, %cst24_i32 : i32
    %856 = spirv.IAdd %855, %854 : i32
    %857 = spirv.PtrAccessChain %339[%856] : !spirv.ptr<f16, Workgroup>, i32
    %858 = spirv.Bitcast %857 : !spirv.ptr<f16, Workgroup> to !spirv.ptr<vector<2xf16>, Workgroup>
    %859 = spirv.Undef : vector<2xf16>
    %cst0_i64 = spirv.Constant 0 : i64
    %860 = spirv.VectorInsertDynamic %722, %859[%cst0_i64] : vector<2xf16>, i64
    %cst1_i64 = spirv.Constant 1 : i64
    %861 = spirv.VectorInsertDynamic %723, %860[%cst1_i64] : vector<2xf16>, i64
    spirv.Store "Workgroup" %858, %861 : vector<2xf16>
    %862 = spirv.Load "Input" %__builtin_var_LocalInvocationId___addr : vector<3xi64>
    %863 = spirv.CompositeExtract %862[0 : i32] : vector<3xi64>
    %864 = spirv.SConvert %863 : i64 to i32
    %865 = spirv.UMod %864, %cst32_i32 : i32
    %866 = spirv.UDiv %864, %cst32_i32 : i32
    %867 = spirv.UMod %866, %cst1_i32 : i32
    %868 = spirv.UDiv %866, %cst1_i32 : i32
    %869 = spirv.UMod %868, %cst1_i32 : i32
    %870 = spirv.UMod %869, %cst1_i32 : i32
    %871 = spirv.UMod %867, %cst2_i32 : i32
    %872 = spirv.UDiv %865, %cst4_i32 : i32
    %873 = spirv.IAdd %872, %cst8_i32 : i32
    %874 = spirv.UMod %865, %cst4_i32 : i32
    %875 = spirv.IMul %874, %cst2_i32 : i32
    %876 = spirv.IAdd %875, %cst1_i32 : i32
    %877 = spirv.IMul %870, %cst16_i32 : i32
    %878 = spirv.IAdd %873, %877 : i32
    %879 = spirv.IMul %871, %cst8_i32 : i32
    %880 = spirv.IAdd %875, %879 : i32
    %881 = spirv.IMul %878, %cst24_i32 : i32
    %882 = spirv.IAdd %881, %880 : i32
    %883 = spirv.PtrAccessChain %339[%882] : !spirv.ptr<f16, Workgroup>, i32
    %884 = spirv.Bitcast %883 : !spirv.ptr<f16, Workgroup> to !spirv.ptr<vector<2xf16>, Workgroup>
    %885 = spirv.VectorInsertDynamic %724, %859[%cst0_i64] : vector<2xf16>, i64
    %886 = spirv.VectorInsertDynamic %725, %885[%cst1_i64] : vector<2xf16>, i64
    spirv.Store "Workgroup" %884, %886 : vector<2xf16>
    %887 = spirv.Load "Input" %__builtin_var_LocalInvocationId___addr : vector<3xi64>
    %888 = spirv.CompositeExtract %887[0 : i32] : vector<3xi64>
    %889 = spirv.SConvert %888 : i64 to i32
    %890 = spirv.UMod %889, %cst32_i32 : i32
    %891 = spirv.UDiv %889, %cst32_i32 : i32
    %892 = spirv.UMod %891, %cst1_i32 : i32
    %893 = spirv.UDiv %891, %cst1_i32 : i32
    %894 = spirv.UMod %893, %cst1_i32 : i32
    %895 = spirv.UMod %894, %cst1_i32 : i32
    %896 = spirv.UMod %892, %cst2_i32 : i32
    %897 = spirv.UDiv %890, %cst4_i32 : i32
    %898 = spirv.IAdd %897, %cst8_i32 : i32
    %899 = spirv.UMod %890, %cst4_i32 : i32
    %900 = spirv.IMul %899, %cst2_i32 : i32
    %901 = spirv.IAdd %900, %cst1_i32 : i32
    %902 = spirv.IMul %895, %cst16_i32 : i32
    %903 = spirv.IAdd %897, %902 : i32
    %904 = spirv.IMul %896, %cst8_i32 : i32
    %905 = spirv.IAdd %900, %904 : i32
    %906 = spirv.IAdd %905, %cst8_i32 : i32
    %907 = spirv.IMul %903, %cst24_i32 : i32
    %908 = spirv.IAdd %907, %906 : i32
    %909 = spirv.PtrAccessChain %339[%908] : !spirv.ptr<f16, Workgroup>, i32
    %910 = spirv.Bitcast %909 : !spirv.ptr<f16, Workgroup> to !spirv.ptr<vector<2xf16>, Workgroup>
    %911 = spirv.VectorInsertDynamic %726, %859[%cst0_i64] : vector<2xf16>, i64
    %912 = spirv.VectorInsertDynamic %727, %911[%cst1_i64] : vector<2xf16>, i64
    spirv.Store "Workgroup" %910, %912 : vector<2xf16>
    %913 = spirv.Load "Input" %__builtin_var_LocalInvocationId___addr : vector<3xi64>
    %914 = spirv.CompositeExtract %913[0 : i32] : vector<3xi64>
    %915 = spirv.SConvert %914 : i64 to i32
    %916 = spirv.UMod %915, %cst32_i32 : i32
    %917 = spirv.UDiv %915, %cst32_i32 : i32
    %918 = spirv.UMod %917, %cst1_i32 : i32
    %919 = spirv.UDiv %917, %cst1_i32 : i32
    %920 = spirv.UMod %919, %cst1_i32 : i32
    %921 = spirv.UMod %920, %cst1_i32 : i32
    %922 = spirv.UMod %918, %cst2_i32 : i32
    %923 = spirv.UDiv %916, %cst4_i32 : i32
    %924 = spirv.IAdd %923, %cst8_i32 : i32
    %925 = spirv.UMod %916, %cst4_i32 : i32
    %926 = spirv.IMul %925, %cst2_i32 : i32
    %927 = spirv.IAdd %926, %cst1_i32 : i32
    %928 = spirv.IMul %921, %cst16_i32 : i32
    %929 = spirv.IAdd %924, %928 : i32
    %930 = spirv.IMul %922, %cst8_i32 : i32
    %931 = spirv.IAdd %926, %930 : i32
    %932 = spirv.IAdd %931, %cst8_i32 : i32
    %933 = spirv.IMul %929, %cst24_i32 : i32
    %934 = spirv.IAdd %933, %932 : i32
    %935 = spirv.PtrAccessChain %339[%934] : !spirv.ptr<f16, Workgroup>, i32
    %936 = spirv.Bitcast %935 : !spirv.ptr<f16, Workgroup> to !spirv.ptr<vector<2xf16>, Workgroup>
    %937 = spirv.VectorInsertDynamic %728, %859[%cst0_i64] : vector<2xf16>, i64
    %938 = spirv.VectorInsertDynamic %729, %937[%cst1_i64] : vector<2xf16>, i64
    spirv.Store "Workgroup" %936, %938 : vector<2xf16>
    spirv.ControlBarrier <Workgroup>, <Workgroup>, <AcquireRelease|WorkgroupMemory>
    %939 = spirv.IMul %83, %cst24_i32 : i32
    %940 = spirv.IAdd %939, %88 : i32
    %941 = spirv.PtrAccessChain %339[%940] : !spirv.ptr<f16, Workgroup>, i32
    %942 = spirv.Bitcast %941 : !spirv.ptr<f16, Workgroup> to !spirv.ptr<vector<8xf16>, Workgroup>
    %943 = spirv.Load "Workgroup" %942 : vector<8xf16>
    %944 = spirv.VectorExtractDynamic %943[%cst0_i64] : vector<8xf16>, i64
    %945 = spirv.VectorExtractDynamic %943[%cst1_i64] : vector<8xf16>, i64
    %cst2_i64 = spirv.Constant 2 : i64
    %946 = spirv.VectorExtractDynamic %943[%cst2_i64] : vector<8xf16>, i64
    %cst3_i64 = spirv.Constant 3 : i64
    %947 = spirv.VectorExtractDynamic %943[%cst3_i64] : vector<8xf16>, i64
    %cst4_i64 = spirv.Constant 4 : i64
    %948 = spirv.VectorExtractDynamic %943[%cst4_i64] : vector<8xf16>, i64
    %cst5_i64 = spirv.Constant 5 : i64
    %949 = spirv.VectorExtractDynamic %943[%cst5_i64] : vector<8xf16>, i64
    %cst6_i64 = spirv.Constant 6 : i64
    %950 = spirv.VectorExtractDynamic %943[%cst6_i64] : vector<8xf16>, i64
    %cst7_i64 = spirv.Constant 7 : i64
    %951 = spirv.VectorExtractDynamic %943[%cst7_i64] : vector<8xf16>, i64
    %952 = spirv.CompositeInsert %944, %730[0 : i32] : f16 into !spirv.struct<(f16, f16, f16, f16, f16, f16, f16, f16)>
    %953 = spirv.CompositeInsert %945, %952[1 : i32] : f16 into !spirv.struct<(f16, f16, f16, f16, f16, f16, f16, f16)>
    %954 = spirv.CompositeInsert %946, %953[2 : i32] : f16 into !spirv.struct<(f16, f16, f16, f16, f16, f16, f16, f16)>
    %955 = spirv.CompositeInsert %947, %954[3 : i32] : f16 into !spirv.struct<(f16, f16, f16, f16, f16, f16, f16, f16)>
    %956 = spirv.CompositeInsert %948, %955[4 : i32] : f16 into !spirv.struct<(f16, f16, f16, f16, f16, f16, f16, f16)>
    %957 = spirv.CompositeInsert %949, %956[5 : i32] : f16 into !spirv.struct<(f16, f16, f16, f16, f16, f16, f16, f16)>
    %958 = spirv.CompositeInsert %950, %957[6 : i32] : f16 into !spirv.struct<(f16, f16, f16, f16, f16, f16, f16, f16)>
    %true = spirv.Constant true
    %959 = spirv.Load "Input" %__builtin_var_LocalInvocationId___addr : vector<3xi64>
    %960 = spirv.CompositeExtract %959[0 : i32] : vector<3xi64>
    %961 = spirv.SConvert %960 : i64 to i32
    %962 = spirv.UMod %961, %cst32_i32 : i32
    %963 = spirv.UDiv %961, %cst32_i32 : i32
    %964 = spirv.UDiv %963, %cst1_i32 : i32
    %965 = spirv.UDiv %962, %cst2_i32 : i32
    %966 = spirv.LogicalAnd %true, %821 : i1
    spirv.BranchConditional %966, ^bb12, ^bb13
  ^bb12:  // pred: ^bb11
    %967 = spirv.VectorInsertDynamic %944, %859[%cst0_i32] : vector<2xf16>, i32
    %968 = spirv.VectorInsertDynamic %945, %967[%cst1_i32] : vector<2xf16>, i32
    %969 = spirv.Bitcast %968 : vector<2xf16> to i32
    %970 = spirv.VectorInsertDynamic %969, %380[%cst0_i32] : vector<4xi32>, i32
    %971 = spirv.VectorInsertDynamic %946, %859[%cst0_i32] : vector<2xf16>, i32
    %972 = spirv.VectorInsertDynamic %947, %971[%cst1_i32] : vector<2xf16>, i32
    %973 = spirv.Bitcast %972 : vector<2xf16> to i32
    %974 = spirv.VectorInsertDynamic %973, %970[%cst1_i32] : vector<4xi32>, i32
    %975 = spirv.VectorInsertDynamic %948, %859[%cst0_i32] : vector<2xf16>, i32
    %976 = spirv.VectorInsertDynamic %949, %975[%cst1_i32] : vector<2xf16>, i32
    %977 = spirv.Bitcast %976 : vector<2xf16> to i32
    %978 = spirv.VectorInsertDynamic %977, %974[%cst2_i32] : vector<4xi32>, i32
    %979 = spirv.VectorInsertDynamic %950, %859[%cst0_i32] : vector<2xf16>, i32
    %980 = spirv.VectorInsertDynamic %951, %979[%cst1_i32] : vector<2xf16>, i32
    %981 = spirv.Bitcast %980 : vector<2xf16> to i32
    %982 = spirv.VectorInsertDynamic %981, %978[%cst3_i32] : vector<4xi32>, i32
    %983 = spirv.Bitcast %782 : !spirv.ptr<f16, CrossWorkgroup> to !spirv.ptr<vector<4xi32>, CrossWorkgroup>
    spirv.Store "CrossWorkgroup" %983, %982 : vector<4xi32>
    spirv.Branch ^bb13
  ^bb13:  // 2 preds: ^bb11, ^bb12
    spirv.Return
  }
}
Jianhui-Li commented 1 year ago

But for some minor portion of the kernel, we might have to use the SIMD paradigm for performance or functionality. Like: we may want to use the VC intrinsic in the Triton kernel.

Could you please explain the exact use case behind the "might have to use" scenario?

Jianhui-Li commented 1 year ago

A benchmark would also be convincing, for example a SYCL example that emulates the real use case and shows the benefit of using invoke_SIMD from SIMT code.

I am not sure how large the invoke_SIMD overhead is, or whether it would make this kind of mixing unappealing.

chengjunlu commented 1 year ago

It is hard to achieve the best performance with the SPIRV JointMatrixMatmul. We may need to use DPAS explicitly in the IR to fuse pre-ops and post-ops in GEMM.

Jianhui-Li commented 1 year ago

I have concerns about the overhead of mixing SIMT and SIMD code.

chengjunlu commented 1 year ago

Yeah. Based on the SYCL example, the SIMT-SIMD calling convention does not perform as well as expected. IGC uses a register call to mix the SIMT and SIMD functions.


        call (16|M0)             r127.0        L_f0__BB_0_0                     {A@1}                // $75
    .....
L_f0__BB_0_0:
(W)     mov (2|M0)               r2.2<1>:ud    r26.0<1;1,0>:ud                                       // $1
(W)     asr (1|M0)               r2.1<1>:d     r29.0<0;1,0>:d    31:w               {Compacted}      // $4
(W)     shl (1|M0)               r2.0<1>:d     r29.0<0;1,0>:d    2:w               {Compacted}       // $7
(W)     shr (1|M0)               r2.4<1>:ud    r29.0<0;1,0>:ud   0x1E:uw                             // $6
(W)     shl (1|M0)               r2.5<1>:d     r2.1<0;1,0>:d     2:w               {I@3}             // $5
(W)     addc (1|M0)              r3.0<1>:ud    r2.2<0;1,0>:ud    r2.0<0;1,0>:ud   {AccWrEn,I@3}      // $9
(W)     or (1|M0)                r2.1<1>:d     r2.5<0;1,0>:d     r2.4<0;1,0>:d    {I@2}              // $8
(W)     mov (1|M0)               r5.0<1>:ud    acc0.0<0;1,0>:ud                 {Compacted}          // $9
(W)     mov (1|M0)               r4.0<1>:f     r3.0<0;1,0>:f                    {Compacted,I@3}      // $10
(W)     add3 (1|M0)              r4.1<1>:d     r5.0<0;0>:d       r2.3<0;0>:d       r2.1<0>:d        {I@1} // $11
(W)     shr (1|M0)               a0.2<1>:ud    r126.7<0;1,0>:ud  0x4:ud              {F@1}           // $1
(W)     send.dc1 (16|M0)         r2       r4      null:0  0x0            0x022D0BFF           {A@1,$0} // wr:1h+0, rd:2; a64 aligned oword block read x4 // $12
(W)     add (1|M0)               r126.0<1>:ud  r127.2<0;1,0>:ud  0x0:ud              {Compacted}     // $1
(W)     mov (4|M0)               r59.4<1>:ud   r127.0<1;1,0>:ud                                      //  save vISA SP/FP to temp; $1
(W)     store.ugm.d32x8t.a32 (1|M0)  ss[a0.2][r126:1] r127:1       {ExBSO,A@2,$1} // ex_desc:a0.2; desc:0x4200C504 //  spill to FP[0*32] of ?; $1
(W)     mov (1|M0)               r127.3<1>:ud  r127.2<0;1,0>:ud                 {$1.src}             //  vISA_FP = vISA_SP; $1
(W)     add (1|M0)               r127.2<1>:ud  r127.2<0;1,0>:ud  0x40:ud                             //  vISA_SP += vISA_frameSize; $1
(W)     mov (4|M0)               r127.0<1>:ud  r59.4<1;1,0>:ud                  {I@3}                //  restore vISA SP/FP from temp; $15
(W)     add (16|M0)              r4.0<1>:f     r2.0<1;1,0>:f     r27.0<1;1,0>:f   {Compacted,$0.dst} // $13
(W)     add (16|M0)              r26.0<1>:f    r2.0<1;1,0>:f     r27.0<1;1,0>:f   {Compacted}        // $14
        ret (16|M0)                          r127.0                           {A@1}                  // $15

This is not optimized for now, but I think it could be optimized at the link phase by replacing the register function call with an inlined call and then applying further link-time optimization.

The SIMT-SIMD calling convention is a good mechanism for aligning our SIMT and SIMD paradigms (for example, calling a XeTLA micro kernel inside the Triton kernel).

Jianhui-Li commented 1 year ago

Yes. We will enable this one. We would like to hide this within the XeTile dialect as a first step. Then we may need an additional pass in the integration code (say, on the Triton side) to merge multiple invoke_SIMD calls into one.
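
To make the merging idea concrete, here is a minimal Python sketch of what such a pass could do on a toy op list. It is not the XeTile or Triton implementation: the `Op` class, the op names, and `merge_adjacent_invoke_simd` are all hypothetical, and a real pass would also have to check data dependencies and argument passing before fusing.

```python
from dataclasses import dataclass, field


@dataclass
class Op:
    """Toy stand-in for an IR operation (hypothetical, for illustration only)."""
    name: str                                    # e.g. "invoke_simd" or "simt.exp"
    callees: list = field(default_factory=list)  # SIMD functions called, in order


def merge_adjacent_invoke_simd(ops):
    """Fuse each run of consecutive invoke_simd ops into a single op.

    Every fused run costs one SIMT->SIMD transition instead of one per call.
    The input list is left unchanged.
    """
    merged = []
    for op in ops:
        if op.name == "invoke_simd" and merged and merged[-1].name == "invoke_simd":
            # Extend the previous call rather than emitting a new transition.
            merged[-1] = Op("invoke_simd", merged[-1].callees + op.callees)
        else:
            merged.append(Op(op.name, list(op.callees)))
    return merged


# Hypothetical kernel body: two back-to-back SIMD calls, a SIMT op, one more SIMD call.
kernel = [
    Op("invoke_simd", ["load_tile"]),
    Op("invoke_simd", ["dpas"]),
    Op("simt.exp"),
    Op("invoke_simd", ["store_tile"]),
]
fused = merge_adjacent_invoke_simd(kernel)
```

In this toy case the first two calls collapse into one, while the SIMT op in between keeps the last call separate.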

In the future, when you say "performance not as expected", please report exactly how much you expect and how much it is currently. The benchmark should be as close to the real case as possible. This time, we will build micro benchmarks to track the XeTile level, such as load/store/dpas on shapes built from the sizes 8, 16, 24, 32, and 64. That will give us a good understanding of how large the overhead is.
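
As a rough sketch of that reporting format, the Python helper below enumerates tile shapes from those sizes and prints expected versus measured time with the relative overhead. The function names are hypothetical, treating a "shape" as a 2D (rows, cols) pair is an assumption, and the timings have to come from a real benchmark run; nothing here measures anything.

```python
from itertools import product

SIZES = (8, 16, 24, 32, 64)  # tile sizes named in the comment above


def tile_shapes(sizes=SIZES):
    """All (rows, cols) shapes built from the sizes (2D shapes are an assumption)."""
    return list(product(sizes, repeat=2))


def overhead_percent(expected_us, measured_us):
    """How much slower the measured time is than the expected time, in percent."""
    if expected_us <= 0:
        raise ValueError("expected_us must be positive")
    return (measured_us - expected_us) / expected_us * 100.0


def report_line(op, shape, expected_us, measured_us):
    """One report row: the expected number, the current number, and the gap."""
    pct = overhead_percent(expected_us, measured_us)
    return (
        f"{op} {shape[0]}x{shape[1]}: expected {expected_us:.2f} us, "
        f"measured {measured_us:.2f} us, overhead {pct:+.1f}%"
    )
```

A report would then have one `report_line` per op (load, store, dpas) and per entry of `tile_shapes()`, so both the expectation and the current number are stated explicitly.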

chengjunlu commented 1 year ago

I have tried the patches for supporting this. We can close this issue when it is upstreamed.