Closed c-rhodes closed 1 year ago
Thanks a lot for the RFC! I just wanted to post a quick comment to let you know that I'm processing the details and this hasn't fallen through the cracks :) I wonder, though, if this is something that we should move to the MLIR discourse. I have the impression that we want to have single way to represent SSME in MLIR, beyond IREE, and then take that representation and implement it within IREE. WDYT?
Thanks a lot for the RFC! I just wanted to post a quick comment to let you know that I'm processing the details and this hasn't fallen through the cracks :) I wonder, though, if this is something that we should move to the MLIR discourse. I have the impression that we want to have single way to represent SSME in MLIR, beyond IREE, and then take that representation and implement it within IREE. WDYT?
Initially that was the plan but I was focused on enabling SSVE in IREE and that influenced the design, I wasn't sure other users of MLIR had the same constraints. To represent this in MLIR I would propose adding a pass that adds the aarch64_pstate_sm_enabled
attribute to all functions, and later come up with a heuristic that selectively enables SSVE. I'd be happy to remove IREE specific stuff from this RFC and post on LLVM discourse.
That's a very detailed and a well crafted RFC, thank you @c-rhodes !
To represent this in MLIR I would propose adding a pass that adds the aarch64_pstate_sm_enabled attribute to all functions
Wouldn't aarch64_pstate_sm_body
work equally well in MLIR? I know that in IREE that's effectively the only option (aarch64_pstate_sm_enabled
changes the ABI, and we want to avoid that). Basically, aarch64_pstate_sm_body
feels like the "safe option" for now, regardless of whether focusing on IREE or MLIR. And we can always revisit later once this choice is too limiting. So, unless I am missing something, we could land this in MLIR as is :)
Following up to my reply to the MLIR post…
Another option is to implement this as a pass in either MLIR or IREE
I think that at least the utility that annotates the functions with the attributes should be implemented and tested in MLIR. Also the lowering to LLVM. That would help the community align with the proposed SSME representation and build on top of that without reinventing the wheel. We don’t have to restrict it to IREE’s specific use case. The utility could take a function and a “streaming mode” and return the annotated function which would eventually be lowered to LLVM.
This heuristic could be leveraged to selectively enable SSVE by integrating the MLIR pass into the relevant lowering configuration pipelines +1. If you initially don’t feel very confident about these heuristics, this is something we can incrementally build within IREE and once it reaches certain level of maturity we could reconsider if it has value for MLIR.
One other thing to consider is the distribution of workgroups in dispatch regions, for SME it might be best for the region affinity and distribution to be on a single thread or serialized. The --iree-codegen-llvm-disable-distribution flag could be used for this and enabled by default for SSVE.
I think that flag is there mostly for testing/debugging purposes. However, we have target specific distribution (and vectorization/unrolling) configurations so it would be a matter of setting the tile sizes for distribution to zero for SSVE. Similar to what the flag does, actually.
In general the approach makes sense to me! Hopefully that helps! Let us know if we can help with anything else!
That's a very detailed and a well crafted RFC, thank you @c-rhodes !
To represent this in MLIR I would propose adding a pass that adds the aarch64_pstate_sm_enabled attribute to all functions
Wouldn't
aarch64_pstate_sm_body
work equally well in MLIR? I know that in IREE that's effectively the only option (aarch64_pstate_sm_enabled
changes the ABI, and we want to avoid that). Basically,aarch64_pstate_sm_body
feels like the "safe option" for now, regardless of whether focusing on IREE or MLIR. And we can always revisit later once this choice is too limiting. So, unless I am missing something, we could land this in MLIR as is :)
Thanks Andrzej. Yes it would work in MLIR but there are consequences to PSTATE.SM not being part of the interface highlighted above. The aarch64_pstate_sm_enabled
attribute is more suitable for calls between streaming functions, here an example comparing the codegen for each attribute for this:
define void @streaming_func1() #0 {
ret void
}
define void @streaming_func2() #0 {
call void @streaming_func1()
ret void
}
attributes #0 = { "aarch64_pstate_sm_enabled" }
compile
; llc -mtriple=aarch64-linux-gnu -mattr=+sme
streaming_func1: // @streaming_func1
ret
streaming_func2: // @streaming_func2
str x30, [sp, #-16]! // 8-byte Folded Spill
bl streaming_func1
ldr x30, [sp], #16 // 8-byte Folded Reload
ret
define void @streaming_func1() #0 {
ret void
}
define void @streaming_func2() #0 {
call void @streaming_func1()
ret void
}
attributes #0 = { "aarch64_pstate_sm_body" }
compile
; llc -mtriple=aarch64-linux-gnu -mattr=+sme
streaming_func1: // @streaming_func1
stp d15, d14, [sp, #-64]! // 16-byte Folded Spill
stp d13, d12, [sp, #16] // 16-byte Folded Spill
stp d11, d10, [sp, #32] // 16-byte Folded Spill
stp d9, d8, [sp, #48] // 16-byte Folded Spill
smstart sm
smstop sm
ldp d9, d8, [sp, #48] // 16-byte Folded Reload
ldp d11, d10, [sp, #32] // 16-byte Folded Reload
ldp d13, d12, [sp, #16] // 16-byte Folded Reload
ldp d15, d14, [sp], #64 // 16-byte Folded Reload
ret
streaming_func2: // @streaming_func2
stp d15, d14, [sp, #-80]! // 16-byte Folded Spill
stp d13, d12, [sp, #16] // 16-byte Folded Spill
stp d11, d10, [sp, #32] // 16-byte Folded Spill
stp d9, d8, [sp, #48] // 16-byte Folded Spill
str x30, [sp, #64] // 8-byte Folded Spill
smstart sm
smstop sm
bl streaming_func1
smstart sm
smstop sm
ldp d9, d8, [sp, #48] // 16-byte Folded Reload
ldp d11, d10, [sp, #32] // 16-byte Folded Reload
ldp d13, d12, [sp, #16] // 16-byte Folded Reload
ldr x30, [sp, #64] // 8-byte Folded Reload
ldp d15, d14, [sp], #80 // 16-byte Folded Reload
ret
In IREE there's no option but to use the internal attribute but I don't know if that's the case for MLIR.
Following up to my reply to the MLIR post…
Another option is to implement this as a pass in either MLIR or IREE
I think that at least the utility that annotates the functions with the attributes should be implemented and tested in MLIR. Also the lowering to LLVM. That would help the community align with the proposed SSME representation and build on top of that without reinventing the wheel. We don’t have to restrict it to IREE’s specific use case. The utility could take a function and a “streaming mode” and return the annotated function which would eventually be lowered to LLVM.
Thanks for the suggestion. I look at moving the pass to MLIR with an option to control which attribute is used. There's already a test covering the lowering to LLVM for the aarch64_pstate_sm_enabled
attribute I added recently: https://github.com/llvm/llvm-project/commit/c8d1388e6c8bd57299d5801f170719218f735c4c
Although it has now been made clear the passthrough mechanism is for prototyping so this will need updating.
This heuristic could be leveraged to selectively enable SSVE by integrating the MLIR pass into the relevant lowering configuration pipelines
+1. If you initially don’t feel very confident about these heuristics, this is something we can incrementally build within IREE and once it reaches certain level of maturity we could reconsider if it has value for MLIR.
I agree, this is inline with our thinking.
One other thing to consider is the distribution of workgroups in dispatch regions, for SME it might be best for the region affinity and distribution to be on a single thread or serialized. The --iree-codegen-llvm-disable-distribution flag could be used for this and enabled by default for SSVE.
I think that flag is there mostly for testing/debugging purposes. However, we have target specific distribution (and vectorization/unrolling) configurations so it would be a matter of setting the tile sizes for distribution to zero for SSVE. Similar to what the flag does, actually.
Thanks for pointing that out, I don't think this is an issue for us yet but that will be useful for when it is.
In general the approach makes sense to me! Hopefully that helps! Let us know if we can help with anything else!
Thanks again Diego!
Yes it would work in MLIR but there are consequences to PSTATE.SM not being part of the interface highlighted above. The
aarch64_pstate_sm_enabled
attribute is more suitable for calls between streaming functions, here an example comparing the codegen for each attribute for this:
Thank you for your thorough explanation and the CE link - cool to see this working upstream!
I look at moving the pass to MLIR with an option to control which attribute is used.
+1 to adding an option. In principle, it sounds like aarch64_pstate_sm_body
is both the easier and pretty much always correct option. But for completeness, we will support both:
aarch64_pstate_sm_body
(this is what you originally proposed for IREE), andaarch64_pstate_sm_enabled
(for when e.g. making calls between "streaming" functions).We are yet to understand the most efficient way of using the latter, but that's for later (Rome was not built in a day!).
Thanks for driving this!
Closing this now #13558 has been merged.
The Armv9 Scalable Matrix Extension (SME) defines a new "Streaming SVE" (SSVE) execution mode [1].
The purpose of this RFC is to evaluate what is required to support this mode in MLIR / IREE and propose next steps to address this. The focus of this RFC is SSVE only, not full SME support, but this is an important first step towards this.
SSVE is controlled by a processor state bit called PSTATE.SM (see section B1.1 in [1]). From [2]:
From [3]:
To support this feature there are two key problems to address:
This RFC is concerned with the former. The latter entails support for scalable vector enablement in MLIR / IREE which is outside of the scope of this RFC, but my colleague Andrzej Warzynski from Arm and Diego Caballero from Google posted an RFC on scalable vectorisation in Linalg last week that's relevant to this and worth a read.
In the LLVM backend PSTATE.SM is managed at the function boundary with function and callsite attributes. These were added in the enablement of the AArch64 SME Arm C Language Extensions (ACLE) [4]. The LLVM AArch64 SME docs [3] go into more detail on this.
To support SSVE in MLIR / IREE we can leverage these existing attributes to manage PSTATE.SM. [5] provides a list of attributes, for the purposes of this RFC only the SSVE attributes are relevant, these are:
The next section goes into detail for each of them and provides examples.
aarch64_pstate_sm_enabled
Calls to functions with this attribute will be wrapped with
smstart sm
/smstop sm
. For example, take the following LLVM IR:Compiled with:
Produces the following asm:
Streaming mode is enabled before the call to the streaming function and disabled after.
aarch64_pstate_sm_body
The backend will insert
smstart sm
/smstop sm
into the prologue/epilogue for functions with this attribute.For example, using the IR from the previous example with this attribute:
Produces the following asm:
Notice
smstart sm
/smstop sm
are emitted inside the streaming function, rather than around the call as with the previous attribute.The ACLE [6] provides more detail on this:
Since this is an internal attribute that's not visible to the caller, the compiler must disable streaming mode if enabled before calling a locally streaming function.
For example, the following IR:
Produces this asm:
The compiler has to emit unnecessary smstart/smstop instructions that in turn cause expensive spills / fills, since the CPU zeroes the FP/vector registers when changing PSTATE.SM.
aarch64_pstate_sm_compatible
Functions with this attribute stick to the streaming subset of SVE and are callable from both normal and streaming-mode. What this means in practice is the function is restricted to set of instructions that are legal in either mode. This is useful for compatibility in things like vector versions of math.h routines such as tanh.
Initial plan for SSVE support in MLIR/IREE
Streaming mode can be targeted from MLIR at the function boundary by specifying these attributes via the passthrough mechanism [7]. This works today with no changes.
Example MLIR:
Compile:
By leveraging existing attributes the backend is responsible for ensuring the code generator produces instructions that are legal in streaming mode.
To enable streaming mode we need a mechanism to add these attributes to functions. In IREE the attributes could initially be applied to all dispatch functions, and later be applied based on a more fine-grained heuristic. This would be predicated on
--iree-llvm-target-cpu-features=+sve,+sme
and could be done during LLVM lowering [8].Another option is to implement this as a pass in either MLIR or IREE. In IREE the
iree-llvmcpu-lower-executable-target
pass adds atranslation_info
attribute to each dispatch function that describes the lowering pipeline to use. The pipelines are defined in [9] and are selected based on the ops in the dispatch. For example, a dispatch with a Linalg convolution op will use theDispatchLoweringPassPipeline::CPUConvTileAndDecomposeExpert
pipeline. This heuristic could be leveraged to selectively enable SSVE by integrating the MLIR pass into the relevant lowering configuration pipelines [10].One other thing to consider is the distribution of workgroups in dispatch regions, for SME it might be best for the region affinity and distribution to be on a single thread or serialized. The
--iree-codegen-llvm-disable-distribution
flag could be used for this and enabled by default for SSVE.Which attribute to use?
The ABI [11] states:
In the context of IREE where the runtime calls dispatch functions, this suggests it is the responsibility of the runtime. However, the runtime doesn't know the details of dispatches such that it could emit these instructions. If a pass is introduced that adds the
aarch64_pstate_sm_enabled
attribute to dispatch functions it's effectively changing the function's ABI forcing the caller (IREE runtime) to be responsible for managing PSTATE.SM before entry/exit. Given the current limitation of the runtime this attribute cannot be used.Streaming-mode must instead be managed on function entry/exit. The
aarch64_pstate_sm_body
attribute can be used for this. There may be scope for adding backend intrinsics forsmstart
/smstop
in the future that provide better granularity (i.e. not restricted to function boundary) for managing PSTATE.SM, but it is worth emphasising one of the big advantages to using the existing attributes is the constraint to only use instructions that are legal in streaming mode is handled by the backend. This will help to address the second key problem mentioned earlier.Another consideration is the runtime SVE vector length may differ in streaming mode. This is mentioned in the list of restrictions for these attributes [12]:
[13] contains further info on this:
The same restriction should be present in the pass.
The streaming-compatible interface doesn't seem particularly useful given the management of streaming-mode will be done in the MLIR/IREE compiler vs by the user in C code. We can make full use of the features in either mode.
Proposed next steps
I've shared a patch with a pass in IREE that adds the
aarch64_pstate_sm_body
to functions. The pass is enabled for AArch64 when SVE(2) and SME are enabled for the following lowering configurations:These configurations were chosen simply because they're used in one of our pipelines.
In the patch I shared I've added the pass to IREE rather than MLIR because it uses the internal backend attribute to enable SSVE to avoid adding support to the IREE runtime.
Any questions or suggestions would be appreciated. Thanks for reading!
Links
[1] https://developer.arm.com/documentation/ddi0616/latest [2] https://arm-software.github.io/acle/main/acle.html#introduction-to-streaming-and-non-streaming-mode [3] https://llvm.org/docs/AArch64SME.html#handling-pstate-sm [4] https://arm-software.github.io/acle/main/acle.html#sme-language-extensions-and-intrinsics [5] https://llvm.org/docs/AArch64SME.html#introduction [6] https://arm-software.github.io/acle/main/acle.html#changing-streaming-mode-locally [7] https://mlir.llvm.org/docs/Dialects/LLVM/#attribute-pass-through [8] https://github.com/openxla/iree/blob/4952c50057a78ef86fc92c17742b0e2674df5964/compiler/src/iree/compiler/Dialect/HAL/Target/LLVMCPU/LLVMCPUTarget.cpp#L231-L245 [9] https://github.com/openxla/iree/blob/4952c50057a78ef86fc92c17742b0e2674df5964/compiler/src/iree/compiler/Codegen/Dialect/LoweringConfig.td [10] https://github.com/openxla/iree/blob/4952c50057a78ef86fc92c17742b0e2674df5964/compiler/src/iree/compiler/Codegen/LLVMCPU/LLVMCPULowerExecutableTarget.cpp#L187 [11] https://github.com/ARM-software/abi-aa/blob/main/aapcs64/aapcs64.rst#671pstatesm-interfaces [12] https://llvm.org/docs/AArch64SME.html#restrictions-on-attributes [13] https://llvm.org/docs/AArch64SME.html#functions-with-attribute-arm-locally-streaming