Add support for aarch64 SVE ukernel

per commented 1 year ago

Request description

For example, AWS Graviton 3, based on Arm Neoverse V1 CPUs, supports SVE (Scalable Vector Extension). We want to add SVE ukernels and, beyond the mmt4d kernel itself, make the tile sizes depend on the vector length. For the tiling, the plan is to reuse parts of the mechanism from the vmvx backend so that the tile sizes are decided at runtime.

What component(s) does this issue relate to?

Compiler

Additional context

No response

dcaballe commented 1 year ago

@bjacob, @banach-space

banach-space commented 1 year ago

Great to see this gaining traction! 🚀

> have the tile sizes be decided at runtime

Wouldn't this be enabled through parametric tiling? That's already supported by the tiling infra in Linalg and, AFAIK, "just works" ™️ :) (*) It hasn't been wired up in IREE yet, but I hope to have this working in the coming week. That might solve this particular problem for you.

Having said that, I'm not familiar with the u-kernel lowering path, so I might be missing something obvious. Perhaps you'd be enabling this elsewhere?

-Andrzej

(*) I experimented with that when preparing the scalable vec RFC, see scalable tiling.

bjacob commented 1 year ago

@benvanik @MaheshRavishankar @hanhanW

bjacob commented 1 year ago

Sounds like the kind of topic that needs a video call with all of us :-)

bjacob commented 1 year ago

The complicated part is the dynamic tile size selection. I wonder if this project could be split into two stages: first work with a compile-time-specified SVE vector length and implement the SVE ukernels in that context; then tackle the dynamic vector length aspect.

MaheshRavishankar commented 1 year ago

There are a few things to untangle here. Would be great to sync up... I am going to be taking some personal time, but I can connect on Monday August 28th.

banach-space commented 1 year ago

> The complicated part is the dynamic tile size selection.

Just to clarify, SVE supports scalable vectors, but does not support "dynamic vectors" or "vector register grouping" - that's something that's available in the other CPU architecture that supports scalable vectors ;-)

Now, while the effective vector length is not known at compile time, it is known (and fixed) at run-time. So there is no "dynamism" here. This is crucial, because ultimately we only have to replace expressions like:

"a vector that contains 4 elements", with:
"a vector that contains 4 * vscale elements".

Yes, we don't know the value of vscale at compile time. However, we can still refer to it like any other SSA value (that's where "parametric" becomes important).
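As a rough illustration of the point above (a sketch in plain C, not code from IREE or MLIR): `vscale` is passed around as an ordinary runtime value, and the loop step becomes `4 * vscale` instead of a literal 4. The names `f32_lanes` and `scale_f32` are hypothetical, and the scalar inner loop only stands in for a single vector operation. On SVE, vscale is the hardware vector length in bits divided by 128.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of f32 lanes in one scalable vector.
 * A 128-bit granule holds 4 f32 values, so the count is 4 * vscale. */
size_t f32_lanes(size_t vscale) {
  assert(vscale >= 1);
  return 4 * vscale;
}

/* Multiplies n floats by `factor`, one "vector" of 4 * vscale elements at a
 * time. The inner scalar loop stands in for one vector operation. */
void scale_f32(float *data, size_t n, float factor, size_t vscale) {
  /* Unknown at compile time, but fixed for the whole run. */
  const size_t step = f32_lanes(vscale);
  for (size_t i = 0; i < n; i += step) {
    /* Tail handling: the last chunk may be shorter than `step`. */
    const size_t len = (n - i < step) ? (n - i) : step;
    for (size_t j = 0; j < len; ++j) {
      data[i + j] *= factor;
    }
  }
}
```

With a 256-bit SVE implementation, vscale is 2 and the step is 8 elements; with 128-bit vectors, the same code runs with a step of 4.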

> I wonder if this project could be split into two stages: first work with compile-time-specified SVE vector length, implement the SVE ukernels in that context; then tackle the dynamic vector length aspect.

That's an option, but the first step would be no different to simply replacing NEON kernels with SVE, right? IMHO, we should be aiming for "scalability" in the first iteration of this. Ultimately, that's the key feature of SVE that we are trying to enable through scalable vectorisation (*). Also, once we consider SME as well (and, directly related to this, Streaming SVE), we will start mixing different vector sizes in one compilation. That's because the runtime value of vscale will very likely differ between non-streaming and streaming SVE (this will depend on the actual implementation of SME).

> I am going to be taking some personal time, but I can connect on Monday August 28th.

Also available on Monday, Tue (preferred) and Weds next week.

-Andrzej

(*) We can, of course, treat u-kernels and scalable vectorisation separately.

per commented 1 year ago

@bjacob Yes, agreed. It can be split into two parts, where the second part is the trickier one with respect to getting the tiling handling in place.

I'm also available Monday or Tuesday (after 6:30 CET) for a call.

bjacob commented 1 year ago

> Just to clarify,

Thanks a lot @banach-space for the explanation.

> That's an option, but the first step would be no different to simply replacing NEON kernels with SVE, right? IMHO, we should be aiming for "scalability" in the first iteration of this. Ultimately, that's the key feature of SVE that we are trying to enable through scalable vectorisation (*).
>
> (*) We can, of course, treat u-kernels and scalable vectorisation separately

Yes, that's what I had in mind: with the split, the first half of the project still allows writing the final ukernel code. Only the compiler side is tricky to make work with the runtime tile size. So after the first stage is completed, the compiler still treats vscale as a compile-time constant, while the ukernels can treat it as a runtime value if you want. Ukernels are easy to evolve, and if you run into any friction with the current code, we can change anything.
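To make the split concrete, here is a minimal C sketch of a tile kernel that takes its tile sizes as runtime arguments. It is only loosely modelled on an mmt4d inner tile: the name `mmt4d_tile_f32`, the argument order, and the plain scalar loops are assumptions for illustration, not the actual IREE ukernel interface. The point is that the caller can pass a literal N0 in the first stage and a value derived from vscale later, without the kernel changing.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tile kernel:
 *   acc[M0 x N0] += lhs[M0 x K0] * transpose(rhs[N0 x K0]), repeated K1 times.
 * All tile sizes are plain runtime arguments. */
void mmt4d_tile_f32(float *acc, const float *lhs, const float *rhs,
                    size_t K1, size_t M0, size_t N0, size_t K0) {
  assert(M0 > 0 && N0 > 0 && K0 > 0);
  for (size_t k1 = 0; k1 < K1; ++k1) {
    /* Each step along K1 consumes one M0 x K0 and one N0 x K0 tile. */
    const float *lhs_tile = lhs + k1 * M0 * K0;
    const float *rhs_tile = rhs + k1 * N0 * K0;
    for (size_t m = 0; m < M0; ++m) {
      for (size_t n = 0; n < N0; ++n) {
        float sum = acc[m * N0 + n];
        for (size_t k = 0; k < K0; ++k) {
          sum += lhs_tile[m * K0 + k] * rhs_tile[n * K0 + k];
        }
        acc[m * N0 + n] = sum;
      }
    }
  }
}
```

In stage one the compiler would call this with, say, `N0 = 8` baked in for a known 256-bit vector length; in stage two it would pass `N0 = 4 * vscale` computed at runtime.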

iree-org / iree