rust-lang / rust


How should we handle matrix ABIs? #133144

Open workingjubilee opened 4 days ago

workingjubilee commented 4 days ago

Some CPU architectures have developed "matrix extensions". These are sometimes equivalent to "vectors, but bigger" in terms of how the ABI should be handled (reusing the same architectural state, thus having similar concerns). But not always! They may use entirely different architectural state, usually entirely "caller-save" (i.e. always "volatile" or "call-clobbered").

AArch64

Scalable Matrix Extensions

PowerPC

MMA

x86

AMX

References

programmerjake commented 4 days ago

afaik PowerPC MMA doesn't change the ABI: https://github.com/rust-lang/rust/issues/131800#issuecomment-2418749961

workingjubilee commented 4 days ago

It is good this issue is about handling ABIs rather than merely describing them, then? Specifically, if we want to avoid involving this in our ABIs, we need to adopt the same bans.

RalfJung commented 4 days ago

How does LLVM even represent these types in function signatures?

Sounds to me like this will require repr(matrix) and corresponding dedicated logic everywhere?

workingjubilee commented 3 days ago

I'm not sure if there's much in common that would justify repr(matrix). Each ISA might just require boutique handling here. But I am still trying to understand how Power ISA's MMA, Arm's Scalable Matrix Extensions, and x86's AMX tiles work, and how we will want to represent them.

My current understanding is:

Power ISA's Matrix Multiply Assist

C Interop

Intrinsics

Arm Scalable Matrix Extensions

It is almost more like a dedicated thread-local allocation... the "ZA array"... that gets reinterpreted or examined along various dimensions. Then you set the CPU into Matrix Math... sorry, "Streaming SVE" mode... and Big Array Math happens, accumulating into the ZA array. The Big Array Math, however, is expressible as vector operations that just might use a different vector length than the normal Arm SVE operations, which is why it's "Streaming SVE": the model is "matrix math is mostly a pile of vector operations, done really fast". Being in that mode does remove the ability to use some of the more complicated Arm SVE2 operations.
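To make the "pile of vector operations accumulating into a big array" model concrete, here is a toy sketch in plain Rust. It is not SME code and uses no intrinsics: `ZaTileF32`, its methods, and `za_demo` are hypothetical names invented for illustration, and it assumes a single `f32` tile whose side length is the streaming vector length divided by 32 bits.

```rust
/// Toy model of one f32 tile of the SME ZA storage (hypothetical type,
/// not a real API). The tile is square: `dim` = streaming vector length
/// in bits / 32, stored row-major.
struct ZaTileF32 {
    dim: usize,
    data: Vec<f32>,
}

impl ZaTileF32 {
    /// Assumption: the streaming vector length is a power of two
    /// between 128 and 2048 bits.
    fn new(svl_bits: usize) -> Self {
        assert!(svl_bits.is_power_of_two() && (128..=2048).contains(&svl_bits));
        let dim = svl_bits / 32;
        Self { dim, data: vec![0.0; dim * dim] }
    }

    /// Accumulate the outer product of two vectors into the tile:
    /// tile[i][j] += a[i] * b[j]. This is the "vector ops in, matrix out" step.
    fn outer_product_accumulate(&mut self, a: &[f32], b: &[f32]) {
        assert_eq!(a.len(), self.dim);
        assert_eq!(b.len(), self.dim);
        for (i, &ai) in a.iter().enumerate() {
            for (j, &bj) in b.iter().enumerate() {
                self.data[i * self.dim + j] += ai * bj;
            }
        }
    }

    /// View the tile along one dimension: a whole row as a vector.
    fn horizontal_slice(&self, row: usize) -> Vec<f32> {
        self.data[row * self.dim..(row + 1) * self.dim].to_vec()
    }

    /// View the tile along the other dimension: a whole column as a vector.
    fn vertical_slice(&self, col: usize) -> Vec<f32> {
        (0..self.dim).map(|r| self.data[r * self.dim + col]).collect()
    }
}

/// Two accumulating outer products, then read back one row and one column.
fn za_demo() -> (Vec<f32>, Vec<f32>) {
    let mut za = ZaTileF32::new(128); // 128-bit vectors give a 4x4 f32 tile
    za.outer_product_accumulate(&[1.0, 2.0, 3.0, 4.0], &[1.0, 0.0, 2.0, 0.0]);
    za.outer_product_accumulate(&[1.0, 1.0, 1.0, 1.0], &[0.0, 1.0, 0.0, 1.0]);
    (za.horizontal_slice(1), za.vertical_slice(2))
}
```

The relevant point for the ABI question is that the tile here is state owned by the "thread" (the struct), not a value passed between functions, which is roughly why it does not map cleanly onto "vectors, but bigger".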

C Interop

x86 AMX Tiles

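The tiles seem to be more "classic" registers, but use an interesting API. They are also "shape-changing" in a way: each tile's number of rows and bytes per row comes from a tile configuration rather than being fixed. I assume @sayantn knows more about this.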
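As a rough illustration of the "shape-changing" part, the sketch below models the 64-byte tile configuration block as I understand it from Intel's documentation for `LDTILECFG`: a palette byte, a start row, then per-tile bytes-per-row and row counts. The struct and field names, `set_shape`, `tile_bytes`, and `amx_demo` are hypothetical, and the limits (8 tiles, at most 16 rows of 64 bytes) are an assumption about palette 1. Nothing here touches real AMX state.

```rust
/// Hypothetical model of the 64-byte tile configuration memory image.
/// Layout is an assumption based on Intel's description of palette 1.
#[repr(C)]
#[derive(Clone, Copy, Debug, Default)]
struct TileConfig {
    palette: u8,        // byte 0
    start_row: u8,      // byte 1
    reserved: [u8; 14], // bytes 2..16, must be zero
    colsb: [u16; 16],   // bytes 16..48, bytes per row for each tile
    rows: [u8; 16],     // bytes 48..64, row count for each tile
}

const NUM_TILES: usize = 8; // assumed: only the first 8 entries are usable
const MAX_ROWS: u8 = 16;
const MAX_COLSB: u16 = 64;

impl TileConfig {
    fn new() -> Self {
        TileConfig { palette: 1, ..Default::default() }
    }

    /// Give a tile a shape. Zero rows or columns would mean "tile unused"
    /// in the configuration; this helper rejects that to keep the demo simple.
    fn set_shape(&mut self, tile: usize, rows: u8, colsb: u16) -> Result<(), String> {
        let ok = tile < NUM_TILES
            && (1..=MAX_ROWS).contains(&rows)
            && (1..=MAX_COLSB).contains(&colsb);
        if !ok {
            return Err(format!("invalid shape {rows}x{colsb} for tile {tile}"));
        }
        self.rows[tile] = rows;
        self.colsb[tile] = colsb;
        Ok(())
    }

    /// Number of bytes of data the tile holds under the current configuration.
    fn tile_bytes(&self, tile: usize) -> usize {
        self.rows[tile] as usize * self.colsb[tile] as usize
    }
}

/// The same tile register holds a different amount of data after reconfiguring.
fn amx_demo() -> (usize, usize) {
    let mut cfg = TileConfig::new();
    cfg.set_shape(0, 16, 64).unwrap();
    let full = cfg.tile_bytes(0);
    cfg.set_shape(0, 8, 32).unwrap();
    (full, cfg.tile_bytes(0))
}
```

If this model is roughly right, a tile's size is a runtime property of the configuration, which is part of why a single fixed-layout Rust type for "a tile" is awkward to pass by value across function boundaries.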

C Interop

Intrinsics