How should we handle matrix ABIs?

workingjubilee commented 4 days ago

Some CPU architectures have developed "matrix extensions". These are sometimes equivalent to "vectors, but bigger" in terms of how the ABI should be handled (reusing the same architectural state, thus having similar concerns). But not always! They may use entirely different architectural state, usually entirely "caller-save" (i.e. always "volatile" or "call-clobbered").

AArch64

Scalable Matrix Extensions

PowerPC

MMA

https://github.com/rust-lang/rust/issues/131800#issuecomment-2418346013

x86

AMX

https://github.com/rust-lang/rust/issues/126622
introduces the amx_tile type, AKA x86_amx or __tile1024i

References

https://github.com/rust-lang/rust/issues/131800

programmerjake commented 4 days ago

afaik PowerPC MMA doesn't change the ABI: https://github.com/rust-lang/rust/issues/131800#issuecomment-2418749961

workingjubilee commented 4 days ago

It is good this issue is about handling ABIs rather than merely describing them, then? Specifically, if we want to avoid involving this in our ABIs, we need to adopt the same bans.

RalfJung commented 4 days ago

How does LLVM even represent these types in function signatures?

Sounds to me like this will require repr(matrix) and corresponding dedicated logic everywhere?

workingjubilee commented 3 days ago

I'm not sure if there's much in common that would justify repr(matrix). Each ISA might just require boutique handling here. But I am still trying to understand how Power ISA's MMA, Arm's Scalable Matrix Extensions, and x86's AMX tiles work, and how we will want to represent them.

My current understanding is:

PowerISA's Matrix Multiply Assist

__vector_pair and __vector_quad are the relevant types
__vector_quad represents the accumulator register

C Interop

According to clang the __vector_quad type should never be passed anywhere?

Intrinsics

The __vector_quad type is always handled by-pointer.
The __vector_pair type seems to be defined as opaque(?) yet is sometimes passed by-value to intrinsics.
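
To make the by-pointer vs. by-value split above concrete, here is a minimal software sketch in plain Rust. Everything in it is an assumption for illustration: `Accumulator`, `RegisterPair`, and `outer_product_accumulate` are hypothetical names, the lane layout is invented, and none of it is the real compiler type or intrinsic API. It only shows the shape such an API might take: the accumulator is only ever reached through `&mut`, while the pair is a small `Copy` value.

```rust
// Hypothetical stand-ins for illustration; NOT the real MMA types or intrinsics.

/// Models a 512-bit accumulator (in the role of `__vector_quad`)
/// as 4 rows of 2 f64 lanes. Deliberately not `Clone`/`Copy`.
#[derive(Debug, PartialEq)]
pub struct Accumulator {
    rows: [[f64; 2]; 4],
}

/// Models a 256-bit register pair (in the role of `__vector_pair`)
/// as 4 f64 lanes. Small enough that by-value passing is cheap.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct RegisterPair {
    lanes: [f64; 4],
}

impl Accumulator {
    /// Creates an accumulator with every lane set to zero.
    pub fn zeroed() -> Self {
        Accumulator { rows: [[0.0; 2]; 4] }
    }

    /// Returns a copy of the lanes for inspection.
    pub fn rows(&self) -> [[f64; 2]; 4] {
        self.rows
    }
}

impl RegisterPair {
    pub fn new(lanes: [f64; 4]) -> Self {
        RegisterPair { lanes }
    }
}

/// Outer-product accumulate: `acc[i][j] += a[i] * b[j]`.
/// The accumulator is taken by pointer, the pair by value.
pub fn outer_product_accumulate(acc: &mut Accumulator, a: RegisterPair, b: [f64; 2]) {
    for i in 0..4 {
        for j in 0..2 {
            acc.rows[i][j] += a.lanes[i] * b[j];
        }
    }
}
```

Leaving `Clone` and `Copy` off the accumulator prevents accidental duplication but does not stop it being moved by value, so a real "never passed anywhere" rule would still need dedicated compiler support.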

Arm Scalable Matrix Extensions

It is almost more like a dedicated thread-local allocation... the "ZArray"... that gets reinterpreted or examined along various dimensions. Then you set the CPU into Matrix Math... sorry, "Arm Streaming SVE" state... and Big Array Math happens, accumulating into the ZArray. The Big Array Math however is expressible as vector operations that just might use a different size than the normal Arm SVE operations, which is why it's "Streaming SVE": the model is "matrix math is mostly a pile of vector operations, done really fast". This does remove the ability to use some of the more complicated Arm SVE2 operations while in it.

C Interop

SME2: there is probably an assumption about what state the ZArray is in on procedure entry/exit, likely "none, that's caller-saved"
SME: there is probably an assumption about whether the CPU is in "Matrix Math" or "Vector Math" states on procedure entry/exit, and it is probably "Vector Math" ("Non-Streaming") state
otherwise, it basically just seems to use the same vector registers, so there's that mercy
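
The entry/exit assumption guessed at above can be sketched as a tiny model. This assumes the convention really is "non-streaming on entry and exit", which is only my reading so far. `ModeState` and `with_streaming` are hypothetical names invented for this sketch and do not correspond to any real Arm or Rust API. The point is just that whoever switches mode is responsible for switching back before returning.

```rust
/// Hypothetical model of the per-thread mode bit; not a real Arm or Rust API.
#[derive(Debug, Default)]
pub struct ModeState {
    streaming: bool,
}

impl ModeState {
    /// Reports whether the modeled CPU is currently in streaming mode.
    pub fn is_streaming(&self) -> bool {
        self.streaming
    }
}

/// Runs `body` with the modeled mode set to streaming, then restores
/// whatever mode was active before, so the caller's assumption about
/// the mode on return still holds.
///
/// Not panic-safe: if `body` panics, the previous mode is not restored.
pub fn with_streaming<R>(state: &mut ModeState, body: impl FnOnce(&ModeState) -> R) -> R {
    let previous = state.streaming;
    state.streaming = true;
    let result = body(&*state);
    state.streaming = previous;
    result
}
```

A closure passed to `with_streaming` observes `is_streaming() == true`, and the state reads as non-streaming again once the call returns.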

x86 AMX Tiles

The tiles seem to be more "classic" registers, but use an interesting API. They are also "shape-changing" in a way. I assume @sayantn knows more about this.

C Interop

there is probably an assumption about what state the tiles are in on procedure entry/exit (also probably "none, caller-saved")
there is probably an assumption about what shape the tiles are in on procedure entry/exit

Intrinsics

The __tile1024i type seems to be passed both by-value and handled by-pointer, for a typical signature looking like this:
```
fn some_tile_intrinsic(dst: &mut __tile1024i, src_a: __tile1024i, src_b: __tile1024i)
```
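
For anyone who wants to poke at that signature shape, here is a self-contained software sketch. All of it is assumed for illustration: `Tile`, `ShapeMismatch`, and `tile_add_bytes` are hypothetical, the element-wise add is a placeholder and not a real AMX operation, and the only hardware-derived detail is the 16 rows by 64 bytes upper bound implied by a 1024-byte tile. It shows a destination handled by pointer, sources passed by value, and the shape travelling with the value so a mismatch can be detected.

```rust
/// Hypothetical stand-in for a 1 KiB tile plus its shape; NOT the real `__tile1024i`.
#[derive(Clone, Copy)]
pub struct Tile {
    rows: u16,
    cols_bytes: u16,
    data: [u8; 1024], // 16 rows with a fixed 64-byte stride
}

/// Error returned when the tiles involved do not share one shape.
#[derive(Debug, PartialEq)]
pub struct ShapeMismatch;

impl Tile {
    /// Creates a zeroed tile, or `None` if the shape exceeds 16 rows x 64 bytes.
    pub fn new(rows: u16, cols_bytes: u16) -> Option<Tile> {
        if rows <= 16 && cols_bytes <= 64 {
            Some(Tile { rows, cols_bytes, data: [0; 1024] })
        } else {
            None
        }
    }

    /// Returns a copy with every byte inside the active shape set to `byte`.
    pub fn fill(mut self, byte: u8) -> Tile {
        for r in 0..self.rows as usize {
            for c in 0..self.cols_bytes as usize {
                self.data[r * 64 + c] = byte;
            }
        }
        self
    }

    /// Reads one byte, or `None` if the position is outside the active shape.
    pub fn get(&self, row: usize, col: usize) -> Option<u8> {
        if row < self.rows as usize && col < self.cols_bytes as usize {
            Some(self.data[row * 64 + col])
        } else {
            None
        }
    }
}

/// Placeholder operation: `dst = src_a + src_b` byte-wise with wrapping.
/// Destination by pointer, sources by value, mirroring the signature above.
pub fn tile_add_bytes(dst: &mut Tile, src_a: Tile, src_b: Tile) -> Result<(), ShapeMismatch> {
    let shape = (dst.rows, dst.cols_bytes);
    if (src_a.rows, src_a.cols_bytes) != shape || (src_b.rows, src_b.cols_bytes) != shape {
        return Err(ShapeMismatch);
    }
    for r in 0..dst.rows as usize {
        for c in 0..dst.cols_bytes as usize {
            let i = r * 64 + c;
            dst.data[i] = src_a.data[i].wrapping_add(src_b.data[i]);
        }
    }
    Ok(())
}
```

Passing a 1 KiB `Copy` struct by value is exactly the kind of thing a real ABI would need to pin down, which is part of why the question in this issue matters.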
