[AArch64] [Windows] Functions using SVE can fail "WinCFI not supported with SVE vectors"

mstorsjo commented 5 months ago

Compiling AArch64 code that uses SVE features, for a Windows target, can seem to work at first, but fail later when functions become more complex.

This has been observed in libaom (https://aomedia.googlesource.com/aom), since https://aomedia.googlesource.com/aom/+/04b91c17bf976b0616ab94fa6cd36892d47e9ce5%5E%21/, when compiled for an aarch64-mingw target.

When the SVE functions end up needing to back up and restore SVE vectors, they trigger the failed "WinCFI not supported with SVE vectors" assert.

This can be triggered with this reduced code snippet from libaom:

#include <arm_neon_sve_bridge.h>
int64x2_t a(int64x2_t acc, int16x8_t b, int16x8_t c) {
  return svget_neonq_s64(svdot_s64(svset_neonq_s64(svundef_s64(), acc),
                                   svset_neonq_s16(svundef_s16(), b),
                                   svset_neonq_s16(svundef_s16(), c)));
}
void e();
int16x8_t d(uint16x8x2_t g) {
  int16x8_t f[8];
  e();
  int16x8_t j = vextq_s16(g.val[0], g.val[1], 0);
  int64x2_t i = a(vdupq_n_s64(0), j, f[0]);
  int64x2_t l;
  int64x2_t k = vpaddq_s64(i, l);
  int64x2_t m;
  int32x4_t n = vcombine_s32(vmovn_s64(k), vmovn_s64(m));
  int32x4_t o;
  return vcombine_s16(vmovn_s32(n), vmovn_s32(o));
}

$ clang -target aarch64-windows-gnu -c repro.c -march=armv8-a+sve -O2
clang: ../lib/Target/AArch64/AArch64InstrInfo.cpp:5444: void llvm::emitFrameOffset(llvm::MachineBasicBlock&, llvm::MachineBasicBlock::iterator, const llvm::DebugLoc&, unsigned int, unsigned int, llvm::StackOffset, const llvm::TargetInstrInfo*, llvm::MachineInstr::MIFlag, bool, bool, bool*, bool, llvm::StackOffset, unsigned int): Assertion `!(NeedsWinCFI && (NumPredicateVectors || NumDataVectors)) && "WinCFI not supported with SVE vectors"' failed.

It can also be reproduced with a more targeted handwritten testcase:

#include <arm_sve.h>
void other(void);
void func(svfloat32_t a) {
  other();
}

$ clang -target aarch64-windows-gnu -c repro.c -march=armv8-a+sve
clang: ../lib/Target/AArch64/AArch64InstrInfo.cpp:5444: void llvm::emitFrameOffset(llvm::MachineBasicBlock&, llvm::MachineBasicBlock::iterator, const llvm::DebugLoc&, unsigned int, unsigned int, llvm::StackOffset, const llvm::TargetInstrInfo*, llvm::MachineInstr::MIFlag, bool, bool, bool*, bool, llvm::StackOffset, unsigned int): Assertion `!(NeedsWinCFI && (NumPredicateVectors || NumDataVectors)) && "WinCFI not supported with SVE vectors"' failed.

I guess there's not much we can do about this, until MS specifies SEH unwind opcodes for dealing with scalable vectors, or until they specify something that differs from AAPCS64 which scalable vector registers need to be preserved in the Windows calling convention.

CC @efriedma-quic

llvmbot commented 5 months ago

@llvm/issue-subscribers-backend-aarch64

Author: Martin Storsjö (mstorsjo)

Compiling AArch64 code that uses SVE features, for a Windows target, can seem to work at first, but fail later when functions become more complex. This has been observed in libaom (https://aomedia.googlesource.com/aom), since https://aomedia.googlesource.com/aom/+/04b91c17bf976b0616ab94fa6cd36892d47e9ce5%5E%21/, when compiled for an aarch64-mingw target. When the SVE functions end up needing to back up and restore SVE vectors, they trigger the failed "WinCFI not supported with SVE vectors" assert. This can be triggered with this reduced code snippet from libaom: ```c #include <arm_neon_sve_bridge.h> int64x2_t a(int64x2_t acc, int16x8_t b, int16x8_t c) { return svget_neonq_s64(svdot_s64(svset_neonq_s64(svundef_s64(), acc), svset_neonq_s16(svundef_s16(), b), svset_neonq_s16(svundef_s16(), c))); } void e(); int16x8_t d(uint16x8x2_t g) { int16x8_t f[8]; e(); int16x8_t j = vextq_s16(g.val[0], g.val[1], 0); int64x2_t i = a(vdupq_n_s64(0), j, f[0]); int64x2_t l; int64x2_t k = vpaddq_s64(i, l); int64x2_t m; int32x4_t n = vcombine_s32(vmovn_s64(k), vmovn_s64(m)); int32x4_t o; return vcombine_s16(vmovn_s32(n), vmovn_s32(o)); } ``` ```console $ clang -target aarch64-windows-gnu -c repro.c -march=armv8-a+sve -O2 clang: ../lib/Target/AArch64/AArch64InstrInfo.cpp:5444: void llvm::emitFrameOffset(llvm::MachineBasicBlock&, llvm::MachineBasicBlock::iterator, const llvm::DebugLoc&, unsigned int, unsigned int, llvm::StackOffset, const llvm::TargetInstrInfo*, llvm::MachineInstr::MIFlag, bool, bool, bool*, bool, llvm::StackOffset, unsigned int): Assertion `!(NeedsWinCFI && (NumPredicateVectors || NumDataVectors)) && "WinCFI not supported with SVE vectors"' failed. ``` It can also be reproduced with a more targeted handwritten testcase: ```c #include <arm_sve.h> void other(void); void func(svfloat32_t a) { other(); } ``` ```console $ clang -target aarch64-windows-gnu -c repro.c -march=armv8-a+sve clang: ../lib/Target/AArch64/AArch64InstrInfo.cpp:5444: void llvm::emitFrameOffset(llvm::MachineBasicBlock&, llvm::MachineBasicBlock::iterator, const llvm::DebugLoc&, unsigned int, unsigned int, llvm::StackOffset, const llvm::TargetInstrInfo*, llvm::MachineInstr::MIFlag, bool, bool, bool*, bool, llvm::StackOffset, unsigned int): Assertion `!(NeedsWinCFI && (NumPredicateVectors || NumDataVectors)) && "WinCFI not supported with SVE vectors"' failed. ``` I guess there's not much we can do about this, until MS specifies SEH unwind opcodes for dealing with scalable vectors, or until they specify something that differs from AAPCS64 which scalable vector registers need to be preserved in the Windows calling convention. CC @efriedma-quic

efriedma-quic commented 5 months ago

AAPCS says "If a subroutine takes at least one argument in scalable vector registers or scalable predicate registers, or if it is a function that returns results in such registers, it must ensure that the entire contents of z8-z23 are preserved across the call. In other cases it need only preserve the low 64 bits of z8-z15." MSVC docs say we follow AAPCS rules for argument passing, and we can't preserve those registers without corresponding unwind codes... so we're stuck. There's no way to correctly emit a function with an argument of SVE type.

I guess in cases where we're only emitting the unwind info for debugging, we could add a hack to skip the SVE spills, but that's pretty ugly.

Either we need new unwind codes, or we need a different ABI on Windows targets. A different ABI isn't that hard to implement... but I don't really want to do that without some guidance from Microsoft.

CC @dpaoliello

mstorsjo commented 5 months ago

In case we don't get a good way forward for handling this, and more code in the wild will try to use SVE when the compiler supports it, we probably should error out in these cases in a more graceful way than a failed assert. Ideally even in a way that a project's build scripts can easily deduce that SVE isn't supported, before running into it while compiling the code. (OTOH, with asserts disabled, I guess it would compile just fine even if the unwind info doesn't match.) I wonder whether it makes sense to try to support compiling SVE in the limited cases as long as AAPCS doesn't say we need to back up those registers... Just disallowing all of SVE on Windows targets might be the safest alternative?

mstorsjo commented 2 months ago

Also, for reference - if compiling with a Clang that doesn't have asserts enabled, the error one hits is a different one:

error: Incorrect size for av1_highbd_warp_affine_sve epilogue: 48 bytes of instructions in range, but .seh directives corresponding to 44 bytes

pmsjt commented 2 months ago

Just as it is the case for the original Arm64 calling convention, Window's SVE calling convention does not add any new callee-saved registers. A corollary of this is that no new unwind opcodes are necessary for SVE. This does not mean that all use of SVE is excused from saving/restoring and associated unwinding information. The lower 128 bits of SVE registers are overlayed on Neon registers. The lower 64-bits of V8 through V15 are callee-saved. This means that any function using SVE registers Z8 through Z15 must save the lower 64-bits of the associated Neon registers and decorate such operations with the associated unwind ops, just as if they were being used as Neon.

efriedma-quic commented 2 months ago

That mostly solves the problem, I guess.

The other issue is that there isn't any opcode corresponding to addvl, which you need if you have SVE stack temporaries and frame-pointer elimination. We can work around that by forcing a frame pointer, but it's a bit of a pain to implement.

pmsjt commented 2 months ago

The ADDVL instruction allocates an amount of stack that is only know at runtime. This is no different than the use of _alloca() and, yes, implies the use of a frame pointer.

efriedma-quic commented 2 months ago

LLVM normally uses a stack layout that looks something like this:

stack
caller arguments
callee-saves
SVE temporaries
non-SVE temporaries
end of stack (SP)

This is fine without a frame pointer, as long as you can represent it in the unwind info.

pmsjt commented 2 months ago

And that is still perfectly reasonable as long as the size is known a priori. Once the size is not known at compile time, it becomes harder to dismiss the frame pointer use. SVE has that about it. We could introduce a new opcode which is like alloc_sve - a cousin of alloc_small but with the argument being a multiple of vector size. But that raises multiple hurdles, such as having the unwinder know about the vector length even when processing stacks offline, or cross thread, like in a dump or garbage collection. Given that the use of frame pointers is strongly recommended, this sounds very unnecessary and borderline redundant.

efriedma-quic commented 2 months ago

The way the LLVM implementation is structured, forcing a frame pointer is a bit tricky to implement. We want to avoid adding frame pointers to every function (particularly small leaf functions)... but we don't know whether there are scalable spill slots until after register allocation. I'm sure we can come up with some solution, though. Maybe we can detect whether the code refers to any SVE registers before register allocation, and reserve the frame pointer register during register allocation if it does.

Do you have a timeline for updating the official documentation? I'd prefer not to point to comments on the issue tracker in the commit message...

pmsjt commented 2 months ago

How is _alloca() implemented?

efriedma-quic commented 2 months ago

In SelectionDAG, when we lower a variable alloca, we record that, and that force-enables frame pointer emission. So we can compute whether we need a frame pointer before it becomes relevant. If you have local variables of SVE type, that could work basically the same way. The tricky part is spill slots generated by the register allocator. We don't know if the register allocator will spill any SVE registers until after register allocation... but we need to reserve the frame pointer register before register allocation so the allocator doesn't use it.

So if we want to avoid unconditionally reserving the frame pointer register, we have to implement a heuristic to predict whether the register allocator will generate any SVE spill slots.

pmsjt commented 1 week ago

Hypothetical question to gage the pulse on options: Assuming Windows adds support for two new unwind opcodes. Let's call them save_regz for callee-saved registers (for the case when functions have SVE types in the signature) and alloc_z for generic, local, stack-bound SVE variables. How hard would it be to add the requirement that sudo-register 46 (VG) is always spilled every time one of the alloc_z or save_regz is opcodes are also used? This reduces a lot of complexity that has to do with determining VL in "offline" cases, such as dump analysis and remote debugging (not to mention the SSVE case, where spilling VG is already a requirement). What do you think?

efriedma-quic commented 1 week ago

See #83301 for what we ended up doing for SME on ELF targets.

If you have a stack dump, that includes all the registers as well, including the current value of VG. So I'm not sure what benefit you get from storing the VG in functions that don't modify it. That said, I don't see any complexity involved in emitting such a store; when we finalize the stack layout, we can compute whether the size of the stack frame depends on the VG.

I think you'd want the VG to be saved for all stack frames where the size of any spill/variable depends on the VG, not just stack frames where it's part of the prologue. I mean, you can do unwinding without the VG, but ideally you also want to allow your debugger to compute the values of local variables.

pmsjt commented 1 week ago

I think you are agreeing with me that having VG always saved is beneficial. Even if there are cases where the unwinder could infer VG by the distance between FP and SP and divide the Z saved area by the number of Zs spilled, you can always end up functions using alloca() where SP is free to move around by amounts that are only known at runtime.

Certainly, in cases where you are, for example, looking at a kernel dump where each process or thread can have a different VG, it might not be trivial to determine the VG of the userland stack. For purely userland dumps, I agree that the VG can probably be derived from the context representation for each thread, which should include VL.

efriedma-quic commented 1 week ago

I'm not sure I understand the context of "looking at a kernel dump"; if you don't have access to a register context, you can't unwind the stack in the traditional sense. How do you even know which part of memory is the stack, at that point? If you're reconstructing which parts of memory are stack frames based on heuristics, the contents of random SVE vector variable doesn't seem important.

I mean, if you can get someone writing such a heuristic unwinding tool to say having the VG would be useful, then fine, but I'm really not seeing it.

pmsjt commented 1 week ago

There are cases where threads can go into the kernel without saving any state. For example, for a syscall, where it is expected all SVE state to be scrubbed. But, even in that case, we must at least restore the VL for that thread and that is saved on our KTHREAD structure. So, you are right: the kernel debugger can potentially parse the KTHREAD (already has to for other reasons) and retrieve that thread's SVE VL and provide it to the unwind / stack backtrace process.

llvm / llvm-project

[AArch64] [Windows] Functions using SVE can fail "WinCFI not supported with SVE vectors" #80009