rust-lang / rust

LLVM can't support partial register clobbers #126713

Open jstarks opened 1 week ago

jstarks commented 1 week ago

On x86_64-pc-windows-msvc, the system ABI treats xmm0-5 as volatile (caller-saved) registers and xmm6-15 as non-volatile (callee-saved) registers. However, clobber_abi("C") marks all of xmm6-15 as clobbered as well, which produces a bunch of wasteful register saves/restores.

This matches the documentation (which specifies that xmm6-15 will be clobbered), but it's almost certainly not what the developer wanted, since the unnecessary clobbering causes real performance problems. The workaround is to avoid clobber_abi on Windows and to clobber the volatile registers manually; this is a pain, and it may cause problems down the line as more volatile registers are added to the ABI.

So a breaking change seems warranted here.
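
For reference, the manual workaround looks something like the sketch below. This is hypothetical, not a drop-in: bar is a stand-in local function, the register list has to be kept in sync with the ABI by hand, and shadow-space/stack-alignment handling is elided just as in the repro further down.

// Sketch of the manual-clobber workaround: list only the registers the
// Windows x64 ABI actually treats as volatile, instead of clobber_abi("C").
extern "C" fn bar() {}

#[cfg(all(target_arch = "x86_64", target_os = "windows"))]
pub fn foo_manual() {
    unsafe {
        std::arch::asm!(
            "call {f}",
            f = sym bar,
            // Volatile GPRs under the Windows x64 ABI.
            out("rax") _, out("rcx") _, out("rdx") _,
            out("r8") _, out("r9") _, out("r10") _, out("r11") _,
            // Only xmm0-xmm5 are volatile; xmm6-xmm15 are left untouched,
            // avoiding the spills shown below -- but this silently misses
            // any registers the ABI later declares volatile.
            out("xmm0") _, out("xmm1") _, out("xmm2") _,
            out("xmm3") _, out("xmm4") _, out("xmm5") _,
        );
    }
}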

[Godbolt link](https://godbolt.org/#g:!((g:!((g:!((h:codeEditor,i:(filename:'1',fontScale:14,fontUsePx:'0',j:1,lang:rust,selection:(endColumn:24,endLineNumber:5,positionColumn:24,positionLineNumber:5,selectionStartColumn:24,selectionStartLineNumber:5,startColumn:24,startLineNumber:5),source:'%23%5Bno_mangle%5D%0Apub+fn+foo()+%7B%0A++++unsafe+%7B%0A++++++++std::arch::asm!!+%7B%0A++++++++++++%22call+bar%22,%0A++++++++++++clobber_abi(%22C%22),%0A++++++++%7D%0A++++%7D%0A%7D%0A'),l:'5',n:'1',o:'Rust+source+%231',t:'0')),k:33.333333333333336,l:'4',n:'0',o:'',s:0,t:'0'),(g:!((h:compiler,i:(compiler:r1790,filters:(b:'1',binary:'1',binaryObject:'1',commentOnly:'1',debugCalls:'1',demangle:'0',directives:'1',execute:'1',intel:'0',libraryCode:'1',trim:'1',verboseDemangling:'0'),flagsViewOpen:'1',fontScale:14,fontUsePx:'0',j:1,lang:rust,libs:!(),options:'--target+x86_64-pc-windows-msvc+-C+opt-level%3D3',overrides:!(),selection:(endColumn:1,endLineNumber:1,positionColumn:1,positionLineNumber:1,selectionStartColumn:1,selectionStartLineNumber:1,startColumn:1,startLineNumber:1),source:1),l:'5',n:'0',o:'+rustc+1.79.0+(Editor+%231)',t:'0')),k:33.333333333333336,l:'4',n:'0',o:'',s:0,t:'0'),(g:!((h:output,i:(compilerName:'x64+msvc+v19.latest',editorid:1,fontScale:14,fontUsePx:'0',j:1,wrap:'1'),l:'5',n:'0',o:'Output+of+rustc+1.79.0+(Compiler+%231)',t:'0')),k:33.33333333333333,l:'4',n:'0',o:'',s:0,t:'0')),l:'2',n:'0',o:'',t:'0')),version:4)

#[no_mangle]
pub fn foo() {
    unsafe {
        std::arch::asm! {
            "call bar",
            clobber_abi("C"),
        }
    }
}

Asm:

foo:
        sub     rsp, 168
        movaps  xmmword ptr [rsp + 144], xmm15
        movaps  xmmword ptr [rsp + 128], xmm14
        movaps  xmmword ptr [rsp + 112], xmm13
        movaps  xmmword ptr [rsp + 96], xmm12
        movaps  xmmword ptr [rsp + 80], xmm11
        movaps  xmmword ptr [rsp + 64], xmm10
        movaps  xmmword ptr [rsp + 48], xmm9
        movaps  xmmword ptr [rsp + 32], xmm8
        movaps  xmmword ptr [rsp + 16], xmm7
        movaps  xmmword ptr [rsp], xmm6

        call    bar

        movaps  xmm6, xmmword ptr [rsp]
        movaps  xmm7, xmmword ptr [rsp + 16]
        movaps  xmm8, xmmword ptr [rsp + 32]
        movaps  xmm9, xmmword ptr [rsp + 48]
        movaps  xmm10, xmmword ptr [rsp + 64]
        movaps  xmm11, xmmword ptr [rsp + 80]
        movaps  xmm12, xmmword ptr [rsp + 96]
        movaps  xmm13, xmmword ptr [rsp + 112]
        movaps  xmm14, xmmword ptr [rsp + 128]
        movaps  xmm15, xmmword ptr [rsp + 144]
        add     rsp, 168
        ret

I expected that foo would just establish a frame and call bar, without saving/restoring the XMM registers. (The 168-byte frame is 10 × 16 bytes of XMM spill slots plus 8 bytes of padding to keep the stack 16-byte aligned.)

Meta

rustc --version --verbose:

rustc 1.78.0 (9b00956e5 2024-04-29)
binary: rustc
commit-hash: 9b00956e56009bab2aa15d7bff10916599e3d6d6
commit-date: 2024-04-29
host: x86_64-unknown-linux-gnu
release: 1.78.0
LLVM version: 18.1.2

This also repros on nightly.

workingjubilee commented 1 week ago

Thank you for pointing this out!

It feels dubious to call it a "breaking change" to apply this diff:

 /// Always returns true.
 pub fn returns_true() -> bool {
-    false
+    true
 }

But regardless of whether it is or isn't, I agree that clobber_abi should definitely do what it says it does, because it is very important that unsafe code uphold its contracts exactly. We should definitely fix this soon, considering that we're about to see x86 architectures with 32 GPRs!

jstarks commented 1 week ago

It looks like this was known by the author (@Amanieu):

            InlineAsmClobberAbi::X86_64Win => clobbered_regs! {
                X86 X86InlineAsmReg {
                    // rdi and rsi are callee-saved on windows
                    ax, cx, dx, r8, r9, r10, r11,

                    // xmm6-xmm15 are callee-saved on windows, but we need to
                    // mark them as clobbered anyways because the upper portions
                    // of ymm6-ymm15 are volatile.
                    xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7,
                    xmm8, xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15,
                    zmm16, zmm17, zmm18, zmm19, zmm20, zmm21, zmm22, zmm23,
                    zmm24, zmm25, zmm26, zmm27, zmm28, zmm29, zmm30, zmm31,

                    k0, k1, k2, k3, k4, k5, k6, k7,

                    mm0, mm1, mm2, mm3, mm4, mm5, mm6, mm7,
                    st0, st1, st2, st3, st4, st5, st6, st7,
                    tmm0, tmm1, tmm2, tmm3, tmm4, tmm5, tmm6, tmm7,
                }
            },
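
To make that comment concrete: under the Windows x64 ABI the callee preserves only the low 128 bits of ymm6-ymm15, so a value the compiler keeps in one of those registers can lose its upper half across a call. A hypothetical sketch (bar is a stand-in local function):

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx")]
unsafe fn upper_halves_are_volatile() -> std::arch::x86_64::__m256 {
    use std::arch::x86_64::*;

    extern "C" fn bar() {}

    // `v` may be kept live in one of ymm6..ymm15 across the call.
    let v = _mm256_set1_ps(1.0);
    std::arch::asm!("call {f}", f = sym bar, clobber_abi("C"));
    // If clobber_abi("C") marked only xmm6-xmm15 as clobbered (not the
    // full ymm/zmm registers), the compiler could assume `v` survived
    // the call -- even though `bar` was free to destroy its upper half.
    _mm256_add_ps(v, v)
}
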
workingjubilee commented 1 week ago

Oh, right!

I don't think we have a reasonable way to mark "only half a register" as clobbered!

workingjubilee commented 1 week ago

Does LLVM have a way to express these partial clobbers in its assembly language?

Amanieu commented 1 week ago

And if you look just below, you'll see we have the same issue on AArch64.

                AArch64 AArch64InlineAsmReg {
                    x0, x1, x2, x3, x4, x5, x6, x7,
                    x8, x9, x10, x11, x12, x13, x14, x15,
                    x16, x17, x18, x30,

                    // Technically the low 64 bits of v8-v15 are preserved, but
                    // we have no way of expressing this using clobbers.
                    v0, v1, v2, v3, v4, v5, v6, v7,
                    v8, v9, v10, v11, v12, v13, v14, v15,
                    v16, v17, v18, v19, v20, v21, v22, v23,
                    v24, v25, v26, v27, v28, v29, v30, v31,

                    p0, p1, p2, p3, p4, p5, p6, p7,
                    p8, p9, p10, p11, p12, p13, p14, p15,
                    ffr,

                }

This is unfortunately a limitation of LLVM: it doesn't give us a way to indicate that only half of a register is clobbered.

workingjubilee commented 1 week ago

Yeah, ideally there would be a special "vzeroupper" clobber but there isn't one, it seems.

workingjubilee commented 1 week ago

So it seems this is not a Windows-specific issue! It's also less of a concern for the x86 extension I mentioned, since that doesn't change the definition of any existing registers: the new GPRs simply go from not existing to existing, so there's no partial clobber to express.

workingjubilee commented 1 week ago

It looks like the Windows x64 version of this bug is already reported in https://github.com/llvm/llvm-project/issues/50566

beetrees commented 1 week ago

Regardless of if/when LLVM gets fixed to handle partial clobbers, the documentation could still be updated to only guarantee that the upper portions of y/zmm6-15 are marked as clobbered by clobber_abi("C"). The current implementation would remain valid, as the compiler is always free to save and restore more registers (or larger parts of registers) than the asm! author actually clobbered.
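
For example (a hypothetical sketch), under that relaxed guarantee an asm! author who overwrites the full 128-bit register would have to declare it explicitly instead of relying on clobber_abi("C") alone:

#[cfg(target_arch = "x86_64")]
unsafe fn zero_xmm6() {
    std::arch::asm!(
        "xorps xmm6, xmm6", // overwrites all 128 bits of xmm6
        out("xmm6") _,      // explicit clobber: no longer implied by the docs
        clobber_abi("C"),
    );
}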