llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org

Clobbered XMM registers are not preserved around Intel-style inline assembly blocks in MS-ABI functions #50566

Open llvmbot opened 3 years ago

llvmbot commented 3 years ago
Bugzilla Link 51222 (https://llvm.org/bz51222)
Version 12.0
OS All
Reporter LLVM Bugzilla Contributor

Extended Description

The issue was first observed with clang 10.0 bundled with MS Visual Studio 2019 on Windows, but was later confirmed with clang 7.0.1 on Linux (CentOS 7.7) and with clang 12.0 bundled with Xcode 12.2 on macOS.

Here is a minimal reproducible example:

    void test(void)
    {
        __asm
        {
            VPXOR YMM6, YMM6, YMM6
        }
    }

When compiled on Windows with

    clang-cl /O2 /FA -c test.cpp

it produces the following assembly (meta-information skipped for clarity)

    #APP
    vpxor   ymm6, ymm6, ymm6
    #NO_APP
    ret

As you can see, XMM6 is not preserved even though it is clobbered by the vpxor instruction (under the MS x64 ABI, XMM6-XMM15 are nonvolatile and must be preserved by the callee).

However, if I pass the -mavx2 flag to the compiler

    clang-cl /O2 -mavx2 /FA -c test.cpp

the produced assembly turns into

    sub rsp, 24
    vmovaps xmmword ptr [rsp], xmm6 # 16-byte Spill
    #APP
    vpxor   ymm6, ymm6, ymm6
    #NO_APP
    vmovaps xmm6, xmmword ptr [rsp] # 16-byte Reload
    add rsp, 24
    vzeroupper
    ret

XMM6 is now preserved.

The same issue is present on Linux and macOS; however, the ms_abi calling convention must be stated explicitly there:

    void __attribute__((ms_abi)) test(void)
    {
        __asm
        {
            VPXOR YMM6, YMM6, YMM6
        }
    }

Compiling on Linux with

    clang -O2 -fasm-blocks -S test.cpp

produces

    #APP
    vpxor   %ymm6, %ymm6, %ymm6
    #NO_APP
    retq

Compiling with

    clang -O2 -mavx2 -fasm-blocks -S test.cpp

produces

    subq    $24, %rsp
    vmovaps %xmm6, (%rsp)           # 16-byte Spill
    #APP
    vpxor   %ymm6, %ymm6, %ymm6
    #NO_APP
    vmovaps (%rsp), %xmm6           # 16-byte Reload
    addq    $24, %rsp
    vzeroupper
    retq

Compiling on macOS with

    /System/Volumes/Data/Applications/Xcode_12.2.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/clang -O2 -fasm-blocks -S test.cpp

produces

    pushq   %rbp
    movq    %rsp, %rbp
    ## InlineAsm Start
    vpxor   %ymm6, %ymm6, %ymm6
    ## InlineAsm End
    popq    %rbp
    retq

Compiling with

    /System/Volumes/Data/Applications/Xcode_12.2.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/clang -O2 -mavx2 -fasm-blocks -S test.cpp

produces

    pushq   %rbp
    movq    %rsp, %rbp
    subq    $16, %rsp
    vmovaps %xmm6, -16(%rbp)        ## 16-byte Spill
    ## InlineAsm Start
    vpxor   %ymm6, %ymm6, %ymm6
    ## InlineAsm End
    vmovaps -16(%rbp), %xmm6        ## 16-byte Reload
    addq    $16, %rsp
    popq    %rbp
    vzeroupper
    retq
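
To make the consequence concrete, here is a rough harness (not part of the original report) that pins a known value in XMM6 around a call to the ms_abi test() above. The register pinning is illustrative rather than guaranteed, since nothing stops the compiler from reusing XMM6 between the two asm statements, but in practice it shows the corruption: with the broken codegen, vpxor zeroes the value.

    #include <cstdio>

    // The ms_abi function from the reproducer above (the attribute is
    // implicit when building on Windows, explicit elsewhere).
    void __attribute__((ms_abi)) test(void);

    int main()
    {
        unsigned long long before = 0x1122334455667788ULL, after = 0;
        // Pin a known value in XMM6. Under the MS ABI, XMM6 is
        // nonvolatile, so it must survive the call to test() unchanged.
        __asm__ __volatile__("movq %0, %%xmm6" : : "r"(before) : "xmm6");
        test();
        // Read XMM6 back; with the missing spill/reload it comes back as 0.
        __asm__ __volatile__("movq %%xmm6, %0" : "=r"(after));
        std::printf("before=%016llx after=%016llx\n", before, after);
        return 0;
    }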

Additional comments and observations:

- The issue only happens with Intel-style assembly blocks. Using GCC-style inline assembly and explicitly mentioning the registers in the clobber list produces the correct code (see the first sketch below).
- The real-world code, of course, is much more involved and contains cpuid-based branches for AVX2 and non-AVX2 platforms. That means we must compile without the -mavx2 switch to support both (see the second sketch below).
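
For reference, a minimal sketch of the GCC-style equivalent with an explicit clobber, which (as noted in the first point) is handled correctly; depending on the toolchain, the mnemonic may additionally require AVX to be enabled for the assembler to accept it:

    void test(void)
    {
        // Extended (GCC-style) inline assembly: the "ymm6" clobber tells
        // the compiler the register is modified, so it emits the
        // spill/reload that the MS ABI requires.
        __asm__ __volatile__("vpxor %%ymm6, %%ymm6, %%ymm6" : : : "ymm6");
    }

And a minimal sketch of the run-time dispatch shape mentioned in the second point, using __builtin_cpu_supports (available in both GCC and Clang); kernel_avx2 and kernel_sse2 are hypothetical names standing in for the real implementations:

    static void kernel_avx2(void) { /* AVX2 implementation */ }
    static void kernel_sse2(void) { /* baseline implementation */ }

    void run_kernel(void)
    {
        // Branch once on the CPUID-derived feature bit; only the
        // translation unit containing kernel_avx2 needs AVX2 codegen.
        if (__builtin_cpu_supports("avx2"))
            kernel_avx2();
        else
            kernel_sse2();
    }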

llvmbot commented 3 years ago

In addition to the scenarios above, below is another problem case that we observe. This time, many XMM registers are preserved even though they do not need to be.

    void test(void)
    {
        __asm
        {
            VZEROUPPER
        }
    }

Compiled with

    clang-cl /O2 /FA -c test.cpp

produces

    #APP
    vzeroupper
    #NO_APP
    ret

which looks correct. However, compiling with

    clang-cl -mavx2 /O2 /FA -c test.cpp

produces

    sub rsp, 168
    vmovaps xmmword ptr [rsp + 144], xmm15 # 16-byte Spill
    vmovaps xmmword ptr [rsp + 128], xmm14 # 16-byte Spill
    vmovaps xmmword ptr [rsp + 112], xmm13 # 16-byte Spill
    vmovaps xmmword ptr [rsp + 96], xmm12 # 16-byte Spill
    vmovaps xmmword ptr [rsp + 80], xmm11 # 16-byte Spill
    vmovaps xmmword ptr [rsp + 64], xmm10 # 16-byte Spill
    vmovaps xmmword ptr [rsp + 48], xmm9 # 16-byte Spill
    vmovaps xmmword ptr [rsp + 32], xmm8 # 16-byte Spill
    vmovaps xmmword ptr [rsp + 16], xmm7 # 16-byte Spill
    vmovaps xmmword ptr [rsp], xmm6 # 16-byte Spill
    #APP
    vzeroupper
    #NO_APP
    vmovaps xmm6, xmmword ptr [rsp] # 16-byte Reload
    vmovaps xmm7, xmmword ptr [rsp + 16] # 16-byte Reload
    vmovaps xmm8, xmmword ptr [rsp + 32] # 16-byte Reload
    vmovaps xmm9, xmmword ptr [rsp + 48] # 16-byte Reload
    vmovaps xmm10, xmmword ptr [rsp + 64] # 16-byte Reload
    vmovaps xmm11, xmmword ptr [rsp + 80] # 16-byte Reload
    vmovaps xmm12, xmmword ptr [rsp + 96] # 16-byte Reload
    vmovaps xmm13, xmmword ptr [rsp + 112] # 16-byte Reload
    vmovaps xmm14, xmmword ptr [rsp + 128] # 16-byte Reload
    vmovaps xmm15, xmmword ptr [rsp + 144] # 16-byte Reload
    add rsp, 168
    vzeroupper
    ret

As none of XMM6-XMM15 is actually modified by the inline assembly (VZEROUPPER zeroes only the upper 128 bits of the YMM registers; the lower XMM halves, which the MS ABI requires to be preserved, are untouched), there is no reason to spill and reload them. The additional VZEROUPPER emitted by the compiler is also redundant, as the inline assembly block already ends with one.

llvmbot commented 10 months ago

@llvm/issue-subscribers-backend-x86

Author: None (llvmbot)
