llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org

LLVM/Clang miscompiles GNU C generic vector code when asked to do 512-bit SIMD operations #59243

Closed ryao closed 1 year ago

ryao commented 1 year ago

A while back, Intel devised a technique for computing fletcher4 checksums quickly via 4 parallel accumulator streams, and used it to write a fast AVX2 implementation of fletcher4:

https://www.intel.com/content/www/us/en/developer/articles/technical/fast-computation-of-fletcher-checksums.html
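
For reference, scalar fletcher4 is just four dependent running sums over the input's 32-bit words, kept in 64-bit accumulators. A minimal sketch (function name mine):

#include <stddef.h>
#include <stdint.h>

/* Scalar fletcher4: four dependent running sums over the 32-bit words
 * of the buffer, accumulated in 64-bit variables.  Intel's technique
 * runs several copies of this in parallel, one per SIMD lane, and
 * recombines the lane sums at the end. */
static void
fletcher4_scalar(const uint32_t *buf, size_t words, uint64_t sums[4])
{
        uint64_t a = sums[0], b = sums[1], c = sums[2], d = sums[3];

        for (size_t i = 0; i < words; i++) {
                a += buf[i];
                b += a;
                c += b;
                d += c;
        }

        sums[0] = a; sums[1] = b; sums[2] = c; sums[3] = d;
}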

Later, someone extended it to perform 8 independent accumulator streams:

https://github.com/openzfs/zfs/commit/70b258fc962fd40673b9a47574cb83d8438e7d94

I found that I could obtain the same assembly that Intel wrote by compiling a generic version written in GNU C:

https://gcc.godbolt.org/z/e3Kf4TcPo
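
The generic version looks roughly like this (a hypothetical sketch of my own; the exact code is at the godbolt link):

#include <stddef.h>
#include <stdint.h>

/* Sketch of the generic 4-stream version.  Each 64-bit lane carries
 * one accumulator stream; the final cross-lane recombination is
 * omitted here. */
typedef uint32_t u32x4 __attribute__((vector_size(16))); /* 128 bits */
typedef uint64_t u64x4 __attribute__((vector_size(32))); /* 256 bits */

static void
fletcher4_4stream(const uint32_t *buf, size_t words, u64x4 acc[4])
{
        u64x4 a = acc[0], b = acc[1], c = acc[2], d = acc[3];
        const uint32_t *end = buf + words;

        for (; buf < end; buf += 4) {
                /* 128-bit load, zero-extended to 256 bits (vpmovzxdq). */
                a += __builtin_convertvector(*(const u32x4 *)buf, u64x4);
                b += a; /* the vpaddq chain in the listing below */
                c += b;
                d += c;
        }

        acc[0] = a; acc[1] = b; acc[2] = c; acc[3] = d;
}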

Specifically, I get the following output from Clang for the loop, which matches what Intel wrote:

.LBB0_2:                                # =>This Inner Loop Header: Depth=1
        vpmovzxdq       ymm4, xmmword ptr [rsi]
        vpaddq  ymm0, ymm4, ymm0
        vpaddq  ymm1, ymm0, ymm1
        vpaddq  ymm2, ymm1, ymm2
        vpaddq  ymm3, ymm2, ymm3
        add     rsi, 16
        cmp     rsi, rdx
        jb      .LBB0_2

I was then curious about the performance of the 8-stream version, so I modified my GNU C code to use 8 independent accumulator streams:

https://gcc.godbolt.org/z/vhvesMhdh

That produced the following for the loop:

.LBB0_2:                                # =>This Inner Loop Header: Depth=1
        vpmovzxdq       ymm4, xmmword ptr [rsi]
        vpaddq  ymm0, ymm0, ymm4
        vpaddq  ymm1, ymm0, ymm1
        vpaddq  ymm2, ymm1, ymm2
        vpaddq  ymm3, ymm2, ymm3
        add     rsi, 32
        cmp     rsi, rdx
        jb      .LBB0_2

If it were not for the add rsi, 32, I would have been certain that I had compiled the wrong code. I had expected to see either AVX-512 instructions, or another vpmovzxdq plus four more vpaddq instructions to process the additional 256 bits, but the compiler instead emitted code that assumes the AVX2 instructions can do 512-bit SIMD. There are other issues with the generated assembly too (visible at godbolt); in particular, it treats the loads and stores as if they operated on 512-bit vectors.

Interestingly, GCC has the same behavior, emitting 256-bit SIMD instructions as if they were 512-bit SIMD instructions. ICC, on the other hand, refuses to compile the 512-bit version, claiming that vector operations are not supported with these operand types when it sees the __builtin_convertvector().

Note that the checksum algorithm does 64-bit unsigned additions on 32-bit values. In the original Intel version, each loop iteration reads 128 bits, zero-extends the 32-bit components to 64-bit components (which is what __builtin_convertvector() does) to pad them to 256 bits, and then performs the 256-bit additions. For the 8-stream version, those two vector widths are doubled.
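
In GNU C terms, doubling both widths would look something like this (my sketch; note that vector_size() takes a byte count):

#include <stdint.h>

/* Doubled widths for the 8-stream version: 256-bit reads widened to
 * 512-bit sums.  vector_size() is specified in bytes, not lanes. */
typedef uint32_t u32x8 __attribute__((vector_size(32))); /* 8 x u32 = 256 bits */
typedef uint64_t u64x8 __attribute__((vector_size(64))); /* 8 x u64 = 512 bits */

/* Per iteration: a += __builtin_convertvector(*(const u32x8 *)buf, u64x8);
 * followed by the same b/c/d chain, with buf advancing 8 words. */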

topperc commented 1 year ago

Unless I'm mistaken, isn't the 512-bit code using vector_size(32)? That's a 32-byte, or 256-bit, vector.

ryao commented 1 year ago

> Unless I'm mistaken, isn't the 512-bit code using vector_size(32)? That's a 32-byte, or 256-bit, vector.

That is correct. The algorithm reads 32-bit values and then sums them in 64-bit registers. The 256-bit version does 128-bit reads. The 512-bit version does 256-bit reads.

llvmbot commented 1 year ago

@llvm/issue-subscribers-backend-x86

ryao commented 1 year ago

@EugeneZelenko This is not an x86-specific issue. The same issue happens on ppc64 and aarch64 too. The only difference is that the ppc64 and aarch64 instructions can only handle 128-bit vectors, but this bug causes LLVM to emit code that treats those vectors as if they were 256-bit in the 512-bit vector case.

EugeneZelenko commented 1 year ago

@ryao: Sorry for my premature, incorrect conclusion. Please add the relevant label.

topperc commented 1 year ago

> Unless I'm mistaken, isn't the 512-bit code using vector_size(32)? That's a 32-byte, or 256-bit, vector.
>
> That is correct. The algorithm reads 32-bit values and then sums them in 64-bit registers. The 256-bit version does 128-bit reads. The 512-bit version does 256-bit reads.

I'm not following. The original 128-bit code used vector_size(32) for u64x4. The 256-bit code still uses vector_size(32) for u64x8. Shouldn't it be vector_size(64)?
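
Since vector_size() takes a byte count rather than a lane count, the two declarations differ like this (hypothetical names, for illustration):

#include <stdint.h>

/* vector_size() counts bytes, not lanes: keeping 32 while renaming the
 * type to u64x8 does not actually widen the vector. */
typedef uint64_t u64x8_as_written __attribute__((vector_size(32))); /* 4 lanes */
typedef uint64_t u64x8_intended   __attribute__((vector_size(64))); /* 8 lanes */

_Static_assert(sizeof(u64x8_as_written) == 32, "still a 256-bit vector");
_Static_assert(sizeof(u64x8_intended) == 64, "a true 512-bit vector");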

ryao commented 1 year ago

> Unless I'm mistaken, isn't the 512-bit code using vector_size(32)? That's a 32-byte, or 256-bit, vector.
>
> That is correct. The algorithm reads 32-bit values and then sums them in 64-bit registers. The 256-bit version does 128-bit reads. The 512-bit version does 256-bit reads.
>
> I'm not following. The original 128-bit code used vector_size(32) for u64x4. The 256-bit code still uses vector_size(32) for u64x8. Shouldn't it be vector_size(64)?

I goofed when modifying this. Sorry for the noise.