Closed · ryao closed this issue 1 year ago
Unless I’m mistaken, isn’t the 512-bit code using vector_size(32)? That’s a 32-byte or 256-bit vector.
That is correct. The way that the algorithm works is that it reads 32-bit values and then sums them in 64-bit registers. The 256-bit version does 128-bit reads. The 512-bit version does 256-bit reads.
@llvm/issue-subscribers-backend-x86
@EugeneZelenko This is not an x86-specific issue. The same issue happens on ppc64 and aarch64 too. The only difference is that the ppc64 and aarch64 instructions can only handle 128-bit vectors, but this bug causes LLVM to emit code that treats the vectors as if they were 256-bit in the 512-bit vector case.
@ryao: Sorry for my premature incorrect conclusion. Please add relevant label.
I'm not following. The original 128-bit code used vector_size(32) for u64x4. The 256-bit code still uses vector_size(32) for u64x8. Shouldn't it be vector_size(64)?
I goofed when modifying this. Sorry for the noise.
A while back, Intel invented a technique for doing fast fletcher4 calculations via 4x parallel accumulator streams and used it to do a fast AVX2 implementation of fletcher4:
https://www.intel.com/content/www/us/en/developer/articles/technical/fast-computation-of-fletcher-checksums.html
Later, someone extended it to perform 8 independent accumulator streams:
https://github.com/openzfs/zfs/commit/70b258fc962fd40673b9a47574cb83d8438e7d94
I found that I could obtain the same assembly that Intel wrote by compiling a generic version written in GNU C:
https://gcc.godbolt.org/z/e3Kf4TcPo
Specifically, I get the following output from Clang for the loop, which matches what Intel wrote:
I was then curious about the performance of the 8-stream version, so I modified my GNU C code to use 8 independent accumulator streams:
https://gcc.godbolt.org/z/vhvesMhdh
That produced the following for the loop:
If it were not for the add rsi, 32, I would have been certain that I had compiled the wrong code. I had expected to see either AVX-512 instructions, or another vpmovzxdq and 4 more vpaddq instructions to process the additional 256 bits, but the compiler instead emitted code that assumes the AVX2 instructions are capable of 512-bit SIMD. There are other issues with the generated assembly too (visible at godbolt); in particular, it treats the loads and stores as if they operate on 512-bit vectors.

Interestingly, GCC has the same behavior, emitting 256-bit SIMD instructions as if they were 512-bit SIMD instructions. ICC, on the other hand, refuses to compile the 512-bit version, claiming that vector operations are not supported with these operand types when it sees the __builtin_convertvector().

Note that the checksum algorithm does 64-bit unsigned additions on 32-bit values. In the original Intel version, each loop iteration reads 128 bits, which are widened to 256 bits by zero-extending the 32-bit components to 64-bit components (which is what __builtin_convertvector() does); the 256-bit additions are then performed. For the 8-accumulator-stream version, those two vector widths are doubled.