Closed eloj closed 1 month ago
As evidence for my assertion I'll provide the following example program (godbolt link):
Running MSVC:
sizeof(bi4)=16, sizeof(bi4[0])=4, sizeof(uint)=4
ref:
i0=d0e0f000
i1=90a0b0c0
i2=50607080
direct indexed:
i0=d0e0f000
i1=90a0b0c0
i2=50607080
_mm_extract_epi32:
i0=d0e0f000
i1=90a0b0c0
i2=50607080
Running GCC/Clang (also locally). Note that in the 'direct indexed' case, i1 and i2 are wrong:
sizeof(bi4)=16, sizeof(bi4[0])=8, sizeof(uint)=4
ref:
i0=d0e0f000
i1=90a0b0c0
i2=50607080
direct indexed:
i0=d0e0f000
i1=50607080
i2=d0e0f000
_mm_extract_epi32:
i0=d0e0f000
i1=90a0b0c0
i2=50607080
The test program in full.
// g++ -O2 -Wall -std=c++20 -ggdb -march=native -mavx test_avx.cpp -o test_avx
#include <cstdio>
#include <cstdlib>
#include <cstdint>
#include <xmmintrin.h>
#include <smmintrin.h> // for _mm_extract_epi32
#ifdef _MSC_VER
#define ILANE(a,b) a.m128i_i32[b]
#else
// warning: array subscript 2 is above array bounds of ‘long long int [2]’ [-Warray-bounds=]
#define ILANE(a,b) (a[b])
#endif
#define xLANE(a,b) _mm_extract_epi32(a, b)
void print_m128i(__m128i v) {
alignas(16) uint32_t i[4];
_mm_store_si128((__m128i*)i, v);
printf("ref:\n");
printf("i0=%x\n", i[0]);
printf("i1=%x\n", i[1]);
printf("i2=%x\n", i[2]);
}
void test_indexed(__m128i bi4) {
unsigned int i0 = ILANE( bi4, 0 ), i1 = ILANE( bi4, 1 ), i2 = ILANE( bi4, 2 );
printf("direct indexed:\n");
printf("i0=%x\n", i0);
printf("i1=%x\n", i1);
printf("i2=%x\n", i2);
}
void test_extract(__m128i bi4) {
unsigned int i0 = xLANE( bi4, 0 ), i1 = xLANE( bi4, 1 ), i2 = xLANE( bi4, 2 );
printf("_mm_extract_epi32:\n");
printf("i0=%x\n", i0);
printf("i1=%x\n", i1);
printf("i2=%x\n", i2);
}
int main(int argc, char *argv[]) {
alignas(16) __m128i bi4 = _mm_set_epi32(0x10203040, 0x50607080, 0x90A0B0C0, 0xD0E0F000);
printf("sizeof(bi4)=%zu, sizeof(bi4[0])=%zu, sizeof(uint)=%zu\n", sizeof(bi4), sizeof(ILANE(bi4, 0)), sizeof(unsigned int));
print_m128i(bi4);
test_indexed(bi4);
test_extract(bi4);
}
I think there is an actual issue here. An __m128i can be interpreted in several ways; MSVC for that reason uses bi4.m128i_i32[0] instead of just bi4[0]. There is a safe way to handle this but that involves a union between the __m128i and four integers, which forces it (at least in theory) to memory. Since this is performance-critical code, that is not an option. Another option is to use _mm_shuffle to get an integer to the lowest lane and extract it as a single int. I suspect this is what MSVC does under the hood. It will be ugly code, but it will run at full speed. I will investigate further.
This is now resolved. I did some brief experiments with compiler explorer and the current solution appears to reduce to the bare minimum of instructions.
This might become a discussion piece I'm guessing.
I'm not convinced the existing code is working correctly when built against GCC's library.
The type
_m128i
is defined as a two-component array (64-bits each), so we can't directly index 32-bit ints, and the compiler warns us as such.What I can not explain is why it seems to work anyway. For example, this change does not impact the output of
tiny_bvh_renderer
(!), which at least to me seems surprising, and makes me feel like I'm missing something.