jbikker / tinybvh

Single-header BVH construction and traversal library.
MIT License
533 stars 25 forks source link

Use `_mm_extract_epi32` with GNU's intrinsics. #5

Closed eloj closed 1 month ago

eloj commented 1 month ago

This might become a discussion piece I'm guessing.

I'm not convinced the existing code is working correctly when built against GCC's library.

The type _m128i is defined as a two-component array (64-bits each), so we can't directly index 32-bit ints, and the compiler warns us as such.

tiny_bvh.h:482:23: warning: array subscript 2 is above array bounds of ‘long long int [2]’ [-Warray-bounds=]
  482 | #define ILANE(a,b) a[b]

What I can not explain is why it seems to work anyway. For example, this change does not impact the output of tiny_bvh_renderer(!), which at least to me seems surprising, and makes me feel like I'm missing something.

eloj commented 1 month ago

As evidence for my assertion I'll provide the following example program (godbolt link):

Running MSVC:

sizeof(bi4)=16, sizeof(bi4[0])=4, sizeof(uint)=4
ref:
i0=d0e0f000
i1=90a0b0c0
i2=50607080
direct indexed:
i0=d0e0f000
i1=90a0b0c0
i2=50607080
_mm_extract_epi32:
i0=d0e0f000
i1=90a0b0c0
i2=50607080

Running GCC/Clang (also locally). Note that in the 'direct indexed' case, i1 and i2 are wrong:

sizeof(bi4)=16, sizeof(bi4[0])=8, sizeof(uint)=4
ref:
i0=d0e0f000
i1=90a0b0c0
i2=50607080
direct indexed:
i0=d0e0f000
i1=50607080
i2=d0e0f000
_mm_extract_epi32:
i0=d0e0f000
i1=90a0b0c0
i2=50607080
eloj commented 1 month ago

The test program in full.

// g++ -O2 -Wall -std=c++20 -ggdb -march=native -mavx test_avx.cpp -o test_avx
#include <cstdio>
#include <cstdlib>
#include <cstdint>

#include <xmmintrin.h>
#include <smmintrin.h> // for _mm_extract_epi32

#ifdef _MSC_VER
#define ILANE(a,b) a.m128i_i32[b]  
#else
// warning: array subscript 2 is above array bounds of ‘long long int [2]’ [-Warray-bounds=]
#define ILANE(a,b) (a[b])
#endif

#define xLANE(a,b) _mm_extract_epi32(a, b)

void print_m128i(__m128i v) {
    alignas(16) uint32_t i[4];
    _mm_store_si128((__m128i*)i, v);
    printf("ref:\n");
    printf("i0=%x\n", i[0]);
    printf("i1=%x\n", i[1]);
    printf("i2=%x\n", i[2]);
}

void test_indexed(__m128i bi4) {
    unsigned int i0 = ILANE( bi4, 0 ), i1 = ILANE( bi4, 1 ), i2 = ILANE( bi4, 2 );

    printf("direct indexed:\n");
    printf("i0=%x\n", i0);
    printf("i1=%x\n", i1);
    printf("i2=%x\n", i2);
}

void test_extract(__m128i bi4) {
    unsigned int i0 = xLANE( bi4, 0 ), i1 = xLANE( bi4, 1 ), i2 = xLANE( bi4, 2 );

    printf("_mm_extract_epi32:\n");
    printf("i0=%x\n", i0);
    printf("i1=%x\n", i1);
    printf("i2=%x\n", i2);
}

int main(int argc, char *argv[]) {

    alignas(16) __m128i bi4 = _mm_set_epi32(0x10203040, 0x50607080, 0x90A0B0C0, 0xD0E0F000);

    printf("sizeof(bi4)=%zu, sizeof(bi4[0])=%zu, sizeof(uint)=%zu\n", sizeof(bi4), sizeof(ILANE(bi4, 0)), sizeof(unsigned int));

    print_m128i(bi4);

    test_indexed(bi4);
    test_extract(bi4);
}
jbikker commented 1 month ago

I think there is an actual issue here. An __m128i can be interpreted in several ways; MSVC for that reason uses bi4.m128i_i32[0] instead of just bi4[0]. There is a safe way to handle this but that involves a union between the __m128i and four integers, which forces it (at least in theory) to memory. Since this is performance-critical code, that is not an option. Another option is to use _mm_shuffle to get an integer to the lowest lane and extract it as a single int. I suspect this is what MSVC does under the hood. It will be ugly code, but it will run at full speed. I will investigate further.

jbikker commented 1 month ago

This is now resolved. I did some brief experiments with compiler explorer and the current solution appears to reduce to the bare minimum of instructions.