jbikker / tinybvh

Single-header BVH construction and traversal library.
MIT License

Remove explicit casts. #10

Closed. eloj closed this 4 weeks ago

eloj commented 4 weeks ago

The code gen isn't exactly the same before and after, but the differences are very minor. Some timings follow in the post after this.

The GCC diff is less clean and slightly worse-looking, but the clang one looks okay to me.

I understand if you prefer to keep the casting in the end.

(Before this example was updated, I got it bit-exact by adding a conversion function[^1] to the TriVertex type. I imagine it would be possible to do the same directly on the bvhvec4 type, but that's a more intrusive change.)

[^1]: i.e. `operator bvhvec3&() const { return *(bvhvec3*)(this); }`
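
To show what a conversion function like that buys at the call site, here is a minimal self-contained sketch. `bvhvec3` and `TriVertex` are the names from the discussion, but the member layout and the `operator-` helper are assumptions for illustration, not tinybvh's actual definitions:

```cpp
// Minimal sketch; layout and helpers are assumed, not tinybvh's actual code.
struct bvhvec3 { float x, y, z; };

struct TriVertex
{
	float x, y, z, dummy; // 16-byte padded vertex, laid out like a bvhvec4
	// Reinterpret the first 12 bytes as a bvhvec3 so a TriVertex can be passed
	// wherever a bvhvec3 is expected, without an explicit cast at the call site.
	operator const bvhvec3&() const { return *(const bvhvec3*)this; }
};

inline bvhvec3 operator-( const bvhvec3& a, const bvhvec3& b )
{
	return { a.x - b.x, a.y - b.y, a.z - b.z };
}

// Before: bvhvec3 e1 = *(bvhvec3*)&v1 - *(bvhvec3*)&v0;
// After:  the conversion operator does the reinterpretation implicitly.
inline bvhvec3 edge( const TriVertex& v0, const TriVertex& v1 ) { return v1 - v0; }
```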

```
$ diff -u asm-orig-clang++-19-tiny_bvh_fenster.s asm-new-clang++-19-tiny_bvh_fenster.s
--- asm-orig-clang++-19-tiny_bvh_fenster.s  2024-11-02 22:15:59.183094736 +0100
+++ asm-new-clang++-19-tiny_bvh_fenster.s   2024-11-02 22:16:11.761182138 +0100
@@ -735,27 +735,24 @@
    movslq  %r10d, %r11
    shlq    $4, %r10
    shlq    $4, %r11
-   vmovss  4(%r10,%rcx), %xmm7             # xmm7 = mem[0],zero,zero,zero
-   vmovss  8(%r10,%rcx), %xmm8             # xmm8 = mem[0],zero,zero,zero
-   vmovss  (%r10,%rcx), %xmm6              # xmm6 = mem[0],zero,zero,zero
-   vmovss  24(%r11,%rcx), %xmm11           # xmm11 = mem[0],zero,zero,zero
-   vmovss  36(%r11,%rcx), %xmm13           # xmm13 = mem[0],zero,zero,zero
-   vmovss  16(%r11,%rcx), %xmm9            # xmm9 = mem[0],zero,zero,zero
-   vmovss  20(%r11,%rcx), %xmm10           # xmm10 = mem[0],zero,zero,zero
-   vmovss  32(%r11,%rcx), %xmm12           # xmm12 = mem[0],zero,zero,zero
-   vmovss  40(%r11,%rcx), %xmm14           # xmm14 = mem[0],zero,zero,zero
-   vsubss  %xmm8, %xmm11, %xmm11
-   vsubss  %xmm7, %xmm13, %xmm13
-   vsubss  %xmm7, %xmm10, %xmm10
-   vsubss  %xmm6, %xmm9, %xmm9
-   vsubss  %xmm6, %xmm12, %xmm12
-   vsubss  %xmm8, %xmm14, %xmm7
-   vmulss  %xmm11, %xmm13, %xmm6
-   vmulss  %xmm10, %xmm12, %xmm8
+   vmovsd  (%r10,%rcx), %xmm6              # xmm6 = mem[0],zero
+   vmovss  8(%r10,%rcx), %xmm11            # xmm11 = mem[0],zero,zero,zero
+   vmovsd  16(%r11,%rcx), %xmm7            # xmm7 = mem[0],zero
+   vmovsd  32(%r11,%rcx), %xmm8            # xmm8 = mem[0],zero
+   vsubps  %xmm6, %xmm7, %xmm9
+   vmovss  24(%r11,%rcx), %xmm7            # xmm7 = mem[0],zero,zero,zero
+   vsubps  %xmm6, %xmm8, %xmm8
+   vmovss  40(%r11,%rcx), %xmm6            # xmm6 = mem[0],zero,zero,zero
+   vmovshdup   %xmm8, %xmm13           # xmm13 = xmm8[1,1,3,3]
+   vmovshdup   %xmm9, %xmm10           # xmm10 = xmm9[1,1,3,3]
+   vsubss  %xmm11, %xmm7, %xmm12
+   vsubss  %xmm11, %xmm6, %xmm7
+   vmulss  %xmm12, %xmm13, %xmm6
    vfmsub231ss %xmm7, %xmm10, %xmm6    # xmm6 = (xmm10 * xmm7) - xmm6
    vmulss  %xmm7, %xmm9, %xmm7
+   vfmsub231ss %xmm12, %xmm8, %xmm7    # xmm7 = (xmm8 * xmm12) - xmm7
+   vmulss  %xmm10, %xmm8, %xmm8
    vfmsub231ss %xmm13, %xmm9, %xmm8    # xmm8 = (xmm9 * xmm13) - xmm8
-   vfmsub231ss %xmm11, %xmm12, %xmm7   # xmm7 = (xmm12 * xmm11) - xmm7
    vmulss  %xmm7, %xmm7, %xmm9
    vfmadd231ss %xmm6, %xmm6, %xmm9     # xmm9 = (xmm6 * xmm6) + xmm9
+   vfmadd231ss %xmm8, %xmm8, %xmm9     # xmm9 = (xmm8 * xmm8) + xmm9
```
eloj commented 4 weeks ago

Ran the matrix of clang++-19 and GCC 14.2 before and after this change using the timing branch (#11), no special prep:

```
before-gcc.txt:Frame 8 time: 1529 ms, avg: 1531.00 ms
after-gcc.txt:Frame 8 time: 1531 ms, avg: 1529.00 ms
before-clang.txt:Frame 8 time: 1352 ms, avg: 1348.75 ms
after-clang.txt:Frame 8 time: 1339 ms, avg: 1339.38 ms
```
jbikker commented 4 weeks ago

Removing the casts is fine and makes the code more compact and clear. Originally I had that TriVertex thing in there, but that was silly; bvhvec4 is just fine and has automatic conversion to and from bvhvec3.
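
For readers not familiar with the types, the kind of automatic conversion being described could look roughly like this. This is a sketch of the general pattern only; the constructor and operators shown are assumptions, not tinybvh's exact definitions:

```cpp
// Sketch of the general pattern, not tinybvh's exact code: a 4-component
// vector that converts implicitly to and from its 3-component counterpart,
// so vertex math needs no explicit casts at the call site.
struct bvhvec3 { float x, y, z; };

struct bvhvec4
{
	float x, y, z, w;
	bvhvec4() = default;
	bvhvec4( const bvhvec3& v, float w = 0 ) : x( v.x ), y( v.y ), z( v.z ), w( w ) {}
	operator bvhvec3() const { return { x, y, z }; } // drop w automatically
};

inline bvhvec3 operator-( const bvhvec3& a, const bvhvec3& b )
{
	return { a.x - b.x, a.y - b.y, a.z - b.z };
}

// Subtracting two padded vertices yields a bvhvec3 with no casts in sight.
inline bvhvec3 edge( const bvhvec4& v0, const bvhvec4& v1 ) { return v1 - v0; }
```

Whichever exact form the conversions take in the library, having them in both directions is what makes the explicit casts removed by this PR redundant.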