redorav / hlslpp

Math library using HLSL syntax with multiplatform SIMD support
MIT License
580 stars 47 forks source link

Undefined behavior when accessing vector elements with operator[] #78

Closed mlangerak closed 10 months ago

mlangerak commented 11 months ago

The way swizzles are implemented on vectors and matrices is UB due to the strict aliasing rule. This is how the vector data is defined:

union
{
    n128 vec;
    float f32[1];
    #include "swizzle/hlsl++_vector_float_x.h"
};

Type punning accesses through f32[] vs the swizzle struct is UB, since float and the nested n128 inside the swizzle struct definition are different types. (It is probably strictly speaking also UB for type punning through vec and the swizzle struct but that is unlikely to be a problem in practice I imagine.)

Looking at the code briefly, it seems thef32[]member is only used for component access through operator[]? If so, can f32[] be removed and the component access be done through permutes on vec to avoid UB?

redorav commented 11 months ago

HI @mlangerak, is this something you have come across in practice? I tend to be quite practical with regards to UB, and unions are something compilers tend to make exceptions with even with strict aliasing on. In this instance in fact, a float points to another float, so I can't see where that would be an issue. I make an explicit mention of this in the readme:

The f32 members of float4 and the [ ] operators make use of the union directly, so the generated code is up to the compiler. Use with care

The compiler needs to issue the SIMD instructions for this, and it is up to it to generate whatever code it feels is appropriate. It's not something that is necessarily good practice all the time, but if you're just setting up matrices or whatever or quickly fiddling about, it can be useful.

mlangerak commented 11 months ago

The union consists of a float[] and an n128 which when, vector intrinsics are enabled, are distinct types e.g. float and __m128. The compiler is only required to track the most recently written union member and accessing the other one is UB. Whether or not that is a problem in practice is compiler dependent as you mention. I haven't directly encountered an issue with this in using hlsl++ but it is a concern, esp. with the gcc compiler which can be perversely aggressive with exploiting UB in optimizations.

Also there might be a performance concern with this -- does the optimizer recognize the float access through operator[] can be achieved with permute on the n128? I haven't checked the assembly output, but I can imagine the compiler will throw up its hands in that case and do a round trip through memory instead.

redorav commented 11 months ago

I added this feature for someone who asked for it as a quick way to access floats, in the understanding that it is up to the compiler. I don't even know how I would do this in some other way, having to return a float&. There is a performance concern, that's why there is a caveat that it's up to the compiler and that's why the recommended way to do it is either via the swizzles or the constructors.

In case you're curious, here's a quick godbolt example that you may want to play with, you can see that the compilers know exactly what I'm doing, GCC is quite clever when you set the vectors as constants. https://godbolt.org/z/qvooMG9EK

Anyway, I would rather we come up with a case where it breaks and check the assembly, there's no point trying to fix something without hard evidence. I'll leave this open for the reply.

mlangerak commented 11 months ago

The reason why I started digging into the hlsl++ implementation, and I noticed the UB in the use of unions, is that I get incorrect results from hlsl++ in release builds. This occurs only on Apple Clang 15 at optimization level -O3. Apple Clang -O2 and MSVC at max optimization level give correct results (I have not tried GCC.) Under Apple Clang -O3 the code either produces garbage results or crashes, depending on minor changes in my code that uses hlsl++. Interestingly, when I compile with HLSLPP_SCALAR the code works correctly, even with -O3.

This is a fairly large codebase, and narrowing down to a minimal repro would be very time consuming. I cannot claim that the errors I am seeing at -O3 is due to UB in the use of unions, but nothing else in the hlsl++ code stands out as problematic so it seems the most likely culprit? Unfortunately, I noticed that the union float/vector aliasing is quite pervasive in hlsl++, e.g. it is used in swizzle1 too, and I gave up on my attempts at hacking at hlsl++ to remove this source of UB to prove/disprove whether or not it is causing the errors I am seeing at -O3.

redorav commented 11 months ago

I see what you mean, it might have been more useful to lead with that so I can understand your problem better :)

Just to reply to a few points, the swizzles rely on the union, there's nothing that can be done about that, it's the only way I could think of to enable code like a = b.xzyx, so it's essentially a core feature. That said, I can't imagine why simply having that there would be an issue if you don't use swizzles.

I don't remember if the unit tests are run with -O3 in appveyor, I could try that as someone else reported a Mac-specific issue that I wasn't able to reproduce. I don't have a Mac proper (I test in a virtual machine + appveyor) so that's a difficulty for me in that sense as well. Is this apple silicon or intel? What version of Mac and the Clang compiler? That's important as in one case you'd use NEON and in the other SSE, just so we can narrow down the issue.

In terms of the problem itself, are you able to provide a snippet of code of where it goes wrong to look at it? I'm happy to try a bit of guesswork if that might narrow it down.

mlangerak commented 11 months ago

I see what you mean, it might have been more useful to lead with that so I can understand your problem better :)

That was somewhat hot of the press, before I could figure out if the problem was in my code or in hlsl++. Actually, it is still unclear, since even with HLSLPP_SCALAR I notice there are subtle differences in output between my GPU and CPU with hlsl++ codepaths (the differences are greater than what can be explained with fp roundoff error or indeterminism due to GPU threading execution order.)

This is on Apple Silicon, M2, Apple Clang 1500.0.40.1, Sonoma 14.1.1.

I do use swizzles in my code quite a bit, so I am dependent on the union trick doing the right thing. I did eliminate all uses of operator[] in my client code, so the only uses are within hlsl++ itself, but that made no difference. I'll work on narrowing down the problem on my end, I don't know if I'll be successful to reduce it to a snippet I can post but I'll try.

mlangerak commented 11 months ago

The subtle difference with HLSLPP_SCALAR was due to a bug in hlsl++, I opened a new issue for this.

mlangerak commented 11 months ago

OK so maybe this is a compiler bug in (Apple) clang? Whilst hacking my code trying to narrow down the problem, I got into a state where it would crash in the NEON implementation of vpermq_s32 with -O3 but it would work fine with -O2.

Comparing the disassembly, with -O2:

mov.d  v2[1], v1[0]
tbl.8b v1, { v2 }, v8
mov.d  v0[1], v1[0]
ldr    q2, [sp]
mul.4s v23, v0, v2
fmov   w9, s1

With -O3:

mov.d  v24[1], v23[0]
.long  0x0e033717  ; unknown opcode
mov.d  v22[1], v23[0]
mul.4s v22, v22, v1
fmov   w9, s23

The compiler generates an unknown opcode with -O3...?! Never seen that before :O

redorav commented 11 months ago

That does sound pretty strange. Maybe I'm doing something weird, have you tried vshufq_s32 or vpermq_f32? Those are all relatively similar

redorav commented 10 months ago

@mlangerak I'll close this off on the assumption that it's fixed, let me know if I can help any further