Clang and compiler-rt disagree on preconditions for __aeabi_llsr and __aeabi_llsl

davidben commented 1 year ago

uint64_t{a} >> b in C is only defined for 0 <= b < 64. On 32-bit platforms, uint64_t is sometimes implemented with an intrinsic. But what are the preconditions on that intrinsic?

The compiler-rt implementations say they share the precondition. See the "Precondition:" comment. In code, they seem to also assume this. The b & bits_in_word check only makes sense because of the bound, and if b were too large, those functions (themselves written in C) would hit the C-level UB. https://github.com/llvm/llvm-project/blob/689de4c6759fa810d827aee06a0ab060b01172ce/compiler-rt/lib/builtins/ashldi3.c https://github.com/llvm/llvm-project/blob/2f4c96097a3cbf9d07eab699efd25f767bb4fdd5/compiler-rt/lib/builtins/lshrdi3.c

However, Clang seems to believe otherwise: https://godbolt.org/z/rjPx5j6ne

First, note -fsanitize=undefined does not add checks to the guarded functions, only the unguarded functions. So Clang seems to believe that r > 63 ? 0 : x >> r is sufficient to dispatch the preconditions. However, the generated code for the guarded functions looks like:

guarded_rshift:
        push    {r4, lr}
        mov     r4, r2
        bl      __aeabi_llsr
        cmp     r4, #63
        movhi   r0, #0
        movhi   r1, #0
        pop     {r4, lr}
        bx      lr
guarded_lshift:
        push    {r4, lr}
        mov     r4, r2
        bl      __aeabi_llsl
        cmp     r4, #63
        movhi   r0, #0
        movhi   r1, #0
        pop     {r4, lr}
        bx      lr

There's no branch before calling __aeabi_llsr and __aeabi_llsl. Clang seems to be transforming this into:

uint64_t guarded_rshift(uint64_t x, unsigned r) {
    // WAS: return r > 63 ? 0 : x >> r;
    uint64_t t = __aeabi_llsr(x, r);
    return r > 63 ? 0 : t;
}

That is, Clang believe those functions are defined for all inputs. It doesn't seem to care what the output is, but it is relying on them to not diverge. This contradicts with the compiler-rt implementation, so I think one needs to change. We have to either believe:

The intrinsics are defined for all inputs. The compiler-rt comments and implementation are wrong. Or...
The intrinsics have the documented preconditions. Clang's output above was invalid.

(No preferences on my end which option makes more sense.)

We ran into this experimenting with Rust in Chrome, because Rust also provides a copy of these functions. When building that with trapping overflow, we get a miscompile. Initially we thought this was just a Rust bug, but it seems the preconditions are unclear even without Rust. I expect a UBSan-built compiler-rt would have the same issues. Possibly further fun times if one ever does LTO here... (We'll file a bug with Rust that they need to make sure their functions match whatever's decided here, depending on which it is.)

llvmbot commented 1 year ago

@llvm/issue-subscribers-clang-codegen

efriedma-quic commented 1 year ago

Given the way LLVM IR defines shifts, if we require any preconditions on __aeabi_llsr and __aeabi_llsl, LLVM would have to mask the shift amount of every shift. (There's no way to tell if a specific shift is being executed speculatively.) I'd prefer to define the functions in a way which avoids that.

davidben commented 1 year ago

Preconditions would also prevent the optimization here and, at least I assume that optimization is beneficial. 😄

Perhaps the answer is to define it to be wrapping? I.e. __aeabi_llsr(a, b) = a >> (b & 63) and move on with life? An alternative would be a very complex API contract where __aeabi_llsr(a, 1234) is guaranteed to produce some value, but it's unspecified which. But that's only useful if it somehow makes the intrinsic easier to implement. (Clang can always pretend the intrinsic is less useful than it is, so it's better for Clang to make the intrinsic as defined as possible.)

Looking at the compiler-rt function, I think the only changes needed to make it wrapping are:

Cast b to unsigned to not worry about negative b.
Replace if (b & bits_in_word) with if (b >= bits_in_word).
Replace b - bits_in_word with b & (bits_in_word - 1). On ISAs where overshift is already wrapping, this is a no-op and the compiler can delete the mask. On ISAs where overshift is zero, we needed the operation anyway.

Inspecting the output, the two versions seem to generate comparable code. The only variation is sometimes if (b & bits_in_word) is tighter than if (b >= bits_in_word), e.g. TBNZ on aarch64. https://godbolt.org/z/cadY8aPch (there are four tabs on the right pane)

But I'm not actually a compiler person, so will defer to you all. I only stick my nose into such things because C/C++ are such catastrophic UB disasters that it's impossible to use them without a little exposure to such things. So take this as just a suggestion from an exasperated C/C++ user that maybe we don't need UB here. :-)

efriedma-quic commented 1 year ago

Making the implementation in compiler-rt wrap is easy enough; the issue is other existing implementations that don't wrap. Since this is part of the ABI we can't expect users to use the latest version of compiler-rt; they can use any compatible runtime.

It should be safe to assume existing implementations never crash, though, since both clang and gcc speculatively call them with out-of-bounds values.

For ARM specifically, these functions are defined as part of the platform ABI (https://github.com/ARM-software/abi-aa/blob/main/rtabi32/rtabi32.rst), so we should coordinate. CC @davemgreen @stuij ?

davidben commented 1 year ago

The plot thickens further! Looks like Rust already ran into this at some point and made their function total on the latest revisions: https://github.com/rust-lang/compiler-builtins/pull/521 https://github.com/rust-lang/compiler-builtins/commit/1634193e0444a7252d3b87636ad365ab6a7e06cf

Though I'm not sure if their function wraps. It might just return some arbitrary garbage? (Which is sufficient for the optimization in this case, just kinda weird.) @taiki-e FYI.

davidben commented 1 year ago

Hmm, given these are defined in so many places, maybe we'll have to settle for "function is guaranteed to be total, but out-of-bounds return some arbitrary unspecified value", just to avoid invalidating a bunch of impls. Do Clang and GCC only need the functions to be total, or do they have stricter requirements on them?

efriedma-quic commented 1 year ago

Returning an arbitrary value is definitely sufficient for clang. Should also be fine for gcc, I think, although I haven't verified it.

llvm / llvm-project

Clang and compiler-rt disagree on preconditions for __aeabi_llsr and __aeabi_llsl #62743