llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org

clang introduces undefined behavior into well-formed C++ program with infinite loop #62057

Open gonzalobg opened 1 year ago

gonzalobg commented 1 year ago

I believe this program is well-formed, and the C++ standard guarantees that a thread that reaches block() will not execute any subsequent instructions, since the thread performs an execution step on every loop iteration (the atomic load).

Therefore, when example(nullptr) is called, the C++ standard guarantees that a null pointer will never be dereferenced.

However, Clang (Godbolt) removes the loop, causing the program to dereference a null pointer and thereby introducing undefined behavior into a well-formed program. My assessment is that Clang is incorrectly failing to insert an llvm.sideeffect intrinsic inside the loop.

GCC does not have this bug and correctly preserves the infinite loop (Godbolt), avoiding the introduction of undefined behavior into a well-formed program.

I tried to find related bugs, but there seems to be a lot of noise (bugs about infinite loops that perform no steps), and no "unsound" label to filter for cases in which Clang is known to introduce UB into well-formed programs.

Reproducer:

#include <atomic>

void block() {
    std::atomic<int> x = 0;
    while (true) {
        if (x.load(std::memory_order_relaxed) == 1)
            break;
    }
}

void example(int* data) {
    block();
    *data = 42;
}

int main() {
    example(nullptr);
    return 0;
}

This is a slightly different reproducer (Godbolt), showing that perform_step is, even in isolation, optimized to remove the atomic memory operation without inserting an llvm.sideeffect as a replacement:

void perform_step() {
    std::atomic<int> x = 0;
    x.load(std::memory_order_relaxed);
}

As a consequence, when perform_step is called from or inlined into:

void block() {
    while (true) perform_step();
}

that code is miscompiled as well.

From skimming the LLVM IR that Clang generates, it seems that to fix this issue with current LLVM semantics, Clang is going to have to insert llvm.sideeffect in many places, and trying to minimize those will require complex logic.

volatile memory operations do not seem to have this issue (Godbolt), but generate very sub-optimal code that performs unnecessary memory operations. While treating atomic operations as volatile could be a quick workaround to fix this soundness bug, experiments like this one show that this workaround would probably regress performance across the board.
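
For illustration, here is a minimal sketch (not part of the original report; names are illustrative) of what the volatile variant of block() looks like. The volatile accesses are observable behavior, so the loop is kept, but every iteration pays for a real load:

// Hypothetical volatile variant of block(): the compiler must emit the
// volatile load on every iteration, so the loop cannot be removed.
void block_volatile() {
    volatile int x = 0;
    while (true) {
        if (x == 1)
            break;
    }
}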


These idioms are the only reliable way to block C++ threads forever, e.g., to prevent them from advancing under broken program invariants and exhibiting undefined behavior at runtime. With GCC, they are a reliable way to block all threads arriving at a particular synchronization point from proceeding until one designated thread arrives and, for example, terminates the program:

#include <cstddef>
#include <exception>

void block_and_terminate(std::size_t thread_id, std::size_t master_thread_id) {
    if (thread_id == master_thread_id) {
        std::terminate();
    } else {
        block();
    }
}
llvmbot commented 1 year ago

@llvm/issue-subscribers-clang-codegen

Endilll commented 1 year ago

There's a related paper that is currently going through WG21: https://github.com/cplusplus/papers/issues/1496

RalfJung commented 1 year ago

@Endilll that paper is not really related. This issue is about loops that C++ already explicitly says are well-defined (because they contain an atomic operation). That paper would only make even more loops well-defined.

shafik commented 1 year ago

This does look like a bug. CC @AaronBallman @dwblaikie, since I am not sure who the right person to pull in for this is.

AaronBallman commented 1 year ago

CC @rjmccall @efriedma-quic as codegen code owners

I don't think LLVM or Clang documents what forward progress guarantees the main thread has (http://eel.is/c++draft/intro.progress#8), but I would imagine we want to provide concurrent forward progress guarantees (http://eel.is/c++draft/intro.progress#7), which would make this a bug.

efriedma-quic commented 1 year ago

The issue is that LLVM's notion of forward progress doesn't quite agree with the standard. Specifically, it doesn't treat a constant-foldable atomic load as "forward progress": it eliminates the load without leaving any indication in the IR that an atomic operation was eliminated.

These idioms are the only reliable way to block C++ threads forever

This is obviously not true. You can, for example, wait on an std::condition_variable (which is also a lot more efficient because it indicates to the OS that the thread is idle).
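
For illustration, a minimal sketch of that alternative (the function name is made up for this example):

#include <condition_variable>
#include <mutex>

// Block the calling thread on a condition variable that is never notified.
// The predicate always returns false, so spurious wakeups go back to sleep.
void block_on_condvar() {
    static std::mutex m;
    static std::condition_variable cv;
    std::unique_lock<std::mutex> lock(m);
    cv.wait(lock, [] { return false; });
}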

gonzalobg commented 1 year ago

This is obviously not true. You can, for example, wait on an std::condition_variable (which is also a lot more efficient because it indicates to the OS that the thread is idle).

Should have clarified with "...on freestanding" (since that is what I'm targeting). Is there a way to block a thread forever on freestanding targets without resorting to inline asm?

rjmccall commented 1 year ago

The issue is that LLVM's notion of forward progress doesn't quite agree with the standard. Specifically, it doesn't treat a constant-foldable atomic load as "forward progress": it eliminates the load without leaving any indication in the IR that an atomic operation was eliminated.

Yeah, having this optimization leave an llvm.sideeffect behind seems like the most straightforward way to fix the issue, although it unfortunately ties IR even more to C/C++ semantics.

rjmccall commented 1 year ago

Is there a way to block a thread forever on freestanding targets without resorting to inline asm?

The simplest way would be to loop forever loading from or storing to a volatile global. I believe LLVM understands that it isn't allowed to fold those even if it can prove that they're the only accesses.
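
A sketch of that approach, with illustrative names; the volatile loads are observable behavior, so the loop is expected to survive optimization even though nothing ever writes the global:

// A global that is never written; volatile forces every load to be emitted.
volatile int g_flag = 0;

// Spin until the (never-set) flag becomes nonzero, i.e. forever.
void block_freestanding() {
    while (g_flag == 0) {
    }
}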

RalfJung commented 1 year ago

although it unfortunately ties IR even more to C/C++ semantics.

Well, it's really part of LLVM's mustprogress attribute. LLVM itself documents that atomics count as making progress, so this optimization violates what is written in the LangRef.

If there's no mustprogress then atomics can be removed without replacement.

efriedma-quic commented 1 year ago

For reference:

it unfortunately ties IR even more to C/C++ semantics

The definition of "mustprogress" is already tied to C++ semantics, and I don't think there's any way around that. Maybe we can figure out some way to avoid complicating programs that aren't using "mustprogress" markings.

This is obviously not true. You can, for example, wait on an std::condition_variable (which is also a lot more efficient because it indicates to the OS that the thread is idle).

Should have clarified with "...on freestanding" (since that is what I'm targeting).

If your environment has threads, it should provide some way for a thread to block without spinning the CPU, and you should probably use it.

Otherwise, a loop involving a volatile operation or asm(""); should be reliable. (You mention "very sub-optimal code", but if the CPU isn't doing any useful work anyway, I can't see how that matters.) Or you can use -fno-finite-loops to turn off the relevant optimizations.
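
For illustration, a sketch of the asm-based variant (names are made up); the empty asm statement is conservatively assumed to have side effects, so the loop is not treated as finite:

// Spin forever; the empty asm volatile statement counts as a side effect,
// so the finite-loop assumption does not allow removing this loop.
void block_with_asm() {
    while (true) {
        asm volatile("");
    }
}

Alternatively, building the original reproducer with -fno-finite-loops disables the finite-loop assumption for the whole translation unit.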

gonzalobg commented 1 year ago

If your environment has threads, it should provide some way for a thread to block without spinning the CPU, and you should probably use it.

Both C++ and clang support environments with threads that don't provide this.

You mention "very sub-optimal code", but if the CPU isn't doing any useful work anyway, I can't see how that matters.

Both C++ and clang support environments with more than one hardware thread, on which a hardware thread spinning doing nothing is significantly better than spinning by bombarding the memory subsystem with useless requests that may slow down the progress of other threads.

Or you can use -fno-finite-loops to turn off the relevant optimizations. Otherwise, a loop involving a volatile operation or asm(""); should be reliable.

The -fno-finite-loops flag seems like a pretty big hammer with wide performance impact. Spinning on a volatile asm("") sounds reasonable.

rjmccall commented 1 year ago

Both C++ and clang support environments with more than one hardware thread, on which a hardware thread spinning doing nothing is significantly better than spinning by bombarding the memory subsystem with useless requests that may slow down the progress of other threads.

Your idiom has the same impact on the memory subsystem as loading from a volatile global would.

gonzalobg commented 1 year ago

Your idiom has the same impact on the memory subsystem as loading from a volatile global would.

AFAICT, C++ compilers are allowed to lower the volatile and atomic implementations of the idiom to the equivalent of:

LABEL:  jump LABEL;

on most hardware architectures, since the effects of atomic or volatile ops on a private automatic variable that is not escaped can't be observed by the program.

Your point is therefore not obvious to me. Are you claiming that the impact of moving the instruction pointer back by one instruction on the memory subsystem is the same as that of issuing an actual load, for all hardware?

rjmccall commented 1 year ago

I agree that it's easier to optimize your idiom. I'm just saying that the compilers which aren't miscompiling it do not in fact optimize away the load, so the idiom's performance benefit over a volatile load is currently just theoretical. Also, in practice both patterns will just hit L1 forever and never bother the rest of memory.

gonzalobg commented 1 year ago

@rjmccall

Also, in practice both patterns will just hit L1 forever and never bother the rest of memory.

This does not hold for many environments clang supports. For example, the NVPTX backend will spill the automatic variable to local memory and perform local memory accesses while spinning.