userver-framework / userver

Production-ready C++ Asynchronous Framework with rich functionality
https://userver.tech
Apache License 2.0
2.46k stars 290 forks source link

thread_local variables cause data races (UB) when used with userver #242

Open Anton3 opened 1 year ago

Anton3 commented 1 year ago

Minimal reproducer:

thread_local int x{};

++x;
engine::Yield();
++x;

Let's say that after the context switch, the task is scheduled on another thread.

It's expected then that after Yield, the new instance of x (which is local to the current thread) will be incremented. But that's not what happens! The compiler can't imagine the situation that the address of a thread_local variable changes, so it only loads the address of x once and caches it across the suspension point. Then it keeps using the variable of another thread, causing a data race, because another thread may use it at the same time (it has every right to do so, because it is its own instance of the thread_local variable).

This bug has been documented with little reaction from compiler maintainers:

Recent versions of Clang (at least Clang-14) do this optimization under LTO even when variable access happens inside a function.

Marking the variable atomic, volatile, adding asm volatile("" ::: "memory"); does not help.

The only thing that helps is to mark the function that accesses the variable as noinline, as we've done in bug core: prevent clang 14 LTO UB thread_local access. We, however, cannot expect all the user and library code to do that.

A reproducer test has been added in feat engine: demonstrate that thread_local variables don't work with userver.

Note that this issue is not the same as:

thread_local int x{};

MyContainer foo(&x);
engine::Yield();
foo.Frobnicate();

This code pattern is buggy in the context of userver, and is expected to be buggy, and we & our users have learned to avoid it.

Anton3 commented 11 months ago

We now provide compiler::ThreadLocal that should be used in services instead of thread_local.

However, third-party libraries still may and do use thread_local, and until Clang supports an option for fiber-safe thread_local optimizations, there will be no safe solution.

Anton3 commented 8 months ago

We've disabled LTO by default due to ongoing issues with third-party libraries. For example, when linking Jemalloc statically, the following code causes a data race:

void* smth = malloc(...);
engine::Yield();
free(smth);

We are currently discussing the possibility of implementing a "fiber-safe optimizations" flag in Clang. This will take several months.

apolukhin commented 2 months ago

We've started the discussion on the topic at https://discourse.llvm.org/t/misoptimization-of-tls-and-attibute-const-in-stackful-coroutines-ffunction-thread-migration/80081

Also, there's a chance that -femulated-tls could be useful in solving this issue.