Open Anton3 opened 1 year ago
We now provide compiler::ThreadLocal
that should be used in services instead of thread_local
.
However, third-party libraries still may and do use thread_local
, and until Clang supports an option for fiber-safe thread_local
optimizations, there will be no safe solution.
We've disabled LTO by default due to ongoing issues with third-party libraries. For example, when linking Jemalloc statically, the following code causes a data race:
void* smth = malloc(...);
engine::Yield();
free(smth);
We are currently discussing the possibility of implementing a "fiber-safe optimizations" flag in Clang. This will take several months.
We've started the discussion on the topic at https://discourse.llvm.org/t/misoptimization-of-tls-and-attibute-const-in-stackful-coroutines-ffunction-thread-migration/80081
Also, there's a chance that -femulated-tls
could be useful in solving this issue.
Minimal reproducer:
Let's say that after the context switch, the task is scheduled on another thread.
It's expected then that after
Yield
, the new instance ofx
(which is local to the current thread) will be incremented. But that's not what happens! The compiler can't imagine the situation that the address of athread_local
variable changes, so it only loads the address ofx
once and caches it across the suspension point. Then it keeps using the variable of another thread, causing a data race, because another thread may use it at the same time (it has every right to do so, because it is its own instance of thethread_local
variable).This bug has been documented with little reaction from compiler maintainers:
Recent versions of Clang (at least Clang-14) do this optimization under LTO even when variable access happens inside a function.
Marking the variable
atomic
,volatile
, addingasm volatile("" ::: "memory");
does not help.The only thing that helps is to mark the function that accesses the variable as
noinline
, as we've done in bug core: prevent clang 14 LTO UB thread_local access. We, however, cannot expect all the user and library code to do that.A reproducer test has been added in feat engine: demonstrate that thread_local variables don't work with userver.