Open llvmbot opened 3 years ago
I was able to reproduce the problem with the following program (see below). It won't reproduce well on a loaded machine and you may have to fiddle with the spin counts to get a successful repro.
In my tests, I was able to reproduce the problem on:
Note that this cannot be reproduced with WSL1 since Microsoft's implementation of the syscall does not have this bug. It can be reproduced with WSL2 though.
Output from my machine:
Remaining: 0.020000000 - signal count: 0 Remaining: 0.312026100 - signal count: 6322 Remaining: 0.610834500 - signal count: 12729 Remaining: 0.907113100 - signal count: 19096 Remaining: 1.202000100 - signal count: 25452 Remaining: 1.499204500 - signal count: 31842 Remaining: 1.799767600 - signal count: 38277 Remaining: 2.095204700 - signal count: 44636 (...snip...) Remaining: 86.829101800 - signal count: 1865855 Remaining: 87.126722500 - signal count: 1872251 Remaining: 87.427912300 - signal count: 1878704 Remaining: 87.730148000 - signal count: 1885171 Remaining: 88.029529500 - signal count: 1891599 Remaining: 88.329424200 - signal count: 1898039 Remaining: 88.629499900 - signal count: 1904482 Remaining: 88.926003300 - signal count: 1910861 ERROR: timeout. sleep thread seems hung.
using namespace std::chrono_literals;
constexpr auto acq = std::memory_order_acquire; constexpr auto rlx = std::memory_order_relaxed;
constexpr uint64_t sigHandlerSpin = 10'000; constexpr uint64_t sigSenderSpin = 100; constexpr int64_t sleepTimeNs = 20'000'000; // 20ms constexpr int64_t reportIntervalNs = 200'000'000; // 200ms constexpr auto timeout = 60s;
thread_local bool tls_isSleepThread{false};
std::atomic
struct timespec makeTs(time_t sec, int64_t nsec) { struct timespec ts = {}; ts.tv_sec = sec; ts.tv_nsec = nsec; return ts; }
void spinWait(uint64_t spinCount) { [[maybe_unused]] volatile int32_t dummy = 0; volatile int32_t divisor = 123456789; volatile int32_t dividend = 33; for (uint64_t i = 0; i < spinCount; i++) dummy = divisor / dividend; }
void signalHandler(int) { if (tls_isSleepThread) { s_signalCount++; spinWait(sigHandlerSpin); } s_signalPending = false; };
void spinBarrier(std::atomic
struct State {
std::atomic
[[maybe_unused]] void nanosleepThreadFn(State& state) { tls_isSleepThread = true; auto sleepTime = makeTs(0, sleepTimeNs); state.reportSec.store(sleepTime.tv_sec, rlx); state.reportNsec.store(sleepTime.tv_nsec, rlx);
spinBarrier(state.startBarrier);
while (!state.stop) {
if (::nanosleep(&sleepTime, &sleepTime) != -1)
break;
if (errno != EINTR)
break;
auto gen = state.reportGen.load(acq);
state.reportSec.store(sleepTime.tv_sec, rlx);
state.reportNsec.store(sleepTime.tv_nsec, rlx);
state.reportGen.store(gen + 1);
}
state.stop = true;
}
[[maybe_unused]] void stdSleepThreadFn(State& state) { tls_isSleepThread = true; auto sleepTime = 1ns * sleepTimeNs;
spinBarrier(state.startBarrier);
std::this_thread::sleep_for(sleepTime);
state.stop = true;
}
void reportThreadFn(State& state) { auto const delay = makeTs(0, reportIntervalNs);
spinBarrier(state.startBarrier);
while (!state.stop) {
// try to get consistent snapshot...
int64_t sec;
int64_t nsec;
for (size_t i = 0; i < 10; i++) {
auto const gen = state.reportGen.load(acq);
spinWait(10);
sec = state.reportSec.load(rlx);
nsec = state.reportNsec.load(rlx);
spinWait(10);
auto const gen2 = state.reportGen.load(acq);
if (gen2 == gen)
break;
}
std::cout << "Remaining: " << sec << "."
<< std::setfill('0') << std::setw(9) << nsec
<< " - signal count: " << s_signalCount
<< std::endl;
nanosleep(&delay, nullptr);
}
}
int runTest() { s_signalCount = 0; s_signalPending = false;
// install SIGURG handler...
struct sigaction oldSigUrgAction = {};
struct sigaction sigUrgAction = {};
sigUrgAction.sa_handler = signalHandler;
sigaction(SIGURG, &sigUrgAction, &oldSigUrgAction);
auto* const state = new State{};
std::thread sleepThread{[state] { nanosleepThreadFn(*state); }};
std::thread reportThread{[state] { reportThreadFn(*state); }};
auto const sleepThreadHandle = sleepThread.native_handle();
spinBarrier(state->startBarrier);
using Clock = std::chrono::steady_clock;
auto const t0 = Clock::now();
// Hammer the sleep thread with signals...
bool isTimedOut = false;
while (!state->stop && !isTimedOut) {
if (!s_signalPending) {
s_signalPending = true;
spinWait(sigSenderSpin);
pthread_kill(sleepThreadHandle, SIGURG);
}
if (Clock::now() - t0 > timeout) {
isTimedOut = true;
break;
}
}
state->stop = true;
reportThread.join();
if (isTimedOut) {
std::cout << "ERROR: timeout. sleep thread seems hung." << std::endl;
// Leak sleepThread and state to avoid crash
sleepThread.detach();
} else {
sleepThread.join();
delete state;
}
// Restore old SIGCHLD action.
sigaction(SIGURG, &oldSigUrgAction, nullptr);
return isTimedOut ? 1 : 0;
}
int main() { return runTest(); }
Extended Description
lib++ uses nanosleep in __libcpp_thread_sleep_for like this:
while (nanosleep(&ts, &ts) == -1 && errno == EINTR);
On Linux, such a nanosleep loop can run forever because of a Linux bug that's documented here: https://man7.org/linux/man-pages/man2/nanosleep.2.html#BUGS
For this to happen the thread needs to receive signals at a high frequency which seems to be rather uncommon.
https://github.com/golang/proposal/blob/5b63da9579c3b19294be614dcad33e20a9a4ad22/design/24543-non-cooperative-preemption.md https://github.com/golang/go/blob/go1.15.6/src/runtime/signal_unix.go#L347 https://github.com/golang/go/blob/go1.15.6/src/runtime/preempt.go#L223
This could lead to problems when on Linux, native code that uses libc++ is called from a Go thread. The Go thread executing this_thread::sleep_for will loop forever calling nanosleep and the other Go thread that tries to preempt it will also loop forever, perpetually trying to preempt it.
I know this is a Linux bug, but since a lot of people are using Linux it might be worth implementing a workaround in libc++.