llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org
Other
27.94k stars 11.53k forks source link

this_thread::sleep_for(short duration) can sleep forever on Linux #47904

Open llvmbot opened 3 years ago

llvmbot commented 3 years ago
Bugzilla Link 48560
Version 11.0
OS Linux
Reporter LLVM Bugzilla Contributor
CC @mclow

Extended Description

lib++ uses nanosleep in __libcpp_thread_sleep_for like this:

while (nanosleep(&ts, &ts) == -1 && errno == EINTR);

On Linux, such a nanosleep loop can run forever because of a Linux bug that's documented here: https://man7.org/linux/man-pages/man2/nanosleep.2.html#BUGS

For this to happen the thread needs to receive signals at a high frequency which seems to be rather uncommon.

https://github.com/golang/proposal/blob/5b63da9579c3b19294be614dcad33e20a9a4ad22/design/24543-non-cooperative-preemption.md https://github.com/golang/go/blob/go1.15.6/src/runtime/signal_unix.go#L347 https://github.com/golang/go/blob/go1.15.6/src/runtime/preempt.go#L223

This could lead to problems when on Linux, native code that uses libc++ is called from a Go thread. The Go thread executing this_thread::sleep_for will loop forever calling nanosleep and the other Go thread that tries to preempt it will also loop forever, perpetually trying to preempt it.

I know this is a Linux bug, but since a lot of people are using Linux it might be worth implementing a workaround in libc++.

llvmbot commented 3 years ago

I was able to reproduce the problem with the following program (see below). It won't reproduce well on a loaded machine and you may have to fiddle with the spin counts to get a successful repro.

In my tests, I was able to reproduce the problem on:

Note that this cannot be reproduced with WSL1 since Microsoft's implementation of the syscall does not have this bug. It can be reproduced with WSL2 though.

Output from my machine:

Remaining: 0.020000000 - signal count: 0 Remaining: 0.312026100 - signal count: 6322 Remaining: 0.610834500 - signal count: 12729 Remaining: 0.907113100 - signal count: 19096 Remaining: 1.202000100 - signal count: 25452 Remaining: 1.499204500 - signal count: 31842 Remaining: 1.799767600 - signal count: 38277 Remaining: 2.095204700 - signal count: 44636 (...snip...) Remaining: 86.829101800 - signal count: 1865855 Remaining: 87.126722500 - signal count: 1872251 Remaining: 87.427912300 - signal count: 1878704 Remaining: 87.730148000 - signal count: 1885171 Remaining: 88.029529500 - signal count: 1891599 Remaining: 88.329424200 - signal count: 1898039 Remaining: 88.629499900 - signal count: 1904482 Remaining: 88.926003300 - signal count: 1910861 ERROR: timeout. sleep thread seems hung.


include

include

include

include

include

include

include

using namespace std::chrono_literals;

constexpr auto acq = std::memory_order_acquire; constexpr auto rlx = std::memory_order_relaxed;

constexpr uint64_t sigHandlerSpin = 10'000; constexpr uint64_t sigSenderSpin = 100; constexpr int64_t sleepTimeNs = 20'000'000; // 20ms constexpr int64_t reportIntervalNs = 200'000'000; // 200ms constexpr auto timeout = 60s;

thread_local bool tls_isSleepThread{false}; std::atomic s_signalCount{0}; std::atomic s_signalPending{false};

struct timespec makeTs(time_t sec, int64_t nsec) { struct timespec ts = {}; ts.tv_sec = sec; ts.tv_nsec = nsec; return ts; }

void spinWait(uint64_t spinCount) { [[maybe_unused]] volatile int32_t dummy = 0; volatile int32_t divisor = 123456789; volatile int32_t dividend = 33; for (uint64_t i = 0; i < spinCount; i++) dummy = divisor / dividend; }

void signalHandler(int) { if (tls_isSleepThread) { s_signalCount++; spinWait(sigHandlerSpin); } s_signalPending = false; };

void spinBarrier(std::atomic& barrier) { barrier--; while (barrier.load(acq)) { // wait... } }

struct State { std::atomic startBarrier{3}; std::atomic stop{false}; std::atomic reportGen{0}; std::atomic reportSec{0}; std::atomic reportNsec{0}; };

[[maybe_unused]] void nanosleepThreadFn(State& state) { tls_isSleepThread = true; auto sleepTime = makeTs(0, sleepTimeNs); state.reportSec.store(sleepTime.tv_sec, rlx); state.reportNsec.store(sleepTime.tv_nsec, rlx);

spinBarrier(state.startBarrier);

while (!state.stop) {
    if (::nanosleep(&sleepTime, &sleepTime) != -1)
        break;
    if (errno != EINTR)
        break;

    auto gen = state.reportGen.load(acq);
    state.reportSec.store(sleepTime.tv_sec, rlx);
    state.reportNsec.store(sleepTime.tv_nsec, rlx);
    state.reportGen.store(gen + 1);
}

state.stop = true;

}

[[maybe_unused]] void stdSleepThreadFn(State& state) { tls_isSleepThread = true; auto sleepTime = 1ns * sleepTimeNs;

spinBarrier(state.startBarrier);
std::this_thread::sleep_for(sleepTime);

state.stop = true;

}

void reportThreadFn(State& state) { auto const delay = makeTs(0, reportIntervalNs);

spinBarrier(state.startBarrier);
while (!state.stop) {
    // try to get consistent snapshot...
    int64_t sec;
    int64_t nsec;
    for (size_t i = 0; i < 10; i++) {
        auto const gen = state.reportGen.load(acq);
        spinWait(10);
        sec = state.reportSec.load(rlx);
        nsec = state.reportNsec.load(rlx);
        spinWait(10);
        auto const gen2 = state.reportGen.load(acq);
        if (gen2 == gen)
            break;
    }

    std::cout << "Remaining: " << sec << "."
              << std::setfill('0') << std::setw(9) << nsec
              << " - signal count: " << s_signalCount
              << std::endl;
    nanosleep(&delay, nullptr);
}

}

int runTest() { s_signalCount = 0; s_signalPending = false;

// install SIGURG handler...
struct sigaction oldSigUrgAction = {};
struct sigaction sigUrgAction = {};
sigUrgAction.sa_handler = signalHandler;
sigaction(SIGURG, &sigUrgAction, &oldSigUrgAction);

auto* const state = new State{};

std::thread sleepThread{[state] { nanosleepThreadFn(*state); }};
std::thread reportThread{[state] { reportThreadFn(*state); }};

auto const sleepThreadHandle = sleepThread.native_handle();

spinBarrier(state->startBarrier);

using Clock = std::chrono::steady_clock;
auto const t0 = Clock::now();

// Hammer the sleep thread with signals...
bool isTimedOut = false;
while (!state->stop && !isTimedOut) {
    if (!s_signalPending) {
        s_signalPending = true;
        spinWait(sigSenderSpin);
        pthread_kill(sleepThreadHandle, SIGURG);
    }

    if (Clock::now() - t0 > timeout) {
        isTimedOut = true;
        break;
    }
}

state->stop = true;
reportThread.join();

if (isTimedOut) {
    std::cout << "ERROR: timeout. sleep thread seems hung." << std::endl;
    // Leak sleepThread and state to avoid crash
    sleepThread.detach();
} else {
    sleepThread.join();
    delete state;
}

// Restore old SIGCHLD action.
sigaction(SIGURG, &oldSigUrgAction, nullptr);

return isTimedOut ? 1 : 0;

}

int main() { return runTest(); }