libsdl-org / SDL

Simple DirectMedia Layer
https://libsdl.org
zlib License

Bug in SDL_DelayNS() #10592

Closed: nightmareci closed this issue 3 weeks ago

nightmareci commented 2 months ago

There's a pretty serious bug in the accurate sleep algorithm in SDL_DelayNS(): for targets with 1 ms-precision sleep functions, the call of SDL_SYS_DelayNS() quoted below rounds down to a 0 ms-duration sleep request, resulting in high CPU/power usage. Right now, the only 1 ms-precision targets are Emscripten and the versions of Windows that must use Sleep(). But it's also possible that some of the seemingly higher-precision sleep functions are 1 ms-precision in some configurations, treating requests shorter than 1 ms as zero-duration sleep requests.

SDL_SYS_DelayNS(SCHEDULING_TIMESLICE_NS - SDL_NS_PER_US);
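
To make the rounding concrete, here's a hypothetical sketch of a 1 ms-precision backend (not SDL's actual code), assuming SCHEDULING_TIMESLICE_NS is 1 ms, which is consistent with the 0 ms rounding described above:

// Hypothetical 1 ms-precision backend, not SDL's actual implementation.
// With SCHEDULING_TIMESLICE_NS == 1 * SDL_NS_PER_MS, the call above requests
// 999999 ns, and the integer division below truncates that to Sleep(0),
// which merely yields the thread instead of sleeping.
static void SDL_SYS_DelayNS(Uint64 ns)
{
    Sleep((DWORD)(ns / SDL_NS_PER_MS)); // 999999 / 1000000 == 0
}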

You could just remove the - SDL_NS_PER_US bit, so the call always requests 1 ms or more and never rounds down to 0 ms on any target. Or have some alternative implementation for 1 ms-precision targets. Or do something else entirely; just be sure every target makes nonzero-duration sleep requests wherever a real sleep is desired.
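
For illustration, either of these shapes would avoid the sub-1 ms request; this is a sketch of the idea, not a patch against SDL's actual source:

// Option 1: drop the subtraction, so a full timeslice is always requested:
SDL_SYS_DelayNS(SCHEDULING_TIMESLICE_NS);

// Option 2: clamp the request so it never falls below 1 ms:
SDL_SYS_DelayNS(SDL_max(SCHEDULING_TIMESLICE_NS - SDL_NS_PER_US, SDL_NS_PER_MS));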

But you could also revert the accurate sleep algorithm altogether, leaving it to users to implement such an algorithm themselves. That's certainly possible for them to do if SDL_DelayNS() is just a thin API over the target-specific sleep functions, and the algorithm itself could be made into an SDL satellite library, probably a single-header library or a simple .h/.c pair to be dropped into projects as-is.
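
The "thin API" here would be nothing more than a direct forward to the per-target sleep, along these lines (a sketch of the shape, not SDL's actual code):

// Thin-API sketch: no pacing logic in SDL_DelayNS() itself, just a direct
// call through to the target-specific sleep function.
void SDL_DelayNS(Uint64 ns)
{
    SDL_SYS_DelayNS(ns);
}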

andreasgrabher commented 2 months ago

I agree that SDL_DelayNS() should be reverted to the previous simpler version.

nightmareci commented 2 months ago

There is an additional consideration around iteration timing with the new SDL_App* callbacks, though, since SDL itself does pacing on behalf of the user code. So maybe accurate sleeping of some sort is warranted there? I'm sure many users would greatly appreciate the sort of simplicity you get on game consoles with guaranteed 60 Hz timing, where it's perfectly safe to always wait on vsync.

nightmareci commented 3 weeks ago

Part of this issue is now handled: SDL_DelayNS() is now just a thin API over the target-specific sleep functions, as I suggested. But the algorithm in SDL_DelayPrecise() is still the same, so here's a freebie I've cooked up that's at least as good as the one in SDL right now and has lower CPU/power usage. On platforms with only 1 ms-precision sleeps natively, this algorithm ends up on par with SDL's current one, but it avoids the issue I reported of requesting sub-1 ms delays in the initial part of the algorithm. For sub-1 ms precise delay requests, it effectively just does SDL_DelayNS(ns):

void SDL_DelayPrecise(Uint64 ns)
{
    Uint64 current_value = SDL_GetTicksNS();
    const Uint64 target_value = current_value + ns;

    // Sleep for a short number of cycles when real sleeps are desired.
    // We'll use 1 ms, it's the minimum guaranteed to produce real sleeps across
    // all platforms.
    const Uint64 SHORT_SLEEP_NS = 1 * SDL_NS_PER_MS;

    // Try to sleep short of target_value. If for some crazy reason
    // a particular platform sleeps for less than 1 ms when 1 ms was requested,
    // that's fine, the code below can cope with that, but in practice no
    // platforms behave that way.
    Uint64 max_sleep_ns = SHORT_SLEEP_NS;
    while (current_value + max_sleep_ns < target_value) {
        current_value = SDL_GetTicksNS();
        // Sleep for a short time
        SDL_SYS_DelayNS(SHORT_SLEEP_NS);
        const Uint64 next_sleep_ns = (SDL_GetTicksNS() - current_value);
        if (next_sleep_ns > max_sleep_ns) {
            max_sleep_ns = next_sleep_ns;
        }
        current_value = SDL_GetTicksNS();
    }

    // Do a shorter sleep of the remaining time here, less the max overshoot in
    // the first loop. Due to maintaining max_sleep_ns as
    // greater-than-or-equal-to-1 ms, we can always subtract off 1 ms to get
    // the duration overshot beyond a 1 ms sleep request; if the system never
    // overshot, great, it's zero duration. By choosing the max overshoot
    // amount, we're likely to not overshoot here. If the sleep here ends up
    // functioning like SDL_DelayNS(0) internally, that's fine, we just don't
    // get to do a more-precise-than-1 ms-resolution sleep to undershoot by a
    // small amount on the current system, but SDL_DelayNS(0) does at least
    // introduce a small, yielding delay on many platforms, better than an
    // unyielding busyloop.
    if (current_value < target_value && target_value - current_value > max_sleep_ns - SHORT_SLEEP_NS) {
        const Uint64 delay_ns = (target_value - current_value) - (max_sleep_ns - SHORT_SLEEP_NS);
        SDL_SYS_DelayNS(delay_ns);
        current_value = SDL_GetTicksNS();
    }

    // We've likely undershot target_value at this point by a pretty small
    // amount, but maybe not. The footgun case if not handled here is where
    // we've undershot by a large amount, like several ms, but still smaller
    // than the amount max_sleep_ns overshot by; in such a situation, the above
    // shorter-sleep block didn't do any delay, the if-block wasn't entered.
    // Also, maybe the shorter-sleep undershot by several ms, so we still don't
    // want to spin a lot then. In such a case, we accept the possibility of
    // overshooting to not spin much, or if overshot here, not at all, keeping
    // CPU/power usage down in any case. Due to scheduler sloppiness, it's
    // entirely possible to end up undershooting/overshooting here by much less
    // than 1 ms even if the current system's sleep function is only 1
    // ms-resolution, as SDL_GetTicksNS() generally is better resolution than 1
    // ms on the systems SDL supports.
    while (current_value + SHORT_SLEEP_NS < target_value) {
        SDL_SYS_DelayNS(SHORT_SLEEP_NS);
        current_value = SDL_GetTicksNS();
    }

    // Spin for any remaining time
    while (current_value < target_value) {
        SDL_CPUPauseInstruction();
        current_value = SDL_GetTicksNS();
    }
}
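
As a usage sketch, here's how a 60 Hz loop might pace itself with SDL_DelayPrecise(); running and update_and_render() are hypothetical stand-ins for application state and work, not part of SDL:

extern bool running;                 // hypothetical application flag
extern void update_and_render(void); // hypothetical application work

// Keep an absolute schedule so oversleep in one frame shortens the next
// frame's delay instead of accumulating drift.
void run_at_60hz(void)
{
    const Uint64 frame_ns = SDL_NS_PER_SECOND / 60;
    Uint64 next_time = SDL_GetTicksNS() + frame_ns;
    while (running) {
        update_and_render();
        const Uint64 now = SDL_GetTicksNS();
        if (now < next_time) {
            SDL_DelayPrecise(next_time - now);
        } else {
            next_time = now; // fell behind; resync rather than sprint to catch up
        }
        next_time += frame_ns;
    }
}
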
slouken commented 3 weeks ago

Your implementation loses the ability to delay less than the minimum OS sleep time, but overall performs better than the original implementation, with much less CPU usage:

Minimum nanosecond delay: 533100 ns
Timing 100 frames at 60 FPS
Overslept 42.55 ms
Minimum precise delay: 573500 ns
Timing 100 frames at 60 FPS
Overslept 0.23 ms
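
Numbers like these can come from a harness along the following lines; this is a sketch rather than the actual test code, and time_frames is a hypothetical helper:

// Hypothetical harness: measure total oversleep across 100 frames at 60 FPS.
static void time_frames(const char *name, void (SDLCALL *delay)(Uint64))
{
    const Uint64 frame_ns = SDL_NS_PER_SECOND / 60;
    Uint64 overslept_ns = 0;
    for (int i = 0; i < 100; ++i) {
        const Uint64 start = SDL_GetTicksNS();
        delay(frame_ns);
        const Uint64 elapsed = SDL_GetTicksNS() - start;
        if (elapsed > frame_ns) {
            overslept_ns += elapsed - frame_ns;
        }
    }
    SDL_Log("%s: overslept %.2f ms", name, (double)overslept_ns / SDL_NS_PER_MS);
}

// time_frames("SDL_DelayNS", SDL_DelayNS);
// time_frames("SDL_DelayPrecise", SDL_DelayPrecise);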

I'll go ahead and merge it, thanks!