Request: SDL_AccurateDelay()

TylerGlaiel commented 2 months ago

Sample implementation. It needs no platform-specific code apart from the optional YieldProcessor() hint in the spin loop: call SDL_Delay, undershooting by a set amount, then wake up and spin until the exact amount of time has passed. It increases the undershoot amount if it detects that it overslept.

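#include <math.h>      // floor
#include <stdint.h>    // int64_t, uint64_t
#include <SDL3/SDL.h>
#include <Windows.h>   // YieldProcessor

//accurate sleep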
void SDL_AccurateDelay(double time_ms) {
    static double wait_resolution = 1;
    static uint64_t perf_freq = SDL_GetPerformanceFrequency();
    int64_t current_ticks = SDL_GetPerformanceCounter();
    int64_t wait_until = current_ticks + time_ms * perf_freq/1000.0;
    int wait_time_ms = floor(time_ms - wait_resolution);

    //sleep for as long as is safe without overshooting
    if(wait_time_ms >= 1) {
        SDL_Delay(wait_time_ms);
    }

    int64_t sleep_diff = (int64_t)SDL_GetPerformanceCounter() - wait_until;

    if(sleep_diff > 0) {
        double oversleep_ms = sleep_diff*1000.0/perf_freq;
        //we overshot, so the wait resolution is probably too low.
        //tweak wait_resolution (slightly) for the next time this is called
        //if the diff was more than 5ms, just assume it's a weird OS hiccup and not an issue with resolution
        if(oversleep_ms < 5) {
            wait_resolution += oversleep_ms+0.1;

            //5ms is a lot, so let's have that be the max.
            if(wait_resolution > 5) wait_resolution = 5;
            //todo: should we adjust the resolution back down if we detect that we're consistently undersleeping by a lot?
            //since we're just spinning here, there's room to do a little extra work
        }
    }

    //spin till the target time
    while(SDL_GetPerformanceCounter() < wait_until) {
        YieldProcessor(); //expands to the _mm_pause() intrinsic on x86 Windows; not the same thing as a thread yield
    }
}

slouken commented 2 months ago

I understand why you want this, but this is loaded with footguns. It makes more sense in the specific context of frame pacing than a general function that people might use everywhere and wonder why their application is using so much CPU time.

@icculus, thoughts?

TylerGlaiel commented 2 months ago

I mean, I also kind of think it's a footgun that SDL_Delay is very inaccurate (I've seen a lot of bad sample code out there that uses it incorrectly as a result).

It seems there's some processor intrinsic / asm that can help with the CPU usage here (_mm_pause()), but I don't really know quite how those work. (Edit: on Windows, shove YieldProcessor(); in the spin loop.)

flibitijibibo commented 2 months ago

We implemented a similar thing for FNA which tries to factor in scheduler precision:

https://github.com/FNA-XNA/FNA/commit/46216d6cd1ff832eaafa2aef96a088b13a474b25

It does work but it's also C# so it's probably not as good as it could be. I dunno what an SDL variant would look like but it's at least another example where such a function would probably simplify things a lot and benefit other frameworks too.

slouken commented 2 months ago

Fair enough, we'll consider this for SDL3

slouken commented 2 months ago

FYI, there's an interesting discussion of this topic at https://blog.bearcats.nl/perfect-sleep-function/

TylerGlaiel commented 2 months ago

> Can you verify that this is actually needed on Windows?

Yeah, SDL_Delay forwards to SDL_DelayNS (which I do believe was compiled with that flag on) and it was regularly oversleeping, anywhere from 0 to 1ms on my machine

> Also, you don't actually want to yield the processor, because you might be rescheduled much later than you want.

YieldProcessor(); on Windows expands to the _mm_pause() intrinsic, which is documented as "an instruction that makes spin loops use less energy" / "a 140-cycle noop": https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm_pause&ig_expand=4897 I'm not an expert on this, but I believe that's a different thing than yielding the thread back to the scheduler? I appear to get accurate sleeps when I do that.

With that in there it seems to be sleeping the exact amounts requested, since 140 cycles is well under the precision of GetPerformanceCounter anyway
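To make the distinction concrete, here is a minimal portable sketch of a spin loop that uses the pause hint without ever handing the thread back to the scheduler. It uses `std::chrono::steady_clock` in place of SDL's performance counter so it stands alone; `cpu_relax` and `spin_until` are hypothetical names, and the no-op fallback on non-x86 targets is an assumption. The pause latency also varies between processor generations, so the 140-cycle figure should not be relied on.

```cpp
#include <chrono>

#if defined(__x86_64__) || defined(_M_X64) || defined(__i386__) || defined(_M_IX86)
#include <immintrin.h>
// x86: a hint that we are in a spin-wait loop. The thread keeps running;
// nothing is handed back to the OS scheduler.
static inline void cpu_relax() { _mm_pause(); }
#else
// Assumption: on other architectures this sketch simply does nothing.
static inline void cpu_relax() {}
#endif

// Hypothetical helper: busy-waits until `deadline` without sleeping or
// yielding the thread.
void spin_until(std::chrono::steady_clock::time_point deadline) {
    while (std::chrono::steady_clock::now() < deadline) {
        cpu_relax();
    }
}
```

Swapping `cpu_relax()` for something like `std::this_thread::yield()` would be the scheduler-level yield the quoted comment warns about, since the thread might not get the core back before the deadline.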

TylerGlaiel commented 2 months ago

Also, an alternative formulation of the same thing would be SDL_WaitUntil(int64 QPCTime); I'm unsure whether that or AccurateDelay is nicer.
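A rough sketch of how the two formulations relate, assuming nothing about what SDL would ship: the deadline form is the primitive, and the relative delay is a one-line wrapper around it. It uses `std::chrono` rather than QPC ticks so it is self-contained; `wait_until`, `accurate_delay`, and the 2 ms margin are hypothetical.

```cpp
#include <chrono>
#include <thread>

using Clock = std::chrono::steady_clock;

// Hypothetical deadline-based wait: sleeps while more than `margin` remains,
// then spins for the rest. The 2 ms default margin is an assumption.
void wait_until(Clock::time_point deadline,
                Clock::duration margin = std::chrono::milliseconds(2)) {
    if (deadline - Clock::now() > margin) {
        // Coarse sleep that deliberately stops short of the deadline.
        std::this_thread::sleep_until(deadline - margin);
    }
    while (Clock::now() < deadline) {
        // Busy-wait for whatever time is left.
    }
}

// The relative form, expressed in terms of the absolute one.
void accurate_delay(Clock::duration d) {
    wait_until(Clock::now() + d);
}
```

For frame pacing the deadline form has one practical advantage: a loop that does `deadline += frame_time; wait_until(deadline);` does not accumulate drift from the work done between waits, whereas repeated relative delays do.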

slouken commented 2 months ago

Okay, I'm making SDL_DelayNS() the precise sleeping function in SDL.

Here's an adaptation of computerBear's sleep function to compare with SDL's method:

// The PERFECT sleeping function for Windows.
// - Sleep times accurate to 1 microsecond
// - Low CPU usage
// - Runs on Windows Vista and up

#include <Windows.h>
#include <SDL3/SDL.h>
#include <SDL3/SDL_main.h>
#pragma comment(lib, "Winmm.lib") // timeGetDevCaps, timeBeginPeriod

HANDLE Timer;
int SchedulerPeriodMs;
INT64 QpcPerSecond;

void PreciseSleep(double seconds)
{
    LARGE_INTEGER qpc;
    QueryPerformanceCounter(&qpc);
    INT64 targetQpc = (INT64)(qpc.QuadPart + seconds * QpcPerSecond);

    if (Timer) // Try using a high resolution timer first.
    {
        const double TOLERANCE = 0.001'02;
        INT64 maxTicks = (INT64)SchedulerPeriodMs * 9'500;
        for (;;) // Break sleep up into parts that are lower than scheduler period.
        {
            double remainingSeconds = (targetQpc - qpc.QuadPart) / (double)QpcPerSecond;
            INT64 sleepTicks = (INT64)((remainingSeconds - TOLERANCE) * 10'000'000);
            if (sleepTicks <= 0)
                break;

            LARGE_INTEGER due;
            due.QuadPart = -(sleepTicks > maxTicks ? maxTicks : sleepTicks);
            SetWaitableTimerEx(Timer, &due, 0, NULL, NULL, NULL, 0);
            WaitForSingleObject(Timer, INFINITE);
            QueryPerformanceCounter(&qpc);
        }
    } else // Fallback to Sleep.
    {
        const double TOLERANCE = 0.000'02;
        double sleepMs = (seconds - TOLERANCE) * 1000 - SchedulerPeriodMs; // Sleep for 1 scheduler period less than requested.
        int sleepSlices = (int)(sleepMs / SchedulerPeriodMs);
        if (sleepSlices > 0)
            Sleep((DWORD)sleepSlices * SchedulerPeriodMs);
        QueryPerformanceCounter(&qpc);
    }

    while (qpc.QuadPart < targetQpc) // Spin for any remaining time.
    {
        YieldProcessor();
        QueryPerformanceCounter(&qpc);
    }
}

int main(int argc, char *argv[])
{
    // Initialization
    double target_delay = 1 / 60.0;
    double total_oversleep = 0.0;
    SDL_bool use_SDL = (argc == 2 && SDL_strcmp(argv[1], "--SDL") == 0);
    Timer = CreateWaitableTimerExW(NULL, NULL, CREATE_WAITABLE_TIMER_HIGH_RESOLUTION, TIMER_ALL_ACCESS);
    TIMECAPS caps;
    timeGetDevCaps(&caps, sizeof caps);
    timeBeginPeriod(caps.wPeriodMin);
    SchedulerPeriodMs = (int)caps.wPeriodMin;
    LARGE_INTEGER qpf;
    QueryPerformanceFrequency(&qpf);
    QpcPerSecond = qpf.QuadPart;

    // Game loop
    for (int i = 0; i < 100; ++i) {
        LARGE_INTEGER qpc0, qpc1;
        QueryPerformanceCounter(&qpc0);
        if (use_SDL) {
            SDL_DelayNS(SDL_NS_PER_SECOND / 60);
        } else {
            PreciseSleep(1 / 60.0);
        }
        QueryPerformanceCounter(&qpc1);
        double dt = (qpc1.QuadPart - qpc0.QuadPart) / (double)QpcPerSecond;
        double oversleep = (dt - target_delay);
        total_oversleep += oversleep;
        SDL_Log("Slept for %.2f ms (overslept %.2f ms)\n", 1000 * dt, 1000 * oversleep);
    }
    if (use_SDL) {
        SDL_Log("Used SDL_DelayNS() method, total overslept: %.2f ms\n", 1000 * total_oversleep);
    } else {
        SDL_Log("Used PreciseSleep() method, total overslept: %.2f ms\n", 1000 * total_oversleep);
    }
    return 0;
}

Here's the output on my machine over 10 runs, discarding scheduling outliers:

INFO: Used PreciseSleep() method, total overslept: 1.67 ms
INFO: Used PreciseSleep() method, total overslept: 3.10 ms
INFO: Used PreciseSleep() method, total overslept: 6.38 ms
INFO: Used PreciseSleep() method, total overslept: 4.75 ms
INFO: Used PreciseSleep() method, total overslept: 2.18 ms
INFO: Used PreciseSleep() method, total overslept: 2.83 ms
INFO: Used PreciseSleep() method, total overslept: 2.91 ms
INFO: Used PreciseSleep() method, total overslept: 6.61 ms
INFO: Used PreciseSleep() method, total overslept: 6.97 ms
INFO: Used PreciseSleep() method, total overslept: 10.62 ms

INFO: Used SDL_DelayNS() method, total overslept: 0.52 ms
INFO: Used SDL_DelayNS() method, total overslept: 1.60 ms
INFO: Used SDL_DelayNS() method, total overslept: 1.08 ms
INFO: Used SDL_DelayNS() method, total overslept: 2.87 ms
INFO: Used SDL_DelayNS() method, total overslept: 3.05 ms
INFO: Used SDL_DelayNS() method, total overslept: 1.24 ms
INFO: Used SDL_DelayNS() method, total overslept: 1.36 ms
INFO: Used SDL_DelayNS() method, total overslept: 8.88 ms
INFO: Used SDL_DelayNS() method, total overslept: 2.99 ms
INFO: Used SDL_DelayNS() method, total overslept: 1.84 ms

Note that both of these methods had occasional scheduling hiccoughs which caused "total overslept" to jump to over 100 ms.

slouken commented 2 months ago

Done! Thanks for the suggestion! :)

libsdl-org / SDL

Request: SDL_AccurateDelay() #10210