libsdl-org / SDL

Simple Directmedia Layer
https://libsdl.org
zlib License
10.18k stars 1.86k forks source link

SDL3: Bugs with SDL_SetRenderVSync()/SDL_App* callbacks on Windows #11093

Open nightmareci opened 1 month ago

nightmareci commented 1 month ago

High CPU usage can be observed on Windows with vsync at 1, after making this change to the snake game example. The high CPU usage might be related to the code driving the SDL_App* callbacks, though:

// ...
    if (!SDL_CreateWindowAndRenderer("examples/game/snake", SDL_WINDOW_WIDTH, SDL_WINDOW_HEIGHT, 0, &as->window, &as->renderer)) {
        return SDL_APP_FAILURE;
    }
    if (!SDL_SetRenderVSync(as->renderer, 1)) {
        return SDL_APP_FAILURE;
    }
// ...

Setting vsync to SDL_WINDOW_SURFACE_VSYNC_DISABLED works fine, no unusual high CPU usage then. I've also seen some other issues with SDL_SetRenderVSync() on Windows, like all vsync values besides SDL_WINDOW_SURFACE_VSYNC_DISABLED and 1 not working (That operation is not supported returned by SDL_GetError()).

AntTheAlchemist commented 1 month ago

I've just starting trying out the SDL_App* callbacks. In my own basic app everything seems okay, only CPU & GPU high usage on "0" vsync, as expected. I am getting a "op not supported" error on adaptive vsync (-1) though, due to this (is this how it should work?):

static bool D3D11_SetVSync(SDL_Renderer *renderer, const int vsync)
{
    D3D11_RenderData *data = (D3D11_RenderData *)renderer->internal;

    if (vsync < 0) {
        return SDL_Unsupported();
    }

Just tried snake. It's thrashing the CPU & GPU by default. But it works fine when inserting SDL_SetRenderVSync(as->renderer, 1), so I'm not seeing the same there.

[Stupid question alert] The snake game's AppState data - isn't that at risk of being accessed by two threads at the same time if SDL_AppIterate and SDL_AppEvent are running together on different threads? "There is (currently) no guarantee about what thread this will be called from"

nightmareci commented 1 month ago

[Stupid question alert] The snake game's AppState data - isn't that at risk of being accessed by two threads at the same time if SDL_AppIterate and SDL_AppEvent are running together on different threads? "There is (currently) no guarantee about what thread this will be called from"

Yeah, I've wondered about that aspect of the snake game's code. The event function is writing, iterate function reading, so that's a race condition if the functions are on different threads, as far as I can tell. A potential "easy fix" is to force a shared mutex lock for both functions, so they're kept synchronized, either on the user's side, or SDL's side.

slouken commented 1 month ago

@icculus, what's our guarantee about event dispatch with respect to app iteration? If it's guaranteed to be multi-threaded on some platforms, that's going to be really awkward for application code.

icculus commented 1 month ago

I think AppEvent and AppIterate already serialize...? If it doesn't, it's a bug.

icculus commented 1 month ago

Also, if vsync==1 and the CPU usage is high, then a) the OS is busywaiting for the sync event or b) it didn't work at all and you're running frames as fast as possible.

If setting vsync==0 causes CPU to reduce, too, something somewhere went really wrong.

AntTheAlchemist commented 1 month ago

I've just re-read the SDL_AppEvent docs with a better understanding (apologies - I'm new at threads and still learning). Serialisation is mentioned, so there is no data race issue - my question was indeed stupid.

icculus commented 1 month ago

I haven't tried this on Windows yet, but here's a test app:

#define SDL_MAIN_USE_CALLBACKS 1
#include <SDL3/SDL.h>
#include <SDL3/SDL_main.h>

// This is probably a deeply stupid way to do this.
typedef struct FpsMetrics
{
    Uint64 rolling_ticks;  // sum of the history.
    Uint64 rolling_iterations; // sum of the history.
    Uint64 start_ticks;
    Uint64 iterations;
    Uint64 tick_history[8];
    Uint64 iteration_history[8];
    int history_idx;
} FpsMetrics;

static Uint64 fps_iteration(FpsMetrics *metrics)
{
    const Uint64 now = SDL_GetTicks();
    Uint64 elapsed = now - metrics->start_ticks;

    metrics->iterations++;

    if (elapsed >= 125) {
        const Uint64 iterations = metrics->iterations;
        const size_t histlen = SDL_arraysize(metrics->tick_history);
        const int idx = metrics->history_idx;
        metrics->rolling_ticks -= metrics->tick_history[idx];
        metrics->rolling_ticks += elapsed;
        metrics->rolling_iterations -= metrics->iteration_history[idx];
        metrics->rolling_iterations += iterations;
        metrics->tick_history[idx] = elapsed;
        metrics->iteration_history[idx] = iterations;
        metrics->start_ticks = now;
        metrics->iterations = 0;
        metrics->history_idx++;
        //printf("FPS %p rolling_ticks=%u, rolling_iterations=%u\n", metrics, (unsigned int) metrics->rolling_ticks, (unsigned int) metrics->rolling_iterations);
        if (metrics->history_idx >= histlen) {
            metrics->history_idx = 0;
        }
        elapsed = 0;
    }

    const Uint64 ticks = metrics->rolling_ticks + elapsed;
    const Uint64 iterations = metrics->rolling_iterations + metrics->iterations;
    return ticks ? ((Uint64) ((((double) iterations) / (((double) ticks) / 1000.0)) + 0.5)) : 0;
}

static SDL_Window *window = NULL;
static SDL_Renderer *renderer = NULL;
static FpsMetrics fps_metrics;
static int vsync = 1;

SDL_AppResult SDL_AppInit(void **appstate, int argc, char *argv[])
{
    if (!SDL_Init(SDL_INIT_VIDEO)) {
        SDL_Log("Couldn't initialize SDL: %s", SDL_GetError());
        return SDL_APP_FAILURE;
    } else if (!SDL_CreateWindowAndRenderer("testvsync", 640, 480, 0, &window, &renderer)) {
        SDL_Log("Couldn't create window/renderer: %s", SDL_GetError());
        return SDL_APP_FAILURE;
    }

    SDL_SetRenderVSync(renderer, vsync);

    return SDL_APP_CONTINUE;
}

SDL_AppResult SDL_AppEvent(void *appstate, SDL_Event *event)
{
    if (event->type == SDL_EVENT_QUIT) {
        return SDL_APP_SUCCESS;
    } else if (event->type == SDL_EVENT_KEY_DOWN) {
        if (event->key.key == SDLK_ESCAPE) {
            return SDL_APP_SUCCESS;
        } else if (event->key.key == SDLK_V) {
            vsync = vsync ? 0 : 1;
            SDL_SetRenderVSync(renderer, vsync);
        }
    }
    return SDL_APP_CONTINUE;
}

SDL_AppResult SDL_AppIterate(void *appstate)
{
    char vsyncstr[64];
    char fpsstr[64];
    int w, h;
    float x, y;

    SDL_snprintf(vsyncstr, sizeof (vsyncstr), "vsync: %s", vsync ? "ON" : "OFF");
    SDL_snprintf(fpsstr, sizeof (fpsstr), "FPS: %" SDL_PRIu64, fps_iteration(&fps_metrics));

    SDL_GetRenderOutputSize(renderer, &w, &h);

    SDL_SetRenderDrawColor(renderer, 0, 0, 0, 255);
    SDL_RenderClear(renderer);
    SDL_SetRenderDrawColor(renderer, 255, 255, 255, 255);

    x = (float) ((w - ((int) (9 * SDL_DEBUG_TEXT_FONT_CHARACTER_SIZE))) / 2);
    y = (float) ((h - SDL_DEBUG_TEXT_FONT_CHARACTER_SIZE) / 4);
    SDL_RenderDebugText(renderer, x, y, vsyncstr);

    x = (float) ((w - ((int) (8 * SDL_DEBUG_TEXT_FONT_CHARACTER_SIZE))) / 2);
    y = (float) ((h - SDL_DEBUG_TEXT_FONT_CHARACTER_SIZE) / 2);
    SDL_RenderDebugText(renderer, x, y, fpsstr);

    x = (float) ((w - ((int) (24 * SDL_DEBUG_TEXT_FONT_CHARACTER_SIZE))) / 2);
    y += (float) ((h - SDL_DEBUG_TEXT_FONT_CHARACTER_SIZE) / 4);
    SDL_RenderDebugText(renderer, x, y, "Press V to toggle vsync.");

    SDL_RenderPresent(renderer);

    return SDL_APP_CONTINUE;
}

void SDL_AppQuit(void *appstate, SDL_AppResult result)
{
}

On x11 (well, XWayland), this gets like 3500+ fps on my laptop and eats like 65% of the CPU without vsync, and with vsync, it stays a solid 60fps and uses 3-5% of the CPU.

If it does the right thing on Windows, we can probably close this.

nightmareci commented 1 month ago

I tested that on Windows, it produces high CPU usage with vsync both on or off, using the same amount of high CPU percentage in either case, but displays 165 FPS with either setting (my monitor was running at 165 Hz while testing). One odd thing though, holding the mouse button on the title bar and moving the window around, the FPS goes down to ~60 FPS and CPU usage goes to 0%.

I also tested it on the same machine running Linux, both with X11/XWayland and native Wayland (SDL_VIDEO_DRIVER=wayland), I got unbounded FPS with vsync off like you reported, but with vsync on it runs at 165 FPS (my monitor's set refresh rate during testing, again).

Isn't the X11/Wayland behavior incorrect? I thought that the iterate callback was paced out by SDL using SDL_DelayPrecise(), controlled by SDL_HINT_MAIN_CALLBACK_RATE, with that at 60 FPS by default? And since your code doesn't change the callback rate, it should be running at 60 FPS, even on my setup, with 165 Hz refresh rate.

maia-s commented 1 month ago

I tested it on macos (mbp M2 pro) and got some weird behaviour. I have two screens connected, one 120 hz (the laptop screen) and one 60 hz.

With vsync enabled, it syncs to 60 on the 60hz screen and 120 on the 120hz screen, and it adapts as I move the window between screens. That's nice! It uses around 8-10% cpu for 60hz, but WindowServer uses 30-40% on top of that.

With vsync disabled, without moving between screens with vsync off, it fluctuates between 300 and 1000 fps on either screen, usually hovering around 5-600. It seems a little low for something as simple as this. It uses ~50% cpu.

When I move the window between screens with vsync disabled, it syncs to the screens's refresh rate after being moved to the new screen. Turning vsync on or off has no effect after this. Actually, after having done this it always syncs to 60 on the 60hz screen with vsync disabled even without moving the window after quitting and restarting the test program. On the 120hz screen it still runs unlimited with vsync off.

nightmareci commented 1 month ago

I think I've narrowed down what's causing this issue: Iterations aren't being paced out at all, in the generic main callbacks implementation; that line's if-expression I linked to is always true, when any windows exist. So, the only "pacing" we're observing, where the generic implementation is in use, is whatever rate is produced by blocking vsync, so the callback rate hint ends up entirely unused.

I propose that the callback rate be respected and maintained with SDL_DelayPrecise() either when no windows exist or when all windows are using some combination of SDL_WINDOW_SURFACE_VSYNC_DISABLED and SDL_WINDOW_SURFACE_VSYNC_ADAPTIVE. Should there be no pacing at all when the callback rate hint is 0, so without vsync, the user's iteration callback is expected to do the pacing logic? But, with nonzero callback rate, there's some discussion to be had around what pacing behavior should be produced when vsync is in effect, or there are multiple windows on different displays with different refresh rates while vsync is in effect on some/all windows.

It might even make sense to let the user know when to present/not present. E.g., adding a new API returning a bool, maybe even taking a window, if true then "you can present this iteration to get the callback rate requested, because vsync is on right now", to get the correct behavior with a callback rate of 120 with 60 Hz vsync (rendering every iteration, the iteration callback's own code having no timing logic, the callback rate would never be observed); that function would also allow the user to optimize rendering per window, not rendering to and presenting to a window at all when it returns false for a window. Or the functions that normally introduce blocking vsync don't present/block, the iteration callback driver causing blocking presents per window when due, for the sake of getting the callback rate. I'm not familiar with how vsync blocking works with multiple windows each with vsync on in a single threaded program though, do you get a block for each window present, introducing a lot of delay in the iteration callback's thread (60 Hz on two displays with a window on each display, you get 2 / 60 seconds of delay when presenting to both windows in one iteration)?