libsdl-org / SDL

Simple DirectMedia Layer
https://libsdl.org
zlib License

Improving Windows OpenGL VSync in windowed mode #5797

Open PJB3005 opened 2 years ago

PJB3005 commented 2 years ago

I'm trying to move my app from GLFW to SDL2 and one of the first things I noticed is that OpenGL vsync stutters a lot on SDL2.

The problem here appears to be that SwapBuffers(HDC) synchronizes to the monitor instead of to DWM (Windows' desktop compositor). While the frame timings are an extremely solid 16.666~ms, that does not line up with when DWM needs the frame in order to composite it to the screen. GLFW calls DwmFlush() before running SwapBuffers(HDC) if it detects that DWM is running, and this seems to work pretty solidly: the frame timing graph isn't as smooth as before, but there is no stutter. It also requires some juggling with wglSwapInterval(), since we have to report 0 when we detect DWM isn't running... Ugh. It's also not clear to me how this API is supposed to work in multi-monitor scenarios (DWM is supposed[^1] to be able to run multiple monitors with different refresh rates smoothly).
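(For illustration, a minimal sketch of that approach — my paraphrase, not GLFW's or SDL's actual code; the function name is just illustrative:)

```cpp
// Minimal sketch of the DwmFlush-before-SwapBuffers approach described above.
#include <windows.h>
#include <dwmapi.h>   // DwmIsCompositionEnabled, DwmFlush; link with dwmapi.lib

void SwapBuffersCompositorAware(HDC hdc)
{
    BOOL composited = FALSE;
    // On Windows 8+ composition is always on; the check matters on Vista/7.
    if (SUCCEEDED(DwmIsCompositionEnabled(&composited)) && composited) {
        // Block until DWM finishes its current composition pass, so the swap
        // below lands right at the start of the next composition window.
        DwmFlush();
    }
    SwapBuffers(hdc);
}
```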

To elaborate on frame times with DwmFlush(): they don't seem significantly worse than IDXGISwapChain::Present(), which also does not stutter. If I load my system (by just recompiling my project), the frame times do get less smooth (that is, my program is woken up at less consistent intervals), but the frames are still presented at a solid 60 Hz. Because of the inconsistent frame times this would still mess up any game loop relying on delta time, though, unless my game loop just sucks. I wonder if there is a more accurate way to get timing info to avoid this? Also, of course, if your game is CPU intensive, "losing 2 ms this frame and getting 2 ms extra next frame" could well be a dealbreaker that causes dropped frames.
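(On the "more accurate timing info" question: DWM does expose some of its own timing through DwmGetCompositionTimingInfo(). A sketch of reading it — I don't know how reliable these numbers are across drivers and Windows versions, and the helper name is just illustrative:)

```cpp
// Sketch: read DWM's composition refresh rate and the QPC timestamp of the
// last vblank it observed. Treat these values as a hint, not ground truth.
#include <windows.h>
#include <dwmapi.h>   // link with dwmapi.lib

bool GetCompositorTiming(double *refreshHz, ULONGLONG *lastVBlankQpc)
{
    DWM_TIMING_INFO info = {};
    info.cbSize = sizeof(info);
    // Since Windows 8.1 the HWND argument must be NULL (screen-wide timing).
    if (FAILED(DwmGetCompositionTimingInfo(NULL, &info))) {
        return false;
    }
    *refreshHz = (double)info.rateRefresh.uiNumerator /
                 (double)info.rateRefresh.uiDenominator;
    *lastVBlankQpc = info.qpcVBlank;  // compare against QueryPerformanceCounter()
    return true;
}
```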

Some potentially relevant info I found online:

Reliable windowed vsync with OpenGL on Windows? - StackOverflow: contains a lot of testing. In the comments they mention they found DwmFlush() to be the most reliable (when called after SwapBuffers(), in contrast to what GLFW does). They also mention an API called D3DKMTWaitForVerticalBlankEvent and an issue on the Chromium bug tracker investigating this stuff:

Issue 467617: How to dramatically improve Chrome's requestAnimationFrame VSYNC accuracy in Windows: the aforementioned Chromium issue. It mentions the DwmFlush() timing inconsistency described above. A lot of the discussion revolves around D3DKMTWaitForVerticalBlankEvent, which is intended to be a driver-level API for vsync. The suspicious thing is that, from what I can gather, this API sits below DWM and therefore would not help with syncing to DWM itself, so you have the exact same problem. I'm really not sure what to make of this issue. What I especially do not get is why they don't just vsync with IDXGISwapChain::Present() (reminder: browsers use ANGLE, so they use D3D underneath). They mention having to turn off ANGLE vsync because DWM already does vsync. This makes no sense to me since, as far as I can tell, IDXGISwapChain::Present() works fine with vsync even when DWM is running. Now, this probably isn't too relevant to SDL2 since, well, OpenGL not D3D, but it still confuses the hell out of me.
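(For reference, the D3DKMT path they discuss looks roughly like the sketch below. It's based on the public d3dkmthk.h declarations and untested by me, and as noted above it waits on the hardware vblank rather than on DWM, so it doesn't address the compositor problem by itself.)

```cpp
// Sketch of D3DKMTWaitForVerticalBlankEvent usage (kernel-thunk APIs exported
// by gdi32, declared in d3dkmthk.h in recent Windows SDKs). This waits for the
// *hardware* vblank of the display source, not for DWM's composition pass.
#include <windows.h>
#include <d3dkmthk.h>   // link with gdi32.lib

void WaitForVBlankOnPrimaryDisplay()
{
    D3DKMT_OPENADAPTERFROMHDC open = {};
    open.hDc = GetDC(NULL);  // DC covering the primary display
    if (D3DKMTOpenAdapterFromHdc(&open) == 0 /* STATUS_SUCCESS */) {
        D3DKMT_WAITFORVERTICALBLANKEVENT wait = {};
        wait.hAdapter = open.hAdapter;
        wait.hDevice = 0;
        wait.VidPnSourceId = open.VidPnSourceId;
        D3DKMTWaitForVerticalBlankEvent(&wait);   // blocks until the next vblank

        D3DKMT_CLOSEADAPTER close = {};
        close.hAdapter = open.hAdapter;
        D3DKMTCloseAdapter(&close);
    }
    ReleaseDC(NULL, open.hDc);
}
```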

For what it's worth, the relevant testing from me here was done on an Optimus system with an NVIDIA GPU and an Intel iGPU.

[^1]: DWM smoothly handling different refresh rates per monitor doesn't really seem to work for me. My setup is an Optimus laptop with the primary monitor being 75 Hz attached to the dGPU over HDMI, while the internal display is at 60 Hz. I cannot get Windows to vsync programs on the laptop monitor at 60 Hz and the primary monitor at 75 Hz at all (even using DXGI to present, etc.)... I just put the primary monitor at 60 Hz because those extra 15 Hz aren't worth my secondary monitor stuttering like mad.

slime73 commented 2 years ago

If it helps, here's a workaround using DwmFlush conservatively that I implemented for my own code. It's had a lot of testing, but only on my own system, so maybe there are corner cases it doesn't handle. https://github.com/love2d/love/blob/5175b0d1b599ea4c7b929f6b4282dd379fa116b8/src/modules/window/sdl/Window.cpp#L1018

PJB3005 commented 2 years ago

Ah yeah, I did come across the LOVE2D issue while I was googling around for this. Figured we ought to implement this upstream if possible, though.

For what it's worth, here is the GLFW logic. Some things to note:

  1. They do not dynamically mess with the swap interval, and yes, this causes bugs for them (e.g. entering/leaving fullscreen sometimes requires resetting vsync because the swap interval ends up wrong).
  2. They implement a swap interval greater than 1 with a loop, if one is set (see the sketch below).
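(A minimal sketch of point 2, again my paraphrase rather than GLFW's literal code; the function name is just illustrative:)

```cpp
// Sketch: emulating a swap interval > 1 when DWM is compositing by waiting
// for that many composition passes before swapping.
#include <windows.h>
#include <dwmapi.h>   // link with dwmapi.lib

void SwapWithEmulatedInterval(HDC hdc, int interval)
{
    BOOL composited = FALSE;
    if (SUCCEEDED(DwmIsCompositionEnabled(&composited)) && composited) {
        for (int i = 0; i < interval; i++) {
            DwmFlush();   // each call returns after roughly one composition pass
        }
    }
    SwapBuffers(hdc);
}
```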
PJB3005 commented 2 years ago

Did some more testing. I did state that DwmFlush() didn't seem much worse than IDXGISwapChain::Present(), BUT that only seems to be the case if you're not using the DXGI flip model. With the DXGI flip model (what D3D apps should have been using ever since Windows 8), my frame timings are again smooth as butter, even under heavy system load.
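(For anyone unfamiliar, "flip model" just means a DXGI swapchain created with one of the FLIP swap effects. A minimal D3D11 sketch, error handling omitted and names purely illustrative:)

```cpp
// Sketch: creating a flip-model swapchain for a window.
#include <d3d11.h>
#include <dxgi1_2.h>
#pragma comment(lib, "d3d11.lib")

IDXGISwapChain1 *CreateFlipModelSwapChain(HWND hwnd, ID3D11Device **outDevice)
{
    ID3D11DeviceContext *context = nullptr;
    D3D11CreateDevice(nullptr, D3D_DRIVER_TYPE_HARDWARE, nullptr, 0,
                      nullptr, 0, D3D11_SDK_VERSION, outDevice, nullptr, &context);

    // Walk device -> adapter -> factory to reach CreateSwapChainForHwnd.
    IDXGIDevice *dxgiDevice = nullptr;
    (*outDevice)->QueryInterface(IID_PPV_ARGS(&dxgiDevice));
    IDXGIAdapter *adapter = nullptr;
    dxgiDevice->GetAdapter(&adapter);
    IDXGIFactory2 *factory = nullptr;
    adapter->GetParent(IID_PPV_ARGS(&factory));

    DXGI_SWAP_CHAIN_DESC1 desc = {};                    // width/height 0 = use the window size
    desc.Format = DXGI_FORMAT_B8G8R8A8_UNORM;
    desc.SampleDesc.Count = 1;
    desc.BufferUsage = DXGI_USAGE_RENDER_TARGET_OUTPUT;
    desc.BufferCount = 2;                               // flip model requires >= 2 buffers
    desc.SwapEffect = DXGI_SWAP_EFFECT_FLIP_DISCARD;    // Windows 10; FLIP_SEQUENTIAL on Win8

    IDXGISwapChain1 *swapchain = nullptr;
    factory->CreateSwapChainForHwnd(*outDevice, hwnd, &desc, nullptr, nullptr, &swapchain);
    // Vsynced present each frame: swapchain->Present(1, 0);
    return swapchain;
}
```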

It's probably not possible for SDL to transparently make use of this, though. Apps could theoretically use WGL_NV_DX_interop2 to use a DXGI swapchain with OpenGL to take advantage of this (I guess I ought to experiment with that).
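(To make the idea concrete, the core of that interop path would look roughly like the sketch below. It is based on the WGL_NV_DX_interop2 spec, not on anything SDL does today; it assumes a current GL context, a GL loader providing the FBO entry points, a D3D11 device/context, a flip-model swapchain, and a shared ID3D11Texture2D to render into. All helper names are illustrative.)

```cpp
// Sketch of presenting OpenGL through a DXGI swapchain via WGL_NV_DX_interop2.
#include <windows.h>
#include <d3d11.h>
#include <GL/gl.h>
#include <GL/wglext.h>   // PFNWGLDX*NVPROC typedefs, WGL_ACCESS_READ_WRITE_NV

// Extension entry points, loaded once via wglGetProcAddress("wglDXOpenDeviceNV") etc.
extern PFNWGLDXOPENDEVICENVPROC     wglDXOpenDeviceNV;
extern PFNWGLDXREGISTEROBJECTNVPROC wglDXRegisterObjectNV;
extern PFNWGLDXLOCKOBJECTSNVPROC    wglDXLockObjectsNV;
extern PFNWGLDXUNLOCKOBJECTSNVPROC  wglDXUnlockObjectsNV;

// One-time setup: expose a shared D3D11 texture to GL as a renderbuffer on an FBO.
HANDLE SetupInterop(ID3D11Device *d3dDevice, ID3D11Texture2D *colorTex,
                    GLuint *outFbo, HANDLE *outTexHandle)
{
    HANDLE interopDev = wglDXOpenDeviceNV(d3dDevice);
    GLuint rbo = 0;
    glGenRenderbuffers(1, &rbo);
    glGenFramebuffers(1, outFbo);
    *outTexHandle = wglDXRegisterObjectNV(interopDev, colorTex, rbo,
                                          GL_RENDERBUFFER, WGL_ACCESS_READ_WRITE_NV);
    glBindFramebuffer(GL_FRAMEBUFFER, *outFbo);
    glFramebufferRenderbuffer(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, GL_RENDERBUFFER, rbo);
    return interopDev;
}

// Per frame: render GL into the shared texture, copy it to the backbuffer, Present().
void PresentFrame(HANDLE interopDev, HANDLE texHandle, GLuint fbo,
                  ID3D11DeviceContext *d3dContext, IDXGISwapChain *swapchain,
                  ID3D11Texture2D *colorTex)
{
    wglDXLockObjectsNV(interopDev, 1, &texHandle);
    glBindFramebuffer(GL_FRAMEBUFFER, fbo);
    /* ... normal OpenGL rendering ... */
    glBindFramebuffer(GL_FRAMEBUFFER, 0);
    wglDXUnlockObjectsNV(interopDev, 1, &texHandle);

    ID3D11Texture2D *backbuffer = nullptr;
    swapchain->GetBuffer(0, IID_PPV_ARGS(&backbuffer));
    d3dContext->CopyResource(backbuffer, colorTex);   // copy into the current backbuffer
    backbuffer->Release();
    swapchain->Present(1, 0);                         // vsynced flip-model present
}
```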

mirh commented 1 year ago

> one of the first things I noticed is that OpenGL vsync stutters a lot on SDL2.

Could it be because you are forced to use this awful roundabout? https://learn.microsoft.com/en-us/windows-hardware/drivers/display/rendering-on-a-discrete-gpu-using-cross-adapter-resources#redirected-bitblt-presentation-model AFAICT from my brief PresentMon fiddling, windowed OpenGL (even on desktop cards!) always behaves like that.

> My setup is an Optimus laptop with the primary monitor being 75 Hz attached to the dGPU over HDMI, while the internal display is at 60 Hz.

That seems like a recipe for disaster (or at least for confusion). Do you know what your "main" GPU even is? It could be:

> Apps could theoretically use WGL_NV_DX_interop2 to use a DXGI swapchain with OpenGL to take advantage of this

The guy in https://github.com/ppy/osu/issues/8165#issuecomment-730298043 reported it was indeed working. Now.. the question: is there any way to avoid the GDI-copy step (putting aside exclusive fullscreen, of course)? If not, are you always going to play some perverse game of whack-a-mole (i.e. are you always going to be one? two? very shakily synchronized frames behind?), or is there some set of quirks that can nail it fairly well?

NVIDIA mentions some "minor memory and performance overhead" in their latest drivers, which basically enable this for every program (well, at least on Optimus laptops.. note: don't use anything newer than 526.47 if you want to keep results comparable to your previous ones), but if the results were "functionally" strictly better, you'd really need some hefty hit to justify the native solution IMHO.

PJB3005 commented 1 year ago

> That seems like a recipe for disaster (or at least for confusion). Do you know what your "main" GPU even is? It could be:

The dGPU is wired to the HDMI port, the iGPU to the internal monitor.

Realistically, though, I don't think there's a good fix here other than "stop using WGL to present on Windows". Either IHVs don't care enough to make it usable, or Microsoft makes it impossible to integrate correctly with the display stack. Even Vulkan isn't comparable to DXGI for this kind of presentation stuff.

mirh commented 1 year ago

Indeed, after reading the hell out of this, I'm starting to sway in that direction too. https://www.youtube.com/watch?v=E3wTajGZOsA https://old.reddit.com/r/Windows10/comments/c5ahfc/bug_dx9dx11_games_do_not_enter_independent_flip/ https://www.gamedev.net/forums/topic/679050-how-come-changing-dxgi_swap_chain_descbuffercount-has-no-effect/5294478/

I have been wondering for years and years why DwmFlush would improve the UX for some people (be it that they are more sensitive to stuttering/latency, or just that their computers are somehow less clement), especially considering it's not even a syncing mechanism by itself, only a "wait until nothing is scheduled" throttle. Well, it turns out the limiting is exactly the feature, because DWM composites unshared new frames right after vblank. That is: presentation of any final display frame will still happen that half a millisecond or so before scanout (just like everybody would imagine), but if your legacy application's frame n°2 isn't ready mere moments after frame n°1 has been sent to the screen, you'll already be too late for the compositor to pick it up, and it's going to show up on the monitor as frame n°3.

And that's only going to go smoothly if you still vsync in windowed mode (to whatever proper clock you can find). Sure, you'll have a full extra frame of latency compared to vsync with exclusive fullscreen, but at least it's steady. If you were to keep your game unbound, you are soon going to have a bad time with its presentation time window cyclically shifting up and down (unless you are doing so many hundreds of FPS that the jitter becomes negligible, I guess).

It's unbelievable how shittily documented this fact is (hell, even just figuring out that when you are windowed and composited you shouldn't buffer requires digging through the Mines of Moria), but I guess the quick guidance was always "just call DwmEnableComposition" before Windows 8, and "just switch to our newer DXGI APIs" after that. And among the few OpenGL developers with the competence, care and focus to dedicate to the issue, you'd also have to find one willing to read between the lines of the D3D architecture.

So.. uh, yeah? Long story short, I think we should play nice with the Windows compositor by giving it the buffers it properly expects. Meaning the flip model, and thus the OpenGL interop route (and maybe something like this could still allow keeping the WGL lingo). It's not nice to be second-class citizens.. and just watch the incredible video I linked to see how much we are missing out on.

It also made me realize that there are two totally opposite aims that any one application could want to achieve. One is the Pro Gamer no-compromises, minimum-latency, maximum-power road that I guess most of us intuitively expect. The other is the mobile-friendly, power-conscious path, where you still want everything to be smooth, but you set a (more or less) arbitrary level of "enough performance" and from there try to have the hardware stay asleep as much as possible. But this is a topic for another time, I guess (even though a redesign of presentation would have to pass through these considerations anyway).

TylerGlaiel commented 3 months ago

As discussed in #10160, I have a proof of concept showing that layering OpenGL on a DXGI swapchain is a feasible thing for SDL to do, and that it results in significantly smoother frame pacing plus access to accurate frame timing information. Would love to see this supported for real: https://github.com/TylerGlaiel/SDL-Frame-Pacing-Sample