Open rohanlean opened 2 years ago
Does Vulkan have a way to force the use of double-buffered V-Sync (over triple-buffered)? I remember Vulkan not offering a lot of control in this aspect. See discussion in the pull request where V-Sync options were reimplemented for Vulkan: https://github.com/godotengine/godot/pull/48622
I could not find a mention of double buffering in that discussion, but unlike OpenGL, Vulkan gives the application control over the buffering. There is also a demo that switches the buffering dynamically:
https://github.com/KhronosGroup/Vulkan-Samples/tree/master/samples/performance/swapchain_images
Feel free to open a pull request to implement this feature :slightly_smiling_face:
I had implemented the naive solution, but it turns out that Mesa sets minImageCount to 4 on Wayland and 3 on X11, so this does not work as universally as I had hoped. I think one has to work around it by requesting mailbox mode and scheduling the frames appropriately. I will try to make that work tomorrow. Hopefully Vulkan gives some feedback on when the images are scanned out.
Hopefully Vulkan gives some feedback on when the images are scanned out.
Unfortunately this does not appear to be the case currently. KhronosGroup/Vulkan-Docs#370 already points to this issue. A solution has been in the works for over five years now. :confused:
Hopefully KhronosGroup/Vulkan-Docs#1364 will get there soon. If done right, it should allow the V-Sync and especially the Adaptive V-Sync options to be implemented such that they offer competitive, often superior, latency to Mailbox and V-Sync Off, while exhibiting less stutter and consuming fewer resources.
As I was reading the Vulkan spec and KhronosGroup/Vulkan-Docs#1137, I got the impression that Godot should not request at least 3 images in the swap chain, but minImageCount instead (meaning that it currently wastes one or two images' worth of memory in some cases). This is probably minor, but could perhaps affect performance on some implementations, as the previously linked mobile demo seems to indicate; in that case minImageCount + 1 would probably be a better bet. There appears to be some confusion regarding the swap chain size among the spec, implementors, and users.
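A minimal sketch of what minImageCount-based sizing could look like (`pick_image_count` and its parameters are hypothetical names, not Godot code; the clamping pattern follows the Vulkan surface-capabilities rules, where maxImageCount == 0 means the surface imposes no upper limit):

```cpp
#include <cstdint>

// Hypothetical helper: start from the surface's reported minImageCount
// instead of a hardcoded 3, optionally add spare images, and clamp to
// maxImageCount (where 0 means "no upper limit").
uint32_t pick_image_count(uint32_t min_image_count,
                          uint32_t max_image_count,
                          uint32_t extra_images) {
    uint32_t desired = min_image_count + extra_images;
    if (max_image_count != 0 && desired > max_image_count)
        desired = max_image_count;
    return desired;
}
```

With Mesa on Wayland reporting minImageCount = 4, `pick_image_count(4, 0, 0)` already yields 4 images, which is why the naive "request 2" approach does not work there.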
Unless someone else – with better knowledge of these APIs perhaps – has an idea on how to implement this proposal with what is currently available, I am afraid that it will have to be postponed. 😞
Edit: The VK_KHR_present_wait extension was added to Vulkan last year, and it seems to suffice for a bit more than what I initially asked for. Unfortunately it is not yet supported by Mesa: https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/12086
I would not be able to properly test a PR that I author.
Looks like it's implemented in Mesa under X11, and is coming soon to Mesa under Wayland: https://www.phoronix.com/news/Mesa-KHR_present_wait-Wayland
Note that in situations where you can use VRR, I generally recommend using a FPS cap just below the refresh rate instead. This allows for the lowest possible latency while still not having any tearing, and handling framerate variations better than any other method. Enforcing double-buffered V-Sync will still be useful in situations where you can't use VRR (for instance, when you want black frame insertion from your display or don't want to see any VRR flicker).
Hi!
Since my PR godotengine/godot#80566 will be addressing this for Vulkan (and in theory it should apply to the Metal & D3D12 backends if they make use of the parameter properly), I'll answer a few questions here to avoid derailing my PR's discussion (originally, my PR fixes a synchronization bug):
There are two things that are linked together very tightly but are not exactly the same:
Backbuffer count has been explained since the 90's: consoles like the original NES had a single front buffer, which is always the one being presented to the screen, and the CPU had to iterate through every pixel faster than the pixels were being sent to the CRT scan; otherwise visible tearing would appear.
Then double buffering appeared. The CPU/GPU has all the time in the world to draw to the back buffer. Once it's ready, we must wait for the VBLANK interval and swap the front & back buffers; what was once the front buffer is now the back buffer, and is available for rendering the next frame.
Triple buffering uses 1 front buffer and 2 back buffers, which means the GPU doesn't have to wait for the VBLANK interval: it can start writing into the 2nd back buffer.
The thing about rendering more than one frame is that it means we need double (or triple) of a lot of other things!
It's not just the swapchain. If we send a world matrix for a draw call; we need to store it somewhere in GPU memory so that our vertex shaders can use it.
That means for frame 0 we do vertex_shader.memory[0] = world_matrix, for frame 1 vertex_shader.memory[1] = world_matrix, and if we're doing triple buffering, then for frame 2 vertex_shader.memory[2] = world_matrix.
For frame 3 we must use vertex_shader.memory[0] = world_matrix again. But before that, we must wait (aka stall) for the GPU to finish frame 0; otherwise we could be writing from the CPU to GPU memory that is still in use (aka a race condition). Unless the GPU work is incredibly heavy, chances are that frame 0 is already done by the time the CPU starts frame 3, so the wait returns immediately.
A high swapchain count allows the GPU to continue while the front buffer is still being presented. A high buffer count allows the CPU to continue while the GPU is busy.
If the GPU is too slow, the buffer count is going to matter a lot to unblock the CPU. If the GPU is very fast, swapchain will dominate latency values.
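The per-frame resource cycling described above can be sketched as follows (names are illustrative, not Godot's):

```cpp
#include <cstddef>
#include <cstdint>

// Illustrative sketch: with kNumBuffers copies of CPU-written GPU data,
// frame N writes into slot N % kNumBuffers, but must first wait for the
// GPU to finish frame N - kNumBuffers, which was the last frame to use
// that slot.
constexpr size_t kNumBuffers = 3;

size_t slot_for_frame(size_t frame) {
    return frame % kNumBuffers;
}

// Returns the frame whose GPU work must be finished before reusing the
// slot, or SIZE_MAX if the slot has never been used yet (no wait needed).
size_t frame_to_wait_on(size_t frame) {
    return frame >= kNumBuffers ? frame - kNumBuffers : SIZE_MAX;
}
```

With kNumBuffers = 3, frame 3 reuses slot 0 and therefore waits on frame 0; the deeper the buffering, the older the frame we wait on, which is exactly why more buffers unblock the CPU at the cost of latency.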
I wrote a VERY long text. Then realized I was wrong. Then rewrote it again, and realized I was wrong again. I guess true Nirvana is reached when I realize I know nothing.
In fact based on this, I will have to change the PR to expose both settings separately (right now it is set so that kNumSwapchains = kNumBuffers + 1).
The truth is, this has very complex interactions, so I decided to write a simulator for FIFO presentation instead.
For example the following parameters:
static const size_t kNumBuffers = 2u;
static const size_t kNumSwapchains = 3u;
static const size_t kVBlank = 16u;
static const size_t kCpuTime = 7u;
static const size_t kGpuTime = 17u;
static const size_t kCpuFrameVariance = 2u;
static const size_t kGpuFrameVariance = 2u;
Can be interpreted as the following:
kCpuTime +/- kCpuFrameVariance
kGpuTime +/- kGpuFrameVariance
Results:
Summary:
Total VBLANKs hits = 60; missed = 2
Avg FPS = 61.76
Avg Lag = 44.43; Worst Lag = 49
If we change kNumSwapchains to 2 (double buffer), we get:
Summary:
Total VBLANKs hits = 43; missed = 19
Avg FPS = 46.98
Avg Lag = 52.28; Worst Lag = 79
And if we use kNumBuffers = 3 & kNumSwapchains = 4
Summary:
Total VBLANKs hits = 61; missed = 1
Avg FPS = 63.42
Avg Lag = 53.57; Worst Lag = 59
Avg FPS improved slightly, but avg lag got worse compared to kNumBuffers = 2 & kNumSwapchains = 3
The GPU is struggling to maintain 60 FPS, and triple buffer improved framerate AND lag.
However if we repeat the test with kGpuTime = 12 (that is, between 10 & 14ms):
kNumBuffers = 2u;
kNumSwapchains = 2u;
kVBlank = 16u;
kCpuTime = 7u;
kGpuTime = 12u;
kCpuFrameVariance = 2u;
kGpuFrameVariance = 2u;
Summary:
Total VBLANKs hits = 61; missed = 1
Avg FPS = 62.63
Avg Lag = 36.89; Worst Lag = 46
kNumSwapchains = 3u;
Summary:
Total VBLANKs hits = 61; missed = 1
Avg FPS = 63.61
Avg Lag = 52.07; Worst Lag = 54
Triple buffer improved framerate but made lag much worse.
You can download the snippet and compile it locally and play with the results.
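As a rough back-of-the-envelope model (my assumption, not the simulator's actual logic): under FIFO presentation, once the pipeline is saturated, each extra queued swapchain image adds roughly one vblank interval of latency, which matches the roughly 15 ms gap between the average lag figures above:

```cpp
// Rough steady-state model: when the GPU comfortably meets the refresh
// rate under FIFO, each additional swapchain image sits in the present
// queue for about one vblank interval before being scanned out, adding
// that interval to the displayed latency.
constexpr unsigned kVBlank = 16; // ms, matching the simulator parameter

unsigned approx_added_lag_ms(unsigned extra_swapchain_images) {
    return extra_swapchain_images * kVBlank;
}
```

This is only a first-order approximation; the simulator exists precisely because frame-time variance and missed vblanks make the real behavior messier.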
OK, one thing I left out from the text I removed: the main problem the proposal asks to solve is fighting lag. Forcing double/triple buffering is a way of forcing certain behavior that tends to reduce lag as a side effect.
However if we want really low lag, that can be achieved by measuring frametimes, estimating how long the next frame will take, and sleeping. I talked about this in-depth on Stack Overflow.
The TL;DR is that IF (big if) we can correctly estimate how long rendering the next frame will take, let's say 10ms, then we have to sleep for another 6ms so that we start preparing commands as late as possible.
This allows us to see keystrokes / mouse clicks etc. that happened during those 6ms we slept, which would've otherwise been delayed until the next time the CPU is free.
There is a lot of devil in the details though.
I saw that fighting games like Guilty Gear Xrd took a very silly but good approach: they have a calibration section in the Options and ask the user to press a button to the rhythm. Assuming the system is fast enough to almost always hit VSync, this is a lazy (yet possibly effective) way of calculating how long to sleep.
The plumbing behind that boils down to storing a number and then calling Sleep(saved_number) at the beginning of the frame.
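A minimal sketch of the frametime-estimation approach described above (the exponential moving average and the safety margin are my assumptions, not the exact method from the Stack Overflow answer):

```cpp
#include <algorithm>

// Hypothetical frame scheduler: estimate how long the next frame will
// take from an exponential moving average of past frame times, then
// sleep for whatever is left of the vblank interval (minus a safety
// margin) so that input is sampled as late as possible.
struct FrameScheduler {
    double estimate_ms = 0.0; // EMA of recent frame times
    double alpha = 0.25;      // smoothing factor (assumption)
    double margin_ms = 1.0;   // safety margin against mispredictions

    void record(double frame_ms) {
        estimate_ms = alpha * frame_ms + (1.0 - alpha) * estimate_ms;
    }

    double sleep_ms(double vblank_ms) const {
        return std::max(0.0, vblank_ms - estimate_ms - margin_ms);
    }
};
```

The devil in the details is `margin_ms`: sleep too long and a single slow frame misses the vblank entirely, which is a far worse outcome than the latency the margin costs.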
Thanks for the great writeup and simulator :slightly_smiling_face:
However if we want really low lag, that can be achieved by measuring frametimes, estimating how long the next frame will take, and sleeping. I talked about this in-depth on Stack Overflow
I wonder how this relates to frame delta smoothing (https://github.com/godotengine/godot/pull/52314). Can a similar estimation logic be used?
"Yesn't".
One would have to see if the logic is useful/reusable, but really the hard part is that we need to measure:
That's why the Guilty Gear Xrd solution is so stupidly simple: since fighting games have a very stable framerate (they display the same two characters throughout the entire session, with the same background, in a controlled scenario), they can just ask the user what feels right until the user manually finds the right amount of time to sleep per frame.
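In the spirit of that calibration approach, the stored number could simply be an average of how early the user's presses land relative to the beat (a sketch under my own assumptions, not the game's actual code):

```cpp
#include <numeric>
#include <vector>

// Hypothetical calibration sketch: average the measured offsets (in ms)
// between the user's button presses and the target beat, and use that
// as the per-frame sleep time stored by the calibration screen.
double calibrate_sleep_ms(const std::vector<double>& press_offsets_ms) {
    if (press_offsets_ms.empty())
        return 0.0;
    double sum = std::accumulate(press_offsets_ms.begin(),
                                 press_offsets_ms.end(), 0.0);
    return sum / static_cast<double>(press_offsets_ms.size());
}
```

The point is that no frametime measurement machinery is needed at runtime; the user's own perception does the estimation once, up front.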
The VSync simulator is now interactive and online.
https://github.com/godotengine/godot/pull/87340 has added the options rendering/rendering_device/vsync/frame_queue_size and rendering/rendering_device/vsync/swapchain_image_count.
Describe the project you are working on
Nothing so far, just having a look at Godot. 😃
Describe the problem or limitation you are having in your project
In some scenarios none of the currently exposed presentation strategies is both jitter-free and low-latency.
Describe the feature / enhancement and how it helps to overcome the problem or limitation
If the preparation of a frame consistently takes less time than the refresh interval of the display, then double-buffered vsync saves one refresh interval of latency over the currently offered triple-buffered vsync. Unlike the immediate and mailbox presentation modes, vsync has consistent timing and therefore less jitter. With frame scheduling it can sometimes have better latency than those as well, depending on the variance of the frame time. Compared to other non-vsync modes it reduces power consumption and component wear.
Describe how your proposal will work, with code, pseudo-code, mock-ups, and/or diagrams
For example, a VSYNC_DOUBLE_BUFFERED enumerator for DisplayServer::VSyncMode.
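A sketch of how the enumerator could sit alongside the existing modes (the first four names match the current DisplayServer::VSyncMode values; the new entry and the mapping comments are assumptions on my part):

```cpp
// Sketch: proposed enumerator next to Godot's existing V-Sync modes.
// The comments describe the corresponding Vulkan present modes; the new
// entry would request FIFO with the minimal (double-buffered) image count.
enum VSyncMode {
    VSYNC_DISABLED,        // immediate presentation (tearing possible)
    VSYNC_ENABLED,         // FIFO, driver-chosen image count
    VSYNC_ADAPTIVE,        // FIFO_RELAXED where available
    VSYNC_MAILBOX,         // mailbox presentation
    VSYNC_DOUBLE_BUFFERED, // proposed: FIFO, minimal swapchain size
};
```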
If this enhancement will not be used often, can it be worked around with a few lines of script?
No
Is there a reason why this should be core and not an add-on in the asset library?
Configuration of the swap chain is handled by core, and cannot be done elsewhere.