krOoze / Hello_Triangle

Hello World like demo for Vulkan API
http://vulkan.hys.cz/
The Unlicense
138 stars 16 forks source link

Why MaxFlightNumb should be always 2? #1

Closed ChanLee1123 closed 4 years ago

ChanLee1123 commented 5 years ago

Hi @krOoze. I'm referring to your code for my vulkan engine. Always Thank you for your good resource.

I assume that Your tutorial code is using triple buffering.

I have questions for these lines https://github.com/krOoze/Hello_Triangle/blob/3a579d18ec333efaf646a7f10ec597a3fed18147/src/HelloTriangle.cpp#L365-L370

I want to know why more than 2 does not make sense. Because, When you have 3 swapchain images, I think you can submit 3 images, which is achieved by vkQueueSubmit() limited by 3 fences respectively. So, I think we should always create the same number of sync objects as the number of the swapchain images.

In addition, when you create sync object, https://github.com/krOoze/Hello_Triangle/blob/3a579d18ec333efaf646a7f10ec597a3fed18147/src/HelloTriangle.cpp#L453-L457

You just created only one renderDoneSemaphore for sync object. And this makes me confused with the triple buffering system with Vulkan. Because when you try to submit one image, You will construct one set of sync objects, command buffers. So, I think you have to use another semaphore when you try to submit the next image.

Specifically, I mean that when you do vkQueueSubmit, you are submitting a combination of command buffers and sync objects to the GPU queue. So, I think the driver and the GPU will use the set of command buffers and sync objects to process rendering process and synchronization.

That's the reason why I think We should have the same number of renderingDone semaphore as the swapchain image count.

I pushed my version on my fork. You can view the changes. https://github.com/ChanLee1123/Hello_Triangle/commit/a60f966ae9a0e474a98d6d8c73f810d3029eb547

Thanks in advance!

ChanLee1123 commented 5 years ago

In addition to the question above, the reason why I ask the question is that, I got validation error on Galaxy mobile device with ARM gpu , when I applied the sync strategy which i mentioned.

ERROR: [Validation] Code 383778914 :  [ VUID-vkBeginCommandBuffer-commandBuffer-00049 ] 
Object: 0x7e0afa3d78 (Type = 6) | Calling vkBeginCommandBuffer() on active command buffer 
7e0afa3d78 before it has completed. You must check command buffer fence before this call. The spec
 valid usage text states 'commandBuffer must not be in the recording or pending state.' 
(https://www.khronos.org/registry/vulkan/specs/1.0/html/vkspec.html#VUID-
vkBeginCommandBuffer-commandBuffer-00049)

when the present mode of the swapchain is VK_PRESENT_MODE_MAILBOX_KHR, the validation layer gives me the error. But when I use VK_PRESENT_MODE_FIFO_KHR, it doesn't. I know VK_PRESET_MODE_FIFO_KHR is the stable way to present image on vulkan, But I don't know why the VK_PRESENT_MODE_MAILBOX_KHR makes some validation error.

krOoze commented 5 years ago

Oh thanks, that's kind of you to say.

Well firstly, I ask for capabilities.minImageCount + 1 images, which guarantees (at least) two acquirable images.

Secondly, this is not an interactive app on account of being the basicest of demos. But I somewhat assume it is. Going beyond two frames of input lag feels perverse given it would start to equal to the internet latency.

Thirdly, 3 or more sounds unrealistic. That assumes work two frames back is still being executed while the most recent frame is also being rendered. That would be weird. Meanwhile it would require more duplicates of the resources (on the account of this being the simplest of apps, I only need to duplicate the Semaphores, but it might be a problem for a more complex app)

That's the reason why I think We should have the same number of renderingDone semaphore as the swapchain image count.

Not really. There are two concerns.

Resources directly tied to swapchain images do need to be duplicated to swapchain image count. I.e. VkFramebuffer and anything using that.

The second concern is to prevent sync hazards. Those are prevented by the Fence waits, and so need to be only duplicated to the Fence count.

Of course if you make Fence count equal Swapchain Image count, then that is one and the same thing.

So, I think you have to use another semaphore when you try to submit the next image.

Well I may have been too smart here. Definitely would warrant some comments.

imageReadyS is duplicated, because it is conceavable vkAcquireNextImageKHR is called on it twice without it being unsignaled inbetween by the vkQueueSubmit workload. I.e. vkAcquireNextImageKHR Semaphore signaling does not guarantee all previous work in the queue has finished.

renderDoneS is not duplicated, because vkQueueSubmit Semaphore signaling does guarantee all previous work in the queue has finished. That way I know the Semaphore cannot be signaled before it is unsignaled first be the Present.

At least that is my best current interpretaion of the specification. Although Swapchain synchronization is still an open issue.

Calling vkBeginCommandBuffer() on active command buffer

hm, that error seems weird to me considering vkBeginCommandBuffer is called only on swapchain recreation, which should be protected by vkDeviceWaitIdle. Would you mind getting a VK_LAYER_LUNARG_api_dump. Also what is your SDK version (I see you deleted the version check)?

BTW I have not really tested VK_PRESENT_MODE_MAILBOX_KHR. I should try that...

krOoze commented 5 years ago

PS: I get no validation errors with neither your code, nor mine configured to MAILBOX. My SDK is the recent 1.1.121.2.

ChanLee1123 commented 5 years ago

I appreciate you for replying my questions.


Secondly, this is not an interactive app on account of being the basicest of demos. But I somewhat assume it is. Going beyond two frames of input lag feels perverse given it would start to equal to the internet latency.

Thirdly, 3 or more sounds unrealistic. That assumes work two frames back is still being executed while the most recent frame is also being rendered. That would be weird. Meanwhile it would require more duplicates of the resources (on the account of this being the simplest of apps, I only need to duplicate the Semaphores, but it might be a problem for a more complex app)

I'm persuaded to think like what you said now. I thought it will be just okay for an interactive application to render a few of next frames in advance for user. However, If it does, then It will be not the interactive application.

But, Now I doubt the need of Triple Buffering System. Because As you said, we need to sync rendered images with user input, I think just double buffering is enough. Because I think the time line of normal game loop is :

1. Take Input
2. Update States
3. Render To Image
4. Present Image to Display

And It means that we have to wait until the presentation engine present an image to the display. As soon as it finishes presenting, we can start next frame. However again, I found one Youtube video about various sync strategy for games.

https://en.wikipedia.org/wiki/Multiple_buffering#/media/File:Comparison_double_triple_buffering.svg https://www.youtube.com/watch?v=L07t_mY2LEU&feature=youtu.be

After watching the resource and the videos, I think 3 or more frames works for the interactive application such as games on PC that users want to put the input fast, since the application should reduce the input lag and tearing issue and show high FPS. I mean a human can't handle totally frame rendering very fast with his/her own input. So, the 3 or more frames will work with the game with MAILBOX mode where the presentation engine selects only the swapchain image rendered at the time closest to the vertical sync time. So, I conclude that applications should consider its own use and the hardware platform which it resides on, in order to establish sync strategy for presenting images.

Actually, I referred to this resource in early days of studying Vulkan. 2-Vulkan-Case-Study.pdf

The resource says that the application is using 3 fence for triple buffering with FIFO mode on the swapchain part. However, what you said to me, the 2 frames make sense only on the present mode FIFO.


Well I may have been too smart here. Definitely would warrant some comments.

imageReadyS is duplicated, because it is conceavable vkAcquireNextImageKHR is called on it twice without it being unsignaled inbetween by the vkQueueSubmit workload. I.e. vkAcquireNextImageKHR Semaphore signaling does not guarantee all previous work in the queue has finished.

renderDoneS is not duplicated, because vkQueueSubmit Semaphore signaling does guarantee all previous work in the queue has finished. That way I know the Semaphore cannot be signaled before it is unsignaled first be the Present.

Yes. it's right In the case of you using only 2 frames. Now I can know how to use sync objects in the case. So, I think your demo application is aiming at the stable and precise rendering with only 2 frames. And I think the sync structure you use is also related to the vulkan issue on https://github.com/KhronosGroup/Vulkan-Docs/issues/370.


hm, that error seems weird to me considering vkBeginCommandBuffer is called only on swapchain recreation, which should be protected by vkDeviceWaitIdle. Would you mind getting a VK_LAYER_LUNARG_api_dump. Also what is your SDK version (I see you deleted the version check)?

I'm sorry that I just used the strategy which i mentioned on my engine. I will test your code on my device and show the result to you. my engine structure is recording commands every frame. So, If I can modify the structure of your code to make it matched with my engine, I will share the code on my fork.

krOoze commented 5 years ago

But, Now I doubt the need of Triple Buffering System. Because As you said, we need to sync rendered images with user input, I think just double buffering is enough.

Well, traditionally tripple buffering is supposed to give slightly better throughput. Double buffering slightly better latency.

Not sure if it is such a problem in Vulkan where we are bit more explicit. If I ask for an extra Image, then I simply reduce the chance that vkAcquireNextImageKHR makes me wait for the image (by blocking or via the semaphore). Meanwhile I could do some tricks with the Image that do not need input. BTW, I could synchronize Input only with a Semaphore (the Fence would be needed only if I want to reuse some specific buffer which the GPU might already be using).

Anyway, I digress. One thing I should have mentioned first is this code is not "architecturally" correct. It is a learning aid to expose people to the Vulkan commands and their purpose. More than anything the Fences are a stopgap measure for the Issue mentioned in the comments (perceived "memory leak"), rather than establishing some specific buffering scheme.

After watching the resource and the videos, I think 3 or more frames works for the interactive application such as games on PC that users want to put the input fast, since the application should reduce the input lag and tearing issue and show high FPS.

Well, IMO less latency is always better (for certain games). I can buy a new GPU to get better FPS. As a consumer, no money can buy me better latency.

Actually, I referred to this resource in early days of studying Vulkan. 2-Vulkan-Case-Study.pdf

Careful with resources from 2016. Also note it refers primarily to Android. Touchscreen probably already costs massive latency, so might as well not care.

AIS though my fence use does not imply anything about the achitecture of a "good" app.

However, what you said to me, the 2 frames make sense only on the present mode FIFO.

IMO it makes sense universally. If two frames old image is still being rendered, that is weird and I am stopping that with the Fence wait and letting the GPU catch up with the CPU.

That does not automatically mean though the slides are wrong. Perhaps they are solving a different concern. Notice they reset the Command Buffer each frame. In order to do so correctly they must protect that Command Buffer with a Fence (to be sure it is not active, while it is being reset).

Yes. it's right In the case of you using only 2 frames. Now I can know how to use sync objects in the case.

I may be misleading you. This really needs to be fixed in the specification. I think I will duplicate the second semaphore too to be safe.

And I think the sync structure you use is also related to the vulkan issue on KhronosGroup/Vulkan-Docs#370.

I am not really solving anything about timing the presentation. Nor (as the Issue says) is there reliable way to do it.

I would like to imprint on you that the vanilla Vulkan Fences or Semaphore syncing says nothing about when stuff actually appear on screen. What the driver is allowed to do is copy the Swapchain image, and so you have no way to infer when your image appeared on screen. I can only know Vulkan is done with that particular VkImage and so I can use it.

my engine structure is recording commands every frame.

Right, if you do that you need to protect that command buffer state. It must not be active while you reset it. I commented on this above wrt the Samsung slides.

ChanLee1123 commented 5 years ago

Anyway, I digress. One thing I should have mentioned first is this code is not "architecturally" correct. It is a learning aid to expose people to the Vulkan commands and their purpose. More than anything the Fences are a stopgap measure for the Issue mentioned in the comments (perceived "memory leak"), rather than establishing some specific buffering scheme.

Okay. It's the basic tutorial of Vulkan. And it just prevents memory leak as a temporary way.

Well, IMO less latency is always better (for certain games). I can buy a new GPU to get better FPS. As a consumer, no money can buy me better latency.

I was confused with those. I think what you said is right now.

IMO it makes sense universally. If two frames old image is still being rendered, that is weird and I am stopping that with the Fence wait and letting the GPU catch up with the CPU.

Yes now I agree it with the FIFO mode. But, What about MAIL_BOX mode?

I may be misleading you. This really needs to be fixed in the specification. I think I will duplicate the second semaphore too to be safe.

Could you indicate the exact content related to this Vulkan issue on the github issue you raised?

I am not really solving anything about timing the presentation. Nor (as the Issue says) is there reliable way to do it.

I would like to imprint on you that the vanilla Vulkan Fences or Semaphore syncing says nothing about when stuff actually appear on screen. What the driver is allowed to do is copy the Swapchain image, and so you have no way to infer when your image appeared on screen. I can only know Vulkan is done with that particular VkImage and so I can use it.

Okay. I understand the context now. So, according to the spec, the safe way to render and present an image is using two frames by not dismissing old images.

krOoze commented 5 years ago

Yes now I agree it with the FIFO mode. But, What about MAIL_BOX mode?

I would say it is valid POV for any present mode. If I restate it another way: What I am doing starting rendering another frame, if frame two submissions back is not even finished and ready for presentation?

I mean the GPU should be already fed enough. It has the rest of the old frame to finish, and it has the whole previous frame to finish. The CPU is too far ahead of the GPU if it wants to start another frame in that situation.

Besides, I get bit of a consistency across platforms. My platform has minImage = 2, so I ask for 3 swapchain images. It is platform dependent whether I get the 3rd image (without blocking). I should be able to get the 2nd image with minimal blocking, the 1st image should be available without any blocking (at least with some common sense, the presentation has no good reason to hold on to it). With the fences I know it will block either way on the 3rd (and I never ask for more than 2 at a time).

Could you indicate the exact content related to this Vulkan issue on the github issue you raised?

I don't want to bore you too much with spec issues. There is some fix from two days ago in 1.1.124, which defines a "signal operation order". I haven't had the chance to read it yet (and poke logical holes in it 😈 ).

The original issues are: https://github.com/KhronosGroup/Vulkan-Docs/issues/152 https://github.com/KhronosGroup/Vulkan-Docs/issues/595 https://github.com/KhronosGroup/Vulkan-Docs/pull/785

Okay. I understand the context now. So, according to the spec, the safe way to render and present an image is using two frames by not dismissing old images.

I am not exactly sure what you are saying.

I think the spec is not specific enough to say anything of the sort. It has to match various platforms and situations. So very varied behaviour is allowed in the spec, and any guarantees given are quite minimalistic.

PS: me using maximum of 2 frames "in-flight" does not limit FPS in any way. Even in MAILBOX it runs at like 4000 FPS and so bunch of frames would be dismissed in that presentation mode.

krOoze commented 4 years ago

Thanks again for the Issue!

So I re-verified my synch scheme and documented it a bit in doc/Schema.pdf

According to my best interpretation of the spec, I think my code is mostly conformant at least.

I had to handle some obscure corner-cases though:

As for the number of in-flight frames, and independence from swapchain images:

With that I will be closing the Issue. Please get back to me with any remaining concerns.

krOoze commented 4 years ago

You just created only one renderDoneSemaphore for sync object.

FYI, I have updated renderDoneSemaphore to one-per-swapchain image in response to information I have been given at https://github.com/KhronosGroup/Vulkan-Docs/issues/1150. I previously dismissed such extreme interpretation there, because even the reference implementation vkcube does not seem to honor it.

ChanLee1123 commented 4 years ago

Thank you for your great work to improve and interpret Vulkan spec. I'm learning a lot from you.

I want to say something unrelated to this issue. If you don't want, I can delete this comment. So If you want, tell me.

While looking through some of Vulkan Docs Issues, especially acquire-submit-present with fence and semaphore, I've started to doubt the usage of Vulkan. I know it's in its early development, but lots of things are uncertain and difficult to read and understand for lots of users. Also, The parts of semaphores and fences with acquire-submit-present are the things where even Khronos developers sway on their own complexity.

Anyway, I will also try to contribute to Vulkan If I can, while looking at your hard work on Vulkan. Have a good day.

krOoze commented 4 years ago

@ChanLee1123 That's fair.

The sync system used to be in worse state in the spec. Hopefully this is only the last vestige of the old problem. This specific problem here has a relevant spec Issue from 2016, so that is inexcusable. But still.

Also the WSI design is little bit a design by resignation. The OS API it needs to use can be equally if not more riddiculous (and sometimes ooold). And last time they tried something more integrated and sane, they got burned (MS let OpenGL rot at version 1.1). It is bit of a shame, because those shiny timeline semaphores were released, but they cannot even be used with WSI. And we do not even have well-defined timing of the present (access to VSync event).

Let's make one thing clear though. Vulkan is something like a HAL; an API for writing more abstract middleware. And, well, reading specifications can be hard (should be formal\"certain" though to a technical reader, for which we are sacrificing the ease of reading). There are some users drawn to Vulkan, then would want to "dumb it down" a bit. But that is simply not the usecase for Vulkan. Such users simply need an engine like Unreal or Unity. Vulkan is exactly the lowest part that worries only about pushing the bits, rather than about graphics.

ChanLee1123 commented 4 years ago

Hi @krOoze. I appreciate your contribution to Vulkan API specification and samples.

I'm here again to recall what you answered, but I got questions about command buffer usages.

The simple version of my question is : Why should I use the same number of command buffer as the number of swapchain images? Why can't I use just only one command buffer for multiple swapchain images?

The verbose version of my question is : your sample is recording all of the command buffers before rendering. because you know what should be rendered. But normally, a renderer does not know what it should render. So, It may record differently each frame, but will use the framebuffer which is selected by the index of vkAcquireNextImageIndexKHR. So assume that you will record the command buffer dynamically, which means you will record differently from the previous frame.

So In this situation, I thought I can use only one command buffer to submit it for multiple frames. the reason why I think like that is I will record the command buffer after vkAcquireNextImageIndexKHR, which let me know the gpu finished using the previous command buffer with the fence. Because the gpu finished using the command buffer, I think I can resue the command buffer. But when I call vkBeginCommandBuffer with VK_COMMAND_BUFFER_USAGE_SIMULTANEOUS_USE_BIT flag, it outputs validation layer error, (I'm using Vulkan 1.1 version)

[ VUID-vkBeginCommandBuffer-commandBuffer-00049 ] 
Object: 0x135bb906828 (Type = 6) | 
Calling vkBeginCommandBuffer() on active command buffer 0x135bb906828 before it has completed. 
You must check command buffer fence before this call. 
The Vulkan spec states: commandBuffer must not be in the recording or pending state. 
(https://www.khronos.org/registry/vulkan/specs/1.1-extensions/html/vkspec.html#VUID-vkBeginCommandBuffer-commandBuffer-00049)

Am I still being confused with the synchronization of rendering? I will appreciate you if you let me know about it.

krOoze commented 4 years ago

Why should I use the same number of command buffer as the number of swapchain images?

That's simple. Because Framebuffer is coupled with the command buffer via vkCmdBeginRenderPass. It follows that for each Swapchain Image\Framebuffer you need a different command buffer that is capable of rendering into it.

Or, as probably a more complex app would do, you can re-record a command buffer during the frame. In which case it is not coupled to a specific framebuffer. But you still must track which command buffer to reuse (which is up to you how you do it). I think the obvious way to do it would be to have maxInflightSubmissions number of command buffers and by using the associated Fence you know you could reset and reuse the buffer.

Am I still being confused with the synchronization of rendering?

No, I think you just forgot maxInflightSubmissions is 2 and not 1. If you have only one command buffer, you cannot reset it, because it might still be used in the previous frame. With maxInflightSubmissions == 2 the associated Fence checks the submission two frames back finished, not that the previous frame finished (for that the value would need to be 1).

PS: You talk about vkAcquireNextImageIndexKHR, and not sure if you mean that Fence. In that case it is important to note that the fence from vkAcquireNextImageIndexKHR does not imply all previous submissions have finished. It only implies some swapchain image is available for you.

ChanLee1123 commented 4 years ago

ah you really made a good point for me to understand Vulkan more. Thank you!!

I have one more question about buffers. And it's not involved in your sample. (sorry for bothering you) But this question is also related to the render loop. I can't make my question simple, so I should suggest a situation for it.

Assume that on every frame, I'm doing something for a buffer and then recording a command buffer of the frame with vkCmdBindDescriptorSets which uses the buffer.

The buffer I'm talking about is used to upload data from cpu to gpu for each frame, such as uniform buffer or skinned animation data. and I'm creating the buffer with the memory allocation flag VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT or VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_CACHED_BIT.

I get the cpu pointer from vkMapMemory of a buffer with the above second flag and copy my data into the pointer and flush data with vkUnmapMemory. What makes me confused is that I can do this with one buffer every frame. This situation does not output any validation error. If I try to delete the buffer at the second frame which is being used in the first frame, I can get the validation error. So I think uploading data every frame should also output validation error. But because It's not giving me any validation error, I think I don't know about this well.

What I think is that, because the buffer is being used in the previous frame, I think I can't update the buffer for each frame, which means specifically uploading the data from cpu to gpu. So, I thought I need at least maxInflightSubmissions number of buffers to update data into the GPU appropriately. I'd like to ask you about these buffer things

krOoze commented 4 years ago

vkUnmapMemory does not flush data. Only VK_MEMORY_PROPERTY_HOST_COHERENT_BIT or vkFlush* flushes data. Unmap just undoes vkMapMemory, which might be an expensive operation. So keep a resource mapped if you ever plan to use it again from host.

It does not show validation error, because layers cannot intercept writes to a pointer; they can only intercept command calls. The layers have no way to form a sync proof there, so you need to be careful. And vkMap* and vkUnmap in of themselves do not count as memory access. Buffer destruction (well, the Memory object deallocation anyway) can be be intercepted by layers, and for all intents and purposes it can be counted as a write access, so you would get an error.

ChanLee1123 commented 4 years ago

Oh I didn't know about the vkMapMemory, vkUnmapMemory well. Because It's only about memory write access, the layer does not say anything about it.

I cleared what made me confused about updating a buffer. So, I know I should have two buffers or one buffer with two times size memory to prevent memory write while the gpu is reading it, if I only use it for all frames. But I start to think about this specification part, http://vulkan.hys.cz/chunked_smart/vkspec.html#synchronization-submission-host-writes. At first, I thought prior host operations may be something that I copy the modified data into the mapped memory pointer. But I think it's not. Can you let me know what the 'prior host operations' is? and why the Vulkan specification is saying it for the synchronization between the 'prior host operations' and commands submitted?

Not related to this topic, I sent some sweet sweet money for your Vulkan work!

krOoze commented 4 years ago

Ah, thank you very much for the lovely currency. Much appreciated!

Right, map and unmap just adds or removes the memory from your virtual address space. It tells layers nothing about if you actually read or written the memory. Without some invasive OS debug feature it is impossible to know if the user of the API written into a pointer.

Yea, "prior host operation" means the writes to the mapped pointer, or as the spec later says:

The first access scope includes all host writes to mappable device memory that are available to the host memory domain.

It is one of the few exceptions to explicit Vulkan synchronization. In practice what is says is, that if you write the mapped pointer, then on the device side (device domain) it is already aware of your modifications for any subsequent call of vkSubmit (and so you do not need to use a pipeline barrier for the writes via the pointer).

That is in contrast with the situation where you first call vkSubmit (e.g. you instead block it with a timeline semaphore) and only then write into the mapped pointer. In which case the device domain is not aware of your modifications, and you do have to use a pipeline barrier with VK_PIPELINE_STAGE_HOST_BIT before you read those modification on the device.