GPU API Feedback - Githubissues

darksylinc commented 2 years ago

Hi!

I saw the GPU API push and have a bit of feedback to provide:

Unsupported vertex formats

SDL_GPUVERTFMT_CHAR
SDL_GPUVERTFMT_UCHAR
SDL_GPUVERTFMT_UCHAR3
SDL_GPUVERTFMT_CHAR3
SDL_GPUVERTFMT_USHORT3
SDL_GPUVERTFMT_SHORT3
SDL_GPUVERTFMT_HALF3

Get rid of them (including the NORMALIZED variants). GPU support is flaky for them (or non-existent) and requires expensive emulation, which is hard to get right due to alignment issues.

Merging format enums?

typedef enum SDL_GpuVertexFormat
typedef enum SDL_GpuPixelFormat

The trend is to unify these two; since supported GpuVertexFormat is just a subset of GpuPixelFormat

But personally I'm not decided for either option. They have pros and cons. From a low API perspective, unification makes sense. At a higher level? I don't know.

Format enum

Vulkan, DXGI and MTLPixelFormat formats are awfully similar.

In my experience this has been the best way to encode formats and intention (compared to alternatives like e.g. storing bits and shifts like DDS did back in the D3D7/DirectDraw days)

Personally I prefer the short versions, e.g. RGBA8_UNORM is just better than RGBA8888_UNORM when all formats have the same bit. Likewise RGB10A2 over RGBA1010102.

DXGI uses TYPELESS to indicate the format can be reinterpreted. Vulkan & Metal have no such format. In OgreNext we approach that by having a getFamily function:

SDL_GpuPixelFormat getFamily( SDL_GpuPixelFormat f );

This format returns the same format for all the variations, in order to simplify many operations (e.g. RGBA8_UNORM and RGBA8 are exactly the same in all ways except in how the data is returned when viewed by a shader).

More flags

typedef enum SDL_GpuTextureUsage
{
    SDL_GPUTEXUSAGE_SHADERREAD,
    SDL_GPUTEXUSAGE_SHADERWRITE,
    SDL_GPUTEXUSAGE_RENDERTARGET
} SDL_GpuTextureUsage;

You'll want to include a few more flags such as:

SDL_GPUTEXUSAGE_MIPMAPGEN. When user wants SDL to generate Mipmaps, on D3D11 this means RENDERTARGET flag must be added so D3D11's GenerateMipmaps can be used. But on D3D12, it means SHADERWRITE must be added so a compute shader can generate the mipmaps.
SDL_GPUTEXUSAGE_NOT_SAMPLED

Almost all textures want to be sampled, but there's a few where the user intends to use only read and write operations, without sampling. Hence a flag with negation to indicate sampling is not necessary. Or alternatively add SDL_GPUTEXUSAGE_SAMPLED and have the user almost always add this flag.

Although generally flags should express positive terms instead of negation; In terms of user friendliness I think this is an exception, i.e. SDL_GPUTEXUSAGE_NOT_SAMPLED makes sense over SDL_GPUTEXUSAGE_SAMPLED.

SDL_GPUTEXUSAGE_CUBE_ARRAY_2D

Indicates the CubeMap or CubemapArray wants to also be used as ARRAY_2D.

SDL_GPUTEXUSAGE_REINTERPRETABLE

Indicates the texture will later be reinterpreted as another format.

More Texture types

    SDL_GPUTEXTYPE_1D,
    SDL_GPUTEXTYPE_2D,
    SDL_GPUTEXTYPE_CUBE,
    SDL_GPUTEXTYPE_3D

There's a few types missing, like CUBE_ARRAY, 2D_ARRAY, etc.

Being very clear about 'depth'

typedef struct SDL_GpuTextureDescription
{
    // ...
    Uint32 width;
    Uint32 height;
    Uint32 depth;
    Uint32 mipmap_levels;
    // ...
}

Documentation needs to be very clear about how depth is interpreted:

When SDL_GPUTEXTYPE_CUBE, depth must be 6 or 1?
- Personal recommendation: Depth must be 6 (see next item)
When SDL_GPUTEXTYPE_CUBE_ARRAY, depth must be multiple of 6? or should depth be 1, 2, 3 and is internally multiplied by 6?
- Personal recommendation: Depth must be multiple of 6 (see next item)
When SDL_GPUTEXTYPE_CUBE_ARRAY but the texture is reinterpreted to SDL_GPUTEXTYPE_2D_ARRAY and depth was 1 but internally multiplied by 6, is depth now 6x on the reinterpreted texture?
- This is why I recommend to be multiple of 6 for SDL_GPUTEXTYPE_CUBE_*. When reinterpreted to SDL_GPUTEXTYPE_2D_ARRAY, all code is consistent and there's no surprise. There's no need for special handling.

I also recommend to rename depth to depthOrSlices. Its interpretation depends on whether the texture is of type SDL_GPUTEXTYPE_3D or not.

What's the difference between depth and slice?

depth is halved when going to the next mip.
slice is constant for all mips.

OpenGL

OpenGL will by far give you the biggest headaches and constraints in your API:

OpenGL does not support samplers being separate from textures at the shader level
- This is hard to emulate. You have to analyze the shader and find all pair combinations of \<texture, sampler> and then bind them all potentially binding the same texture multiple times. Then generate a GLSL shader that remaps all of that. So if the original shader uses 5 textures and 3 samplers, at best you will have to bind 5 textures (w/ 5 samplers) and at worst case 15 textures (w/ samplers)
OpenGL threading support is bad. Really bad. The biggest problem in this area is when trying to map a GPU buffer from multiple threads. The best solution is to just disallow this.
- One possible solution is to run GL in its own thread (aka RHI thread as Unreal likes to call it)
- All API calls from any thread are submitted/redirected to the RHI thread
- Some API calls like Draw commands can return immediately, other API calls will stall until the RHI sees it and returns data (such as when mapping buffers)
- For performance this may mean some calls end up returning "OK" but later fail when the RHI sees it and tries to execute it. SDL could have a debug flag to wait for every RHI calls, and should be off in Release.

One-time mipmaps

There's mainly 2 cases for generating mipmaps on the fly:

We're rendering to a texture, and need mipmaps for it.
- We'll be calling this often (e.g. once per frame or more)
Textures are loading from disk. They were not bundled with mipmaps.
- We'll be generating mipmaps once. Then never again.

The first case is not an issue. The second case is. The problem is that the regular code:

Create texture with SDL_GPUTEXUSAGE_MIPMAPGEN
Load texture from disk
Call SDL_GpuGenerateMipmaps

is inefficient. Why? Because the SDL_GPUTEXUSAGE_MIPMAPGEN flag will stay forever which means the texture is internally either RENDERTARGET or SHADERWRITE when it doesn't have to, disabling lots of internal GPU optimizations.

The easiest way to achieve this (cross platform):

Generate tmp texture
Load data into tmp texture
Generate mipmaps into tmp
Copy results into final texture
Destroy tmp (or keep it cached for reuse)

An API that is easy to use could provide this convenience, by handling this TMP texture internally when loading data from disk.

SDL_GpuPipelineDescription's attachments

SDL_GpuPipelineDescription contains SDL_GpuColorAttachmentDescription, which contains a SDL_GpuTexture.

This is wrong. What's needed in the Pipeline desc is its metadata (format, type, msaa settings, etc; except resolution). A structure should contain this data, and could be obtained via API:

SDL_GpuTextureMetadata *metadata = SDL_GetMetadata( SDL_GpuTexture *texture ) Actually SDL_GpuTextureMetadata = SDL_GpuTextureDescription but without the char *name.

or alternatively (perhaps this is even better):

SDL_GpuColorAttachmentDescriptionMeta *metaDescriptor = SDL_GetMeta( SDL_GpuColorAttachmentDescription *descriptor )

amerkoleci commented 2 years ago

Hi, I would like to give my feedback as well, why separation of depth stencil state? This would work on metal but d3d12/vulkan wants all states for graphics pipeline creation, otherwise you'll need to cache - create pipelines on demand which can result on gpu slutter, so probably SDL_GpuPipelineDescription needs all members necessary to create vulkan/d3d12 pipeline.

amerkoleci commented 2 years ago

Another topic I would like to bring attention is: Memory allocation, what strategy would be used? Or existing D3D12MA/VMA?

icculus commented 2 years ago

why separation of depth stencil state? This would work on metal

Metal was my starting point for designing this, specifically because I think it's the most app-friendly (at least in the API concepts, if we're willing to avoid debates about Metal Shader Language and Objective-C). But as @darksylinc's detailed notes also point out, there are several small things in Metal that are going to cause headaches elsewhere, and this is probably also a good example of that.

Some of these might be worth changing, some might be worth faking on this platform or that, and some are worth putting in the trash. Definitely point these things out and we'll make changes to the API as we go. We are in extremely early times for this project.

Memory allocation, what strategy would be used? Or existing D3D12MA/VMA?

Probably something like VMA behind the scenes, but my sincere hope is to hide this from the app if I can get away with it.

icculus commented 2 years ago

Unsupported vertex formats

Removed! This remaining list honestly still feels like it could lose a bunch more formats and no one would miss them, too.

The trend is to unify these two; since supported GpuVertexFormat is just a subset of GpuPixelFormat

But these aren't the same thing at all...? Maybe I'm misunderstanding, but if/when we add compressed texture formats, we're definitely going to need these to be separate, as they won't just be packed arrays of raw pixels anymore.

(I'll go through the rest of your comments in the morning, just didn't want you to think I was ignoring you!)

darksylinc commented 2 years ago

But these aren't the same thing at all...? Maybe I'm misunderstanding, but if/when we add compressed texture formats, we're definitely going to need these to be separate, as they won't just be packed arrays of raw pixels anymore.

I've been giving more thought about it and I agree it makes sense to keep them separate. That makes it less error prone, otherwise you can feed a GpuPixelFormat that is not a valid GpuVertexFormat.

From a low level perspective it makes sense, since like I said GpuVertexFormat is a subset of GpuPixelFormat thus it simplifies code (e.g. same conversion routines that work for RGBA8_UNORM work for UCHAR4_UNORM) and allows for potential expansion in the future (i.e. this has never happened though, like supporting RGBA565 or ASTC as a vertex format).

(I'll go through the rest of your comments in the morning, just didn't want you to think I was ignoring you!)

No prob and thanks for being considerate! :)

icculus commented 2 years ago

To revisit:

why separation of depth stencil state? This would work on metal

It looks like primitive topology needs to be specified in the PSO too in D3D12...my misreading of ID3D12GraphicsCommandList::IASetPrimitiveTopology made me think you could set whatever in the command buffer, but the PSO still wants to know if it's generally a point/line/triangle. :/

So we'll probably move SDL_GpuPrimitive into SDL_GpuPipelineDescription along with the depth stencil state, too.

icculus commented 2 years ago

Generate tmp texture

Load data into tmp texture

Generate mipmaps into tmp

Copy results into final texture

Destroy tmp (or keep it cached for reuse)

I'm thinking we can do this in SDL_GpuGenerateMipmaps(), with the backend deciding whether a texture is ideal (it doesn't have SHADERWRITE on d3d12, etc, so it does the temporary texture proxy, otherwise it just does it directly on the existing texture)...it won't need a new SDL_GPUTEXUSAGE_MIPMAPGEN flag at all in this case, as SDL will handle this transparently.

If doing this transparently takes a performance hit, the primary use case would be loading textures without mipmaps from disk, which is generally considered a slow operation (and for large enough games, is probably tossed into a background thread).

But I might be oversimplifying things here...?

AllianceDarkSylinc commented 2 years ago

If doing this transparently takes a performance hit, the primary use case would be loading textures without mipmaps from disk, which is generally considered a slow operation (and for large enough games, is probably tossed into a background thread).

But I might be oversimplifying things here...?

No, that's a bad idea.

If I have a RENDERTARGET texture, and want to call SDL_GpuGenerateMipmaps every frame, on D3D11 it will work as expected. But on D3D12 it will use the tmp path every frame because it's missing SHADERWRITE,

And even if you keep the MIPMAPGEN flag, you have to consider the issue of silent user error: Users who will call SDL_GpuGenerateMipmaps every frame without the MIPMAPGEN flag would end up using the slow tmp copy path.

And users who who actually want to use the tmp copy path, may end up adding the MIPMAPGEN flag because they enabled it on purpose thinking that it is necessary when it's not.

To fix that silent error the API needs to be more explicit, like SDL_GpuGenerateMipmaps( bool bUserTmpCopyPath ). And the API will give a perf warning if bUserTmpCopyPath == true and MIPMAPGEN is present, and give an error if bUserTmpCopyPath == false and MIPMAPGEN is absent.

Or instead of adding bool bUserTmpCopyPath a new flag MIPMAPGEN_ON_LOAD (or something like that) is added. The flag indicates SDL_GpuGenerateMipmaps can be called on it, but the copy path should be used.

Update: This is the problem OpenGL has. GL lets you call glGenerateMipmaps on any texture at any time. Very convenient. But doing so means the driver must internally retransform the texture to add the RENDERTARGET flag (this is has been the cause of lots of driver bugs in the past) and once it's added it stays forever, which means you pay a hidden price just for calling the function; and only experts will tell you about this pitfall.

icculus commented 2 years ago

If I have a RENDERTARGET texture, and want to call SDL_GpuGenerateMipmaps every frame, on D3D11 it will work as expected. But on D3D12 it will use the tmp path every frame because it's missing SHADERWRITE,

oh, I see. Okay, let me think about this more.

icculus commented 2 years ago

OpenGL

These are all good points about OpenGL, btw, and it will likely work much like you described...but also I'm trying to decide what the surface area looks like for "people that have a modern version of OpenGL and don't have any other options," and I suspect it's pretty small, outside of WebGL in current times and some smaller devices that expect you to use GLES3...which is not to minimize it, but focusing on Vulkan, Direct3D and Metal backends first and then making GL4/GLES3/WebGL2 limp along later seems like a reasonable strategy, at least on paper.

icculus commented 2 years ago

SDL_GpuPipelineDescription contains SDL_GpuColorAttachmentDescription, which contains a SDL_GpuTexture.

It doesn't, I have two unrelated structs with the same name, which is obviously a different problem. :)

icculus commented 2 years ago

Latest revision moves depth stencil details into SDL_GpuPipelineDescription and fixes SDL_GpuColorAttachmentDescription naming.

darksylinc commented 2 years ago

what the surface area looks like for "people that have a modern version of OpenGL and don't have any other options,"

Do you mean as in "My HW or drivers are too old to run anything better" ?

If that's the case then:

On Windows D3D11 is the best bet. You can use FL level 10.0 to target D3D10 HW which go all way back to 2006 with GeForce 8000 series, AMD Radeon HD 2000 (2007), and Intel Series 4 (Year 2008: Eaglelake's GMA X3100, GMA 4500, then HD Graphics).
- There will be a few restrictions that may apply on the features that can be used with FL 10. Good thing is that D3D11 will complain even in your most modern system if you just select FL 10.0
On Windows GL is a terrible bet. The OpenGL drivers for most of these HW is too buggy and not updated anymore in years. Unless you use ANGLE, which will translate OpenGL to D3D11 for you (or in the future, Zink; but that will require Vulkan)
Out of personal experience with Intel GPUs, Intel HD 4000 is usually the oldest that doesn't give a tremendous amounts of pain due to either SW or HW bugs. However technically if you target D3D11 FL 10 the older cards are going to work, just... expect bugs.
On Linux there's the special case of AMD Radeon HD 2000 - 6000 and equivalent Intel GPUs that actually have decent OpenGL 3.x drivers or even "almost 4.x" (thanks to extensions) but not Vulkan. These users typically use Windows instead of Linux. The pool is extremely small but it does exist.

As for the market share in general, I wouldn't dismiss it. It won't and cannot run the latest AAA games, these machines don't appear in Steam survey because it's not their market.

The people who use these machines are usually kids who play the least demanding games they can run (Minecraft, Roblox, Dota, LoL, WoW, CS:GO, Team Fortress, Overwatch, Fortnite) on their parent's "office" computer.

Fortunately the landscape has improve with i3 CPUs becoming more common in offices, which are bundled with a reasonable Intel iGPU (in terms of API/driver support).

amerkoleci commented 2 years ago

@darksylinc what about d3d12? For my game engine I went fully bindless with vulkan and d3d12, would vulkan 1.2 as min spec work?

darksylinc commented 2 years ago

what about d3d12? For my game engine I went fully bindless with vulkan and d3d12, would vulkan 1.2 as min spec work?

What about it? I don't understand the question, and feels a little off topic too.

amerkoleci commented 2 years ago

Is d3d12 viable as better choise over d3d11? And which vulkan min spec is planned to be supported here? Vulkan 1.1, 1.2 or 1.3?

icculus commented 2 years ago

Is d3d12 viable as better choise over d3d11? And which vulkan min spec is planned to be supported here? Vulkan 1.1, 1.2 or 1.3?

I'm starting with D3D12, but my expectation is that I will put a D3D11 backend in as well. I almost certainly won't go lower than that; D3D11 was available starting with Windows 7 (and apparently backported to Windows Vista!), and that feels like more than enough legacy support.

I don't think we'll need anything beyond Vulkan 1.0, at least at first, as far as I can tell, but there's a lot of small details I expect to stumble over as I go, so we'll see what pops up that needs 1.1 or whatnot.

Anyhow, the initial goals are D3D12, Metal, and Vulkan, and then things like GL4/GLES3/WebGPU can be considered (and possibly abandoned).

icculus commented 2 years ago

Pushed changes that resolve most of the discussed things; I think the only thing still unresolved is what to do about generating mipmaps, which I'm still considering.

icculus commented 2 years ago

OpenGL does not support samplers being separate from textures at the shader level

Sorting through this feedback again, and I'm just making sure I'm clear on this part: Is this problem not solved by GL_ARB_sampler_objects?

darksylinc commented 2 years ago

It's not. That partially solves the problem on the C++ side, but not on the shader side

icculus commented 2 years ago

Ah, I see...HLSL lets you do myTexture.Sample(mySampler, texcoords) but GLSL only offers texture(myTextureUnit, texcoords). :(

icculus commented 2 years ago

Opinions welcome on the current plans for vertex shader syntax: icculus/SDL_shader_tools#3

sridenour commented 2 years ago

Re: depth buffer formats

Just FYI: Depth24_Stencil8 isn't available in Metal on iOS (it is on macOS, though). Both platforms do support Depth16, Depth32Float, Stencil8, and the convenience format Depth32_Stencil8.

Maybe SDL GPU should add Depth32Float and Depth32Float_Stencil8, assuming those are also supported on DX12 and Vulkan..

icculus commented 1 year ago

(Just bumping this comment over here to have discussion stay in one place. --ryan.)

I'm not a huge fan of manually exposing Fences as a client-side resource. Since their intended purpose is to query command buffer status and wait on command buffer completion, I think these should be command buffer APIs (SDL_QueryCommandBufferStatus / SDL_WaitForCommandBuffer), with the various backends handling the fences themselves.

Making fence submission optional on the client side also has the side effect of making a Vulkan implementation significantly more complicated, since in most Vulkan renderers, command buffers always need a fence to manage and clean up internal state. This means that the backend would need to manage fences internally anyway, and since Vulkan only allows you to submit one fence per command buffer submission, internal cleanup and user-side notifications would have to be conflated. This seems unnecessarily complicated.

Originally posted by @TheSpydog in https://github.com/libsdl-org/SDL/issues/7067#issuecomment-1382251359

darksylinc commented 1 year ago

I had to dig through my source code to see how it's handled and realized TheSpydog is right, but with some remarks to do.

You only get one VkFence per command buffer. And Metal works in a similar way too (they just call it a semaphore, which has nothing to do with VkSemaphore).

But I would suggest a slight separation, having fences still exposed to user:

SDL_Fence fence = SDL_GetCommandBufferFence( cmdBuffer );
SDL_WaitForFence( fence );

That is, still expose fences to users, but make it explicit that fences come from Cmd Buffers (instead of the user being able to create them on a whim at random).

It also makes explicit that waiting for a fence requires flushing a cmd buffer, which in certain cases could have nasty side effects. If you fully separate fences from cmd buffers, then hiding this fact from users will leave in a gigantic world of pain.

Why?

It's a matter of separation of concepts. A 'fence' is something we are waiting for. A 'command buffer' is a huge chunk of work, potentially consuming a lot of memory, we may not even care about.

In OgreNext we have the following pattern for data we must download GPU -> CPU when highly accurate fence tracking is needed:

vkCmdCopyImageToBuffer( mQueue->mCurrentCmdBuffer, srcTextureVk->getFinalTextureName(),
                        VK_IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL, dstBuffer, 1u, &region );

if( accurateTracking )
{
    mAccurateFence = mQueue->acquireCurrentFence();
    // Flush now for accuracy with downloads.
    mQueue->commitAndNextCommandBuffer();
}

We don't really care about cmd buffers. We care that the vkCmdCopyImageToBuffer is done.

In fact the cmd buffer may already start being recycled while the VkFence is still not recycled (once we're done with the fence, we also recycle it too)

Detaching Cmd buffers from fences allows recycling to be separate.

What's an accurate fence?

In OgreNext, when you need to download data from GPU to CPU, you could do 3 things:

Initiate download, 3 frames later (triple buffer) you access the downloaded data. This is protected by the frame's VkFence.
Initiate download, do some mild work, check on the VkFence, if it's not ready do more work. Check again, if we're out of work to do, stall on the VkFence
Initiate download, stall

Accurate fences are for dealing with scenario 2: the download is quite urgent, but we can do some work we have left rather than stalling.

What about command buffers without fences?

The schemes TheSpydog or I propose assume there is one VkFence per Cmd Buffer, while Vulkan has no such requirement (you can submit a cmd buffer without any fence).

Well, you could either assume all cmd buffers need a VkFence (in OgreNext we do that, because rendering engines are complex and for various reasons sooner or later you always end up needed that fence), or just add an API e.g. SDL_SetHasFence( cmdBuffer, true )

EDIT: I suspect disabling a cmd buffer's fence will be a bad idea. In order to know if a cmd buffer is over (and thus recycle or destroy it), you need a fence. If you submit two cmd buffers in a row, CMD A is not guaranteed to finish before CMD B is finished, unless you've used a VkSemaphore to explicitly state that CMD B depends on A.

Thus if you submit cmd buffers A without fences and without semaphores, and then B with a fence, you have no way to query when is A over (unless you submit new cmd C that issues a full barrier, or a vkDeviceWaitIdle).

thatcosmonaut commented 1 year ago

But I would suggest a slight separation, having fences still exposed to user:
SDL_Fence fence = SDL_GetCommandBufferFence( cmdBuffer );
SDL_WaitForFence( fence );

The problem with this approach is that you are still exposing the user to fence management - now every time they submit a command buffer they will have to wait for the command buffer to finish and then free the fence that the submit command returns, or they will have a memory leak, and the Vulkan validation layer will complain loudly. It would be a major headache to do this without blocking the main thread.

Since waiting for a fence is essentially the same thing as waiting for the command buffer to finish working, I don't see any good reason to force the user to think about fences. I've designed two systems this way and it's never ever been a problem. Some console APIs don't even have fences!

thatcosmonaut commented 1 year ago

While I'm here... I'm critical of referring to getting the presentation texture as GetBackbuffer. This is maybe pedantic, but I think the term "backbuffer" carries an implication that the memory will be stable across frames, which is not true in basically any modern API.

What I've done in my own API is have a call AcquireSwapchainTexture which takes a command buffer and returns the swapchain texture - if acquiring the swapchain texture is successful, the command buffer automatically sets up the presentation structure when the command buffer is submitted. If acquiring the swapchain texture was unsuccessful, the call returns NULL to let the user know. Doing it this way has some nice structural benefits for the Vulkan renderer at least, because we can automatically set up a semaphore to wait on the submitted command buffer being finished before presenting the swapchain image. If there's a problem with this approach, I haven't seen it yet.

darksylinc commented 1 year ago

Since waiting for a fence is essentially the same thing as waiting for the command buffer to finish working, I don't see any good reason to force the user to think about fences

No, it's not the same thing.

A command buffer can block literally hundreds of megabytes from being released. A fence does not.

I've run into this problem when building cubemap probes for reflections. A lot of temporary textures & buffers are created, and having a small array of 3x3 cubemaps with 3 split of PSSM means 6 faces x 9 cubemaps x 4 passes per face = 216 passes.

That's a lot of drawcall information stored in the cmd buffer(s). But it's not just that, you can't release tmp textures & buffers either until the cmd buffers have been reset/destroyed. Everything depends on those cmd buffers.

Carefully & properly flushing & releasing command buffers is the difference between running out of memory and generating these probes successfully. And you can't do that if you need to wait on a fence and that forces blocking the cmd buffer from being released.

You are right that my example code was flawed though, as the correct code would be:

SDL_Fence fence = SDL_GetCommandBufferFenceAcquire( cmdBuffer ); // increases ref count on fence
SDL_WaitForFence( fence );
SDL_FenceRelease( fence ); // We are done with the fence, decrease ref count

Unfortunately, you can't hide the fences unless the use cases will be trivial.

What you can though, is design an API so that fences are acquired from a cmd buffer, instead of creating them arbitrarily.

thatcosmonaut commented 1 year ago

I am extremely confused. You can't safely release data related to commands in a command buffer or the command buffer itself unless the command buffer has completed, and the typical way to know if a Vulkan command buffer has completed execution is to check if its associated submission fence is signaled. Am I missing something here? I know there are ways to do fine-grained command checks using vkEvent but that seems unrelated to what we are discussing here.

darksylinc commented 1 year ago

I am extremely confused. You can't safely release data related to commands in a command buffer or the command buffer itself unless the command buffer has completed, and the typical way to know if a Vulkan command buffer has completed execution is to check if its associated submission fence is signaled.

That is correct. But the reasons to wait on a fence are varied.

Yes, the primary reason to wait on a fence would be to know if its associated cmd buffer is done and can therefore be recycled. But a secondary reason to wait on a fence is because we want to know if a particular command (like a vkCmdCopy*) inside that command buffer has finished.

If that command has finish, it means it's safe to access the mapped region from CPU. Hence the user may want to hold on to a VkFence for longer than it wants to hold on to a VkCommandBuffer

Thinking out loud:

Perhaps it may be possible to not expose an SDL_Fence if an SDL_CommandBuffer internally recycles its VkCommandBuffer (which means the same SDL_CommandBuffer could be internally associated with many VkCommandBuffer across the lifetime of an app).

When SDL determines a SDL_CommandBuffer is over, it recycles and unsets its internal VkCommandBuffer, but still keeps its VkFence alive for wait/querying purposes.

That would make debugging harder though, when things go wrong (e.g. a seemingly-impossible crash says the SDL_CommandBuffer ptr is the same, but the user fails to realize the internal VkCommandBuffer ptr went stale because this implementation detail is hidden from him)

thatcosmonaut commented 1 year ago

If that command has finish, it means it's safe to access the mapped region from CPU. Hence the user may want to hold on to a VkFence for longer than it wants to hold on to a VkCommandBuffer

Why? What extra information does the fence contain?

darksylinc commented 1 year ago

Why? What extra information does the fence contain?

None.

It's simple derivation/deduction: If vkCmdCopyBufferToImage ABC is inside vkCmdBuffer DEF and VkFence X says DEF is finished, then by logical deduction ABC is also finished.

Which implies that whatever VkBuffer vkCmdCopyBufferToImage ABC writes to, can now be safely accessed.

thatcosmonaut commented 1 year ago

So why do we need a client-side fence handle when what we actually want to know is if the command buffer we submitted is done executing?

darksylinc commented 1 year ago

Because we want to recycle the command buffer handle ASAP while the fence still lives.

thatcosmonaut commented 1 year ago

I see... this also brings up the point that if we want to be able to query command buffer status using a command buffer handle, then the handles themselves have to remain valid beyond submission.

I think there are two directions we could go here: 1) Expose SDL_GpuFence as a resource and allow the user to optionally obtain the fence during command buffer submission. If the fence is obtained, the user is responsible for freeing it later, otherwise it can be automatically recycled internally when appropriate.

2) Do not expose fences and allow the user to query SDL_GpuCommandBuffer status. The user is responsible for releasing the command buffer from client-side whenever they want.

Either way, command buffers will have to be marked for recycling rather than recycled immediately because the implementation can only clean up resources and reset once it determines that the command buffer is done executing. Resource Destroy calls should not happen immediately, the implementation should do ref counting on resources to make sure they aren't referenced by any command buffers currently executing before they are actually destroyed.

qbojj commented 1 year ago

The problem

Some formats may not be supported by some APIs/drivers:

as it was already stated no Depth24_Stencil8 on iOS Metal, but there is Depth32Float_Stencil8
some APIs/implementations do not support certain vertex input/texture formats
In Vulkan there are only small subset of formats that has guaranteed for all flags in SDL_GpuTextureUsage (including the SDL_GpuVertexFormats use in vertex buffers).

Vulkan's required format support:

SDL_GPUPIXELFMT_B5G6R5 => VK_FORMAT_B5G6R5_UNORM_PACK16:
- not available
- VK_FORMAT_R5G6B5_UNORM_PACK16 is supported but not for storage image (shader read/write)
SDL_GPUPIXELFMT_BGR5A1 => VK_FORMAT_B5G5R5A1_UNORM_PACK16:
- not available
- VK_FORMAT_A1R5G5B5_UNORM_PACK16 is supported but not for storage image (shader read/write)
SDL_GPUPIXELFMT_RGBA8 => VK_FORMAT_R8G8B8A8_UNORM:
- always supported
SDL_GPUPIXELFMT_RGBA8_sRGB => VK_FORMAT_R8G8B8A8_SRGB:
- no support for storage image (shader read/write)
SDL_GPUPIXELFMT_BGRA8 => VK_FORMAT_B8G8R8A8_UNORM:
- no support for storage image (shader read/write)
SDL_GPUPIXELFMT_BGRA8_sRGB => VK_FORMAT_B8G8R8A8_SRGB:
- no support for storage image (shader read/write)
SDL_GPUPIXELFMT_Depth24_Stencil8 => VK_FORMAT_D24_UNORM_S8_UINT:
- no support*

*depth formats are strange:

D16 format is required to support sampling (only nearest sampling), blits from texture and using as depth attachment
D32F format is required to support sampling (only nearest sampling), blits from texture
Either X8D24 or D32F can be used as depth attachment
Either D24 or D32FS8 can be used as depth/stencil attachment
No depth format is required to support blits to the texture nor storage image (shader read/write)

Similarly with vertex input formats (I will not write them here, everything is written here)

Also textures from a swapchain may have different capabilities than those of normally allocated.

The (potential) solution

Some of those issues may be emulated, albeit with major performance degradation, by using storage buffers, emulating different formats AND samplers AND blending AND multisampling AND blits and probably even more...

Because of this I would suggest that formats should have a query if and with what usages is that pixel format supported and if that vertex format is supported. It would also ease the way into adding compressed/YCbCr formats.

ericoporto commented 1 year ago

the initial goals are D3D12, Metal, and Vulkan, and then things like GL4/GLES3/WebGPU can be considered (and possibly abandoned).

I think you are mixing WebGL and WebGPU.

icculus commented 1 year ago

Yes, I think I meant WebGL there; the current intention is to write a WebGPU backend, but that's probably coming after DX12, Metal, and Vulkan.

GunpowderGuy commented 8 months ago

@icculus Speaking of which, how does the sdl gpu api compare against the seemingly very similar webgpu? Both aim to provide a cross platform interface over current gen apis ( metal, vulkan and direct x 12 ) but with a simpler to use interface

Both will run on native platforms and on the web ( eventually for sdl gpu ) . And both introduce a custom shader language

As far as i know, the main difference is that webgpu applications running on the web will need to go through one fewer translation layer than sdl gpu ones

DanielGibson commented 7 months ago

What happened to the GPU API? I thought it was supposed to be shipped with SDL3, but as far as I can tell it's not in the SDL3 prerelease?

Is it dead (the branch hasn't had a commit in >5 months)?

GunpowderGuy commented 7 months ago

@DanielGibson The code may have been merged to main branch

ericoporto commented 7 months ago

Here's the PR for reference: https://github.com/libsdl-org/SDL/pull/7067

Also the other PR has this comment: https://github.com/libsdl-org/SDL/pull/9312#issuecomment-2008049746

there is no scenario where either this PR or my branch is landing in the 3.2.0 release in the name of expediency.

icculus commented 7 months ago

I've been crunching on all the other parts of SDL3; it's still coming, it's just not there yet.

DanielGibson commented 7 months ago

Ok, thanks for the clarification :)

icculus commented 3 months ago

So we're moving forward with the alternate implementation in https://github.com/libsdl-org/SDL/pull/9312 ...

It's pretty close to what I envisioned for the API (and most of the differences are things I would have changed to eventually when I ran into implementation problems later), and a whole team of talented graphics programmers have been focused on building it for several months now.

I would encourage you to look it over and give feedback on that PR before the API locks down.

Right now that PR does not have an SDL-specific shader bytecode (but is built so it can be added later, and I will be doing that work when SDL3 is solidified).

I intend to keep on with a custom shading language, because I think that in the right circumstances it will be amazingly useful, but the GPU API will not require it. When we get to the custom bytecode, we will make an effort to let other shader languages target it, either in patches to various compilers or a bytecode transpiler thing. That work is also separate from the new PR.

icculus / SDL

GPU API Feedback #1

Unsupported vertex formats

Merging format enums?

Format enum

More flags

More Texture types

Being very clear about 'depth'

What's the difference between depth and slice?

OpenGL

One-time mipmaps

SDL_GpuPipelineDescription's attachments

Why?

What's an accurate fence?

What about command buffers without fences?

Thinking out loud:

The problem

The (potential) solution