I'm looking at Vulkan and WebGPU as APIs to target. Both of these have some features in common:
Inside a command buffer, you can "enter" a render pass, bind a pipeline, bind buffers, and then submit draws.
Right now I'm most interested in the pipeline objects, and I'm still a bit confused about render pass objects (gonna talk about those later).
Pipelines store a lot of state, approximately: the shaders, the vertex format, the fixed-function settings (blend, depth/stencil, rasterizer), and the render target formats.
In an ideal world, you're supposed to create all the pipelines you want to use upfront and then switch between them at runtime. For LÖVE/LÖVR, we have a highly scriptable immediate-mode API right now, so it isn't really feasible to have lovers specify all of the rendering details upfront. It seems like most applications in this situation do "last-minute" pipeline resolves at draw time, with plenty of hashing and caching to keep things fast (this blog post outlines it a bit). I'm aiming to use this technique, at least at first, to keep things flexible.
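As a rough sketch of what draw-time resolution could look like (nothing from the actual codebase; createBackendPipeline and pipelineKey are made-up placeholders), a draw would hash the currently bound state and only create a backend pipeline on a cache miss:

```lua
-- Sketch of last-minute pipeline resolution with hashing + caching.
local pipelineCache = {}

local function createBackendPipeline(state)
  -- Placeholder for the real vkCreateGraphicsPipelines / WebGPU pipeline creation.
  return { state = state }
end

-- Build a cache key out of everything that feeds into pipeline creation.
local function pipelineKey(state)
  return table.concat({
    state.shader, state.vertexFormat, state.blendMode,
    state.depthTest and 'depth' or 'nodepth', state.cullMode, state.targetFormat
  }, '|')
end

local function resolvePipeline(state)
  local key = pipelineKey(state)
  local pipeline = pipelineCache[key]
  if not pipeline then
    pipeline = createBackendPipeline(state) -- slow path, hit only on new state combos
    pipelineCache[key] = pipeline
  end
  return pipeline
end
```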
I'm planning on making the following API changes:
- Shader: Mostly the same I think. May need to start using uniform buffers more internally.
- Mesh: To make meshes map better onto the Pipeline, I think they should only hold the vertex/buffer format info, and then point at some sort of generic Buffer object.
- Pipeline: Add some sort of object for holding the current global fixed-function state. There's already a Pipeline internally, which is a little confusing because it is different from the monolithic GPU pipeline, so might need to rename that. But I think Pipeline works okay for a Lua-facing name.
- Samplers? No idea yet. We're supposed to use these but they're so tedious.
- Canvas
- Batch
Sorry if this is all still a bit disorganized, still learning and organizing thoughts!
Threading: Vulkan drivers aren't threaded like OpenGL drivers are, leaving it up to the application. I can think of two different ways of taking better advantage of multithreaded rendering. One of them builds on the existing Thread and Channel APIs. This would mean the default path (one function for the draw callback) is "slow", but gives people more control over how they split up and optimize their rendering code.

Here are some of my own thoughts / where my head is at right now. A lot of it matches up well with your notes, I think.
All drawing and GPU state-setting functions will be removed from the global love.graphics API.
New concept & object: Render Pass. It contains some setup info, such as the texture(s) the RenderPass will render to. (Side note: a Canvas is just a texture that's tagged saying it can be rendered to using a RenderPass.) A RenderPass has methods to queue state-setting and drawing commands, which will only be executed when a new function love.graphics.execute(renderpass) is called (naming TBD).
This is pretty much the same concept as vulkan / metal render passes, just at a higher level of abstraction.
Because commands are enqueued instead of executed immediately, it has some pretty big implications for the use of other love.graphics state and data objects – for example if you set the vertex positions on a mesh, queue a draw command for that mesh, and then change the mesh vertex data again before executing the render pass, the draw operation would only reflect the latest changes done after enqueuing the draw command, not before.
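To make that concrete, here is a minimal sketch using the proposed names (newRenderPass, pass:draw, and execute are placeholders for whatever the final API looks like):

```lua
-- Sketch of the deferred-execution behavior described above.
local oldPositions = { {0, 0}, {10, 0}, {10, 10} }
local newPositions = { {0, 0}, {20, 0}, {20, 20} }

local canvas = love.graphics.newCanvas(256, 256)
local mesh = love.graphics.newMesh(oldPositions)
local pass = love.graphics.newRenderPass(canvas) -- placeholder constructor

pass:draw(mesh)                -- only *queues* the draw; nothing runs yet
mesh:setVertices(newPositions) -- modify the mesh after queueing

love.graphics.execute(pass)    -- the queued draw executes now, so it sees
                               -- newPositions rather than oldPositions
```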
There are some things that users will probably want to repeatedly change within a render pass, which is currently only set-able via other objects. The main thing I'm thinking of is shader uniforms. My thought right now is to keep the existing APIs for setting uniforms on a Shader object (or have a very similar version), and add new methods to Render Pass objects to set a shader's uniform which only lasts for the duration of the render pass (or until it's overwritten within that render pass). I haven't ironed out exact specifics of that, though.
I also really like the idea of stackable pipeline state objects that can be applied to a RenderPass. In my head I've been calling them Graphics State objects rather than pipelines. Perhaps local uniforms could be set there as well.
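A sketch of how that might read from Lua, assuming a made-up newGraphicsState constructor and push/pop methods on the pass (none of these names are settled, and pass, worldMesh, and uiQuad are assumed to exist):

```lua
-- Sketch of stackable graphics state objects applied to a render pass.
local opaqueState = love.graphics.newGraphicsState {
  depthTest = 'less',
  blendMode = 'none',
  cullMode  = 'back',
}

local uiState = love.graphics.newGraphicsState {
  depthTest = 'none',
  blendMode = 'alpha',
}

pass:pushState(opaqueState) -- world geometry uses the opaque settings
pass:draw(worldMesh)
pass:pushState(uiState)     -- the UI state only overrides the fields it sets
pass:draw(uiQuad)
pass:popState()             -- back to opaqueState
pass:popState()
```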
I also want Buffers exposed more generally. A Mesh's buffers could be attached to other meshes, instead of attaching meshes themselves together.
I'm thinking textures will still have their own sampler object instead of completely splitting them, but maybe there can be a RenderPass method to make a texture use different sampler state for the duration of the render pass (or until it's overwritten in that render pass).
Command buffers might not need to be exposed as an external API concept. My thought is that the love.graphics functionality outside of render passes (such as executing a compute shader or copying vertex data) is "immediate" and the implementation uses command buffers as appropriate under the hood. Maybe a single vulkan command buffer can be used for multiple love.graphics.execute calls if they're small, or something. That said, I haven't thought about async compute using a separate compute queue. That's probably overkill for now, anyway.
Since running a compute shader would happen outside a render pass, it wouldn't have a way to set one-off short-lived uniform data (without using Shader:send or whatever), but I think it should. I'll have to think about that.
For making use of a low level graphics API's internal prerecording capabilities, maybe a render pass could either have a method to "compile" it for reuse, or a flag on creation or something.
I'm slowly starting to come around to the "render pass" object idea. I was originally uneasy because it doesn't map directly onto one of the GPU concepts, but I realized that it's a really approachable/convenient way to structure a game.
The thing that helped it click for me was thinking about them as "layers". Like how in GDC postmortems or rendering breakdowns, they always present each layer (pass) of the frame individually and layer them on top of each other to get the final result. So for a simple game you might have your static terrain/tile layer, a layer for characters/enemies, and a layer for the UI. Usually each of these are separate render passes with their own state/objects, and so if the LÖV API presented something that let people express that, it would be a pretty big win. Even if it doesn't map directly onto a pipeline/renderpass, it still makes it way easier for the underlying LÖV implementation to do so.
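For example, the layer structure could look roughly like this (all of the pass names and calls are placeholders just to visualize the layer idea, and the scene objects are assumed to exist):

```lua
-- Sketch: one pass per "layer" of a simple 2D game.
function lov.draw()
  local terrain = lov.graphics.newRenderPass(screen)
  terrain:draw(tilemapBatch)

  local actors = lov.graphics.newRenderPass(screen)
  actors:setShader(spriteShader)
  for _, enemy in ipairs(enemies) do
    actors:draw(enemy.sprite, enemy.x, enemy.y)
  end

  local ui = lov.graphics.newRenderPass(screen)
  ui:setBlendMode('alpha')
  ui:print(score, 10, 10)

  -- Execute the layers in order so they composite on top of each other.
  lov.graphics.execute(terrain)
  lov.graphics.execute(actors)
  lov.graphics.execute(ui)
end
```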
I'm not sure but it seems like your RenderPass is going to store a list of commands in memory, and then serialize them to the command buffer/encoder at the time of execute. I'm going to try to do something different and record the commands to the API directly, so that I don't need to store additional memory and reduce overhead a bit. It seems like there are several reasons why this won't work, but I'm still going to try.
Somewhat related -- it could be confusing to have object state only used when passes are executed. The (current) alternative is to sprinkle flushes all over the place, which makes the API nicer but the implementation more annoying. I'm curious if that's still possible in the render pass setup or if it would be prohibitively expensive/complicated. Hmm.
After more research I understand why prerecorded command buffers might not be as necessary as I thought -- modern APIs are way faster at enqueueing draw calls than OpenGL, so it isn't a big deal to do that over and over again. There are still 2 reasons I might be interested in it: A) reducing the Lua-C overhead of enqueueing large numbers of unchanging draws, and B) optimizations that can be done when the set of draws is known (culling, sorting; more relevant for 3D, but I kinda lean towards pushing this to Lua anyway since it's so app-specific).
1 sampler per texture seems like a good approach. It can't really be worse than whatever is going on in OpenGL today. It looks like Vulkan drivers still do caching of samplers anyway. Maybe the lov.graphics default filter can be a global "cached" sampler.
I'm trying to get other work out of the way so I can focus more on implementing this stuff!
Finally started laying the groundwork for this on a branch if you're interested in lurking:
https://github.com/bjornbytes/lovr/compare/gpu
Really just Vulkan boilerplate at this point.
Finally worked up the masochism to start working on this stuff again.
Implemented this API for Texture views
lov.graphics.newTexture(texture, TextureType, firstLayer, layerCount, firstMip, mipCount)
The layer/mipmap stuff is optional. Could also make it a newTextureView function instead of further complicating newTexture.
I haven't tried using it yet but it may end up feeling nicer than passing around { texture, layer, mipmap } tables for texture attachments. It matches the modern APIs better and allows for more powerful stuff (texture type reinterpretation, maybe depth/stencil view stuff or swizzling in the future?).
EDIT: Also added Texture:newView(type, layer, count, level, count).
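A usage sketch of those two view constructors, e.g. to render into a single layer of an array texture; the array-texture creation and renderTo plumbing here are guesses, only the view signatures come from above:

```lua
-- Sketch: texture views instead of { texture, layer, mipmap } tables.
local atlas = lov.graphics.newTexture(512, 512, { type = 'array', layers = 6 })

-- View of layer 3 only, mip 1 only, reinterpreted as a plain 2D texture.
local layerView = lov.graphics.newTexture(atlas, '2d', 3, 1, 1, 1)

-- Same thing using the method form from the EDIT.
local sameView = atlas:newView('2d', 3, 1, 1, 1)

-- The view can then be used directly as a render target.
lov.graphics.renderTo(layerView, function()
  lov.graphics.clear(0, 0, 0, 1)
end)
```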
The pass / command buffer API I'm going to try out is a lov.graphics.render function with two variants. The first one is:

lov.graphics.render(target, function() end)

target is the usual setCanvas table describing the attachments, load/store ops, etc. This one is like Canvas:renderTo. It begins a (cached) render pass, calls the callback containing regular lov.graphics draw calls, and finishes the pass. Any graphics state/bindings set in the callback are temporary to the callback.
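A sketch of the first variant in use; the keys in the target table are made up to stand in for the usual setCanvas-style options, and skyShader/skyMesh are assumed to exist:

```lua
-- Sketch of the callback variant of lov.graphics.render.
local canvas = lov.graphics.newCanvas(1280, 720)

local target = {
  canvas,                 -- color attachment
  depth = true,           -- guessed key: request a depth buffer
  clear = { 0, 0, 0, 1 }, -- guessed key: clear to black on load
}

lov.graphics.render(target, function()
  lov.graphics.setShader(skyShader)
  lov.graphics.draw(skyMesh)
  -- any state set here is temporary to the callback
end)
```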
The second variant is for multithreading:
lov.graphics.render(target, ...batchnames)
You pass in names of prerecorded batches you want to replay. Batches are (secondary) command buffers that can be recorded concurrently. There is a lov.graphics.record function for this:
lov.graphics.record(target, 'nickname', function() end)
You pass in the target you're going to replay on (sadly this is needed for vulkan/webgpu), a name to use for later replays, and a callback similar to the first variant. The batches are temporary and can only be submitted in the same frame they're recorded. The names are used instead of regular userdata to make it easier to use them between threads, avoid GC, and because there's just not a lot of benefit to retaining them since they're temporary.
(I want to explore more persistent batch objects later, but those are wayyy more challenging. They'd at least need to refcount all resources they use and potentially keep around copies of all the temporary matrices/uniforms).
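A sketch of how the record/replay split could be used across threads (the Channel coordination and target table are assumptions; only render and record come from above):

```lua
-- Sketch: record named batches on worker threads, replay them on the main thread.
local channel = lov.thread.getChannel('batches')
local target = { canvas, clear = { 0, 0, 0, 1 } } -- guessed target table

-- Each worker thread would run something like:
--   lov.graphics.record(target, 'terrain', function() drawTerrain() end)
--   lov.thread.getChannel('batches'):push('terrain')

-- Main thread: wait for the batch names, then replay them in one pass.
local names = {}
for i = 1, 2 do
  names[#names + 1] = channel:demand()
end
lov.graphics.render(target, unpack(names))
```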
One thing I like about this is that there are fewer breaking changes to the graphics module. A lot of code that is just setting state and drawing primitives/Drawables in lov.draw will continue to work. That wasn't the case when I was considering Batch/Pass objects.
I decided against the in-memory representation for the command buffers, at least for now. It has some benefits (you can sort/cull/reorder the draws, inspect/serialize the commands), but I really like the low-level approach where your graphics functions in Lua immediately hit GPU command buffers.
One kind of cool thing is that boot.lua can do lov.graphics.render(windowTarget, lov.draw). It might end up being more complicated than that if people want to do other passes in the draw callback or submit batches instead. Maybe just a conf.lua flag though.
I'll report back on how it goes, I have to reorganize a bunch of command buffer/pass/framebuffer stuff first, may run into issues.
EDIT: Mostly dropped this due to design flaws. It almost worked, but in the end wrapping it in a Canvas / Pass object is preferable because you can be recording multiple passes at once and it avoids some clashes with global state. It's also just more lovely. So I am fully on board with Pass objects even though I was somewhat against them at first. I still have a function lovr.graphics.renderTo(textures|canvastable, function(canvas) end) for doing temporary render passes.
Added depth bias and depth clamp states. Not really anything special.
Considering making blend modes and color masks per-target instead of global.
Current idea is for setBlendMode and setColorMask to take an optional target index, and if it's missing it applies to all targets (backwards compatible):
lov.graphics.setBlendMode('add') -- applies to all targets
lov.graphics.setBlendMode(1, 'add') -- only applies to first target
I'm not sure how the getters should work. They could either take an optional target index that defaults to 1, or they could return everything if the target is missing. It might be weird to have getColorMask() return 16 booleans...
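For comparison, the two getter options in sketch form (both hypothetical, just to see the tradeoff):

```lua
-- Option A: the getter takes an optional target index that defaults to 1.
local r, g, b, a = lov.graphics.getColorMask()      -- mask of target 1
local r2, g2, b2, a2 = lov.graphics.getColorMask(2) -- mask of target 2

-- Option B: with no index, the getter returns the state for every target,
-- which for 4 targets means getColorMask() returns 16 booleans.
local results = { lov.graphics.getColorMask() } -- #results == 4 * targetCount
```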
Here are the 3 types of buffers now (somehow they ended up matching opengl's roughly). The dynamic/transient ones are double buffered, and can not have storage usage. I can't imagine metal needs to worry about any of this...
EDIT: Actually dropped the 3 buffer types thing. Instead I'm using usage flags to detect what type of buffer memory to use (a write flag says whether you want to write to it from CPU, a transient (TBD) flag says whether it's okay to discard contents at the beginning of a frame).
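A sketch of what creation with usage flags might look like; newBuffer, the flag spellings, and vertexData are placeholders standing in for the flags described above:

```lua
-- Sketch: buffer memory type inferred from usage flags instead of explicit buffer types.
-- Static vertex data: uploaded once, never written from the CPU afterwards.
local vertexBuffer = lov.graphics.newBuffer(vertexData, { usage = { 'vertex' } })

-- Per-frame uniforms: CPU-writable and discardable each frame, so the
-- implementation can place them in double-buffered transient memory.
local frameUniforms = lov.graphics.newBuffer(256, {
  usage = { 'uniform', 'write', 'transient' }
})
```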
Considering only having 3 draw modes for the 'raw' drawing functionality like Mesh: points, lines, triangles. This seems to be more in line with how D3D12/Metal do things. There will still be a primitive called line that will draw a line strip, but internally it will just use the lines draw mode plus an index buffer (unsure of performance caveats here).
Mm I guess love doesn't need to worry about lines as much since they're already polylines.
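As a concrete illustration of the strip-to-lines expansion mentioned above (plain Lua, independent of any LÖVE/LÖVR API):

```lua
-- Expand a line strip over n vertices into an index buffer for the 'lines' mode.
-- A strip through vertices 1,2,3,4 becomes the segments 1-2, 2-3, 3-4.
local function lineStripIndices(n)
  local indices = {}
  for i = 1, n - 1 do
    indices[#indices + 1] = i     -- segment start
    indices[#indices + 1] = i + 1 -- segment end
  end
  return indices
end

print(table.concat(lineStripIndices(4), ', ')) --> 1, 2, 2, 3, 3, 4
```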
Here is my API for queries:

- lovr.graphics.newTally(type, count)
  - time counts elapsed nanoseconds
  - pixel counts visible pixels (occlusion query)
  - shader counts pipeline statistics (vertex count, vertex/fragment shader invocations, (un)clipped primitives)
- Pass:tick(tally, index) and Pass:tock(tally, index) begin and end a query in the tally (might rename)
- Pass:read(tally, index, count) returns a Readback with the query results
- Pass:copy(tally, buffer, srcindex, dstoffset, count) copies tally results to a buffer
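A usage sketch that just follows the signatures above, with a made-up frame-timing scenario (pass, scene, and statsBuffer are assumed to exist):

```lua
-- Sketch: timing a block of GPU work with a 'time' tally.
local timer = lovr.graphics.newTally('time', 2)

pass:tick(timer, 1) -- begin query 1
pass:draw(scene)
pass:tock(timer, 1) -- end query 1

-- Read the elapsed nanoseconds back on the CPU...
local readback = pass:read(timer, 1, 1)
-- ...or copy the result into a Buffer for GPU-side use.
pass:copy(timer, statsBuffer, 1, 0, 1)
```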
I just merged my Vulkan branch, so I won't be in brainstorming/API design mode as much. This issue was fun to have as a diary.
High level notes/findings on what a modern LÖVEly graphics API could look like.