Proposal to split the draw stage into two passes

smoogipoo commented 4 years ago

I'm documenting something that's been at the back of my mind for a while now, so that the direction I want to take isn't lost.

What is a `DrawNode`?

A DrawNode is the logic required for the draw stage to know how to draw a Drawable to the screen. It's the earliest point where execution is deferred to the draw thread for rendering, meaning it has to maintain the hierarchical structure of Drawables. It's logic in that it tells the draw thread what vertices need to be drawn, what primitive type to use, whether masking needs to be enabled, whether a framebuffer should be drawn to, etc.

But right now it serves a secondary purpose which is to paint the screen by sending GL commands. This becomes a bottleneck for optimisation since there is no retained knowledge of what a DrawNode actually does, except while the screen is being painted inside each DrawNode.

Going forward I propose a split of the draw stage into two passes, which I'll call Draw and Paint (names tbd).

The "draw" pass

This pass traverses the DrawNodes and invokes Draw() on each one. In turn, the DrawNodes return a set of Intents (name tbd) that are aggregated into a single array.

Examples of Intents include:

Begin(drawNode) - Indicates the start of a DrawNode's draw procedure.
UseBatch(type, length) - Indicates the start of a vertex batch of a specific primitive type.
PushVertex(vertex) - Indicates a vertex should be added.
SetMasking(maskingInfo) - Indicates that masking should be set.
UseFrameBuffer(width, height, formats) - Indicates that a frame buffer should be drawn to.

The final structure will likely look different to the above, but the general idea remains - DrawNodes return a list of intents rather than directly painting to the screen.
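As a rough illustration of the draw pass, here is a minimal Python sketch in which each DrawNode returns intents instead of painting. Every name in it (`QuadDrawNode`, `generate_intents_subtree`, the intent fields) is hypothetical and only mirrors the examples listed above; `id(self)` stands in for whatever hash a real DrawNode would expose.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


# Hypothetical intent types, modelled on the examples in the list above.
@dataclass(frozen=True)
class Begin:
    draw_node_hash: int  # identifies the DrawNode whose procedure starts here


@dataclass(frozen=True)
class UseBatch:
    primitive: str  # e.g. "quad"
    length: int     # number of vertices requested


@dataclass(frozen=True)
class PushVertex:
    position: Tuple[float, float]


Intent = Union[Begin, UseBatch, PushVertex]


class DrawNode:
    """Emits a Begin intent, then its own intents, then those of its children."""

    def __init__(self, children=()):
        self.children = list(children)

    def draw(self) -> List[Intent]:
        # The base node draws nothing itself.
        return []

    def generate_intents_subtree(self) -> List[Intent]:
        # id(self) is a stand-in for a real DrawNode hash (assumption).
        intents: List[Intent] = [Begin(id(self))]
        intents.extend(self.draw())
        for child in self.children:
            intents.extend(child.generate_intents_subtree())
        return intents


class QuadDrawNode(DrawNode):
    """Hypothetical node that describes one axis-aligned quad."""

    def __init__(self, x, y, width, height, children=()):
        super().__init__(children)
        self.rect = (x, y, width, height)

    def draw(self) -> List[Intent]:
        x, y, w, h = self.rect
        corners = [(x, y), (x + w, y), (x + w, y + h), (x, y + h)]
        return [UseBatch("quad", 4)] + [PushVertex(c) for c in corners]


root = DrawNode([QuadDrawNode(0, 0, 10, 10), QuadDrawNode(20, 0, 5, 5)])
intents = root.generate_intents_subtree()
```

The result is one flat list that still records where each DrawNode begins, which is what a later pass would need in order to look up per-node optimisations.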

The "paint" pass

Using the list of intents generated through the draw pass, the paint pass aims to convert these intents into GL commands.

This is also where optimisations take place, since the conversion is done in three stages:

1. Apply any available first-chance optimisations. This is a linear pass through the list of intents which adjusts the intents to consider their context.
2. Convert the intents into GL commands.
3. Defer the final intent list to a background thread for further optimisation.
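The three stages above could be sketched roughly as follows. This is only an illustration under assumptions: intents are plain tuples, "GL commands" are represented as strings, and `skip_repeated_masking` is a made-up first-chance optimisation chosen for brevity.

```python
import queue
from typing import Callable, List, Tuple

# Hypothetical representation: ("set_masking", "m0"), ("push_vertex", "v0"), ...
Intent = Tuple[str, ...]
Optimisation = Callable[[List[Intent]], List[Intent]]


def skip_repeated_masking(intents: List[Intent]) -> List[Intent]:
    """Made-up optimisation: drop a set_masking that repeats the active one."""
    result: List[Intent] = []
    active = None
    for intent in intents:
        if intent[0] == "set_masking":
            if intent == active:
                continue
            active = intent
        result.append(intent)
    return result


def paint(intents: List[Intent],
          first_chance: List[Optimisation],
          background: queue.Queue) -> List[str]:
    # Stage 1: apply every first-chance optimisation that is already available.
    for optimise in first_chance:
        intents = optimise(intents)
    # Stage 2: convert intents into commands (strings stand in for GL calls).
    commands = [f"{name}({', '.join(args)})" for name, *args in intents]
    # Stage 3: hand the final intent list to a background worker for analysis.
    background.put(list(intents))
    return commands


pending: queue.Queue = queue.Queue()
frame = [
    ("set_masking", "m0"), ("push_vertex", "v0"),
    ("set_masking", "m0"), ("push_vertex", "v1"),
]
commands = paint(frame, [skip_repeated_masking], pending)
```

Here the second `set_masking` is removed before any command is generated, and the optimised list is what gets queued for the background stage.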

First and second-chance optimisations

First-chance optimisations are immediately available to the paint pass. They require a single linear pass over the hierarchy and generally only deal with the hash code of a DrawNode.

Example optimisations that can be done:

Merging subsequent requests for vertex batch usage.
Adjusting the grouping of vertex batches to reduce the frequency of batch updates (imagine something like a garbage collection algorithm).
Subdividing framebuffers to be used across framebuffer usage requests.
Re-ordering intents to merge disjoint sets of non-masked vertices separated by masking.
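To make the first item above concrete, here is a small sketch of merging subsequent batch requests in one linear pass. The intent classes are hypothetical, and the rule that a masking change forces a new batch is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


# Hypothetical intent types for this example only.
@dataclass
class UseBatch:
    primitive: str
    length: int


@dataclass
class PushVertex:
    vertex: str


@dataclass
class SetMasking:
    info: str


Intent = Union[UseBatch, PushVertex, SetMasking]


def merge_batches(intents: List[Intent]) -> List[Intent]:
    """Fold consecutive requests for the same primitive into one larger batch."""
    merged: List[Intent] = []
    current: Optional[UseBatch] = None  # open batch that later requests can join
    for intent in intents:
        if isinstance(intent, UseBatch):
            if current is not None and current.primitive == intent.primitive:
                current.length += intent.length
                continue
            # Copy so that the caller's intents are left untouched.
            current = UseBatch(intent.primitive, intent.length)
            merged.append(current)
        else:
            if isinstance(intent, SetMasking):
                # Assumption: a masking change requires a batch change.
                current = None
            merged.append(intent)
    return merged


frame = [
    UseBatch("quad", 4), PushVertex("a"),
    UseBatch("quad", 4), PushVertex("b"),
    SetMasking("m"),
    UseBatch("quad", 4), PushVertex("c"),
]
merged = merge_batches(frame)
```

The first two quad batches collapse into a single request for eight vertices, while the batch after the masking change stays separate.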

However, not all optimisations are immediately available, since they may take additional processing that would slow down the painting process. These are called second-chance optimisations as they become available one or more frames in the future. They are computed on a background thread that is given the final intent list, after which they are upgraded to first-chance optimisations and become available in the next frame.
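One way to picture the promotion from second-chance to first-chance is a shared store that the background thread writes to and the paint pass reads from. The sketch below is hypothetical throughout (`OptimisationStore`, `analyse_in_background`, `dedupe_adjacent`), and the "expensive" analysis is reduced to a trivial check.

```python
import threading
from typing import Callable, Dict, List, Tuple

Intent = Tuple[str, ...]
Optimisation = Callable[[List[Intent]], List[Intent]]


def dedupe_adjacent(intents: List[Intent]) -> List[Intent]:
    """Drop any intent that is identical to the one directly before it."""
    return [x for i, x in enumerate(intents) if i == 0 or x != intents[i - 1]]


class OptimisationStore:
    """First-chance optimisations keyed by DrawNode hash, safe across threads."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._first_chance: Dict[int, List[Optimisation]] = {}

    def available(self, node_hash: int) -> List[Optimisation]:
        with self._lock:
            return list(self._first_chance.get(node_hash, []))

    def promote(self, node_hash: int, optimisation: Optimisation) -> None:
        with self._lock:
            self._first_chance.setdefault(node_hash, []).append(optimisation)


def analyse_in_background(store: OptimisationStore, node_hash: int,
                          final_intents: List[Intent]) -> threading.Thread:
    def work() -> None:
        # Stand-in for costly analysis: look for adjacent duplicate intents.
        pairs = zip(final_intents, final_intents[1:])
        if any(a == b for a, b in pairs):
            store.promote(node_hash, dedupe_adjacent)

    thread = threading.Thread(target=work)
    thread.start()
    return thread


store = OptimisationStore()
frame = [("use_batch", "quad"), ("use_batch", "quad"), ("push_vertex", "v0")]

before = store.available(42)  # frame N: nothing known yet
worker = analyse_in_background(store, 42, frame)
worker.join()  # joined here only to keep the example deterministic
after = store.available(42)   # a later frame: the optimisation is first-chance

optimised = frame
for optimise in after:
    optimised = optimise(optimised)
```

In a real frame loop the paint pass would not wait for the worker; it would simply pick up whatever has been promoted by the time the next frame starts.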

The structure I'm imagining is similar to the following:

Update(root: Drawable):
    root.UpdateSubTree()
    rootDrawNode: DrawNode = root.GenerateDrawNodeSubTree()
    Draw(rootDrawNode)

Draw(root: DrawNode):
    intents: List = root.GenerateIntentsSubTree()
    Render(intents)

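Render(intents: List):
    // Stage 1: apply the first-chance optimisations known for each DrawNode.
    foreach (bi: BeginIntent in intents)
        optimisations: List = optimisationDictionary[bi.Hash]
        // Iterate over a copy so that stale optimisations can be removed safely.
        foreach (o: Optimisation in Copy(optimisations))
            if (o.IsValid(intents))
                o.Apply(intents)
            else
                optimisations.Remove(o)
    // Stages 2 and 3 (GL command conversion, background hand-off) follow here.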

Notes:

The list of intents may be better suited as a linked list.

This structure opens up a lot of possibilities for us: 

Optimisations become context aware and are able to drill down further than they currently can.
Optimisations can be hardware specific. We can enable/disable optimisations depending on the GPU used.
The render pass becomes the single source of graphics API command generation. As we look towards supporting Vulkan/Metal/DX9/10/11/12, we can define a different render pass for each one that makes better use of the constructs implemented in those APIs. We'll have to investigate whether this would be useful in the context of Veldrid, but we aren't 100% sold on Veldrid either at this point.

bdach commented 4 years ago

Huh, so a sort of intermediate representation for draw calls with just-in-time optimisations... Sounds both viable and very cool in concept. The devil is in the details however and I'm afraid there might be very many details as it's graphics APIs we're talking about.

Out of curiosity, have you already hit a situation where you were lacking context knowledge to apply a known optimisation?

smoogipoo commented 4 years ago

Before tightening up vertex batching parameters a while back, I did try to implement the GC-like idea. That requires knowing exactly which vertices are going to be drawn ahead of time and whether they’ve changed/how often they’ve changed. It also requires knowing when a batch change is required (e.g. via masking or uniform changes).

I imagined “generations” of vertices: an ephemeral (“streaming”) pool containing always-changing vertices, a “dynamic” pool above that which contains vertices that don’t change every frame, and then a “static” pool even further above which contains unchanging vertices. That would reduce the number of uploads, and further heuristics could re-order vertices in the pools to reduce the number of batch changes (dependent on the aforementioned masking changes and other heuristics like whether regions overlap, etc., which would be treated as second-chance optimisations in my model).
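A very rough sketch of that generational idea, assuming vertices are promoted purely by how many consecutive frames they have stayed unchanged. The class name and both thresholds are invented for illustration and are not taken from any existing implementation.

```python
from typing import Dict, Hashable, Tuple


class GenerationalPools:
    """Assigns vertex data to a pool based on how long it has been unchanged."""

    DYNAMIC_AFTER = 2  # hypothetical: unchanged frames before leaving "streaming"
    STATIC_AFTER = 5   # hypothetical: unchanged frames before reaching "static"

    def __init__(self) -> None:
        self._last: Dict[Hashable, Tuple] = {}   # key -> most recent vertex data
        self._stable: Dict[Hashable, int] = {}   # key -> consecutive unchanged frames

    def submit(self, key: Hashable, data: Tuple) -> str:
        """Record this frame's data and return the pool it belongs in."""
        if key in self._last and self._last[key] == data:
            self._stable[key] += 1
        else:
            # New or changed data falls back to the youngest generation.
            self._last[key] = data
            self._stable[key] = 0
        return self.pool_of(key)

    def pool_of(self, key: Hashable) -> str:
        frames = self._stable[key]
        if frames >= self.STATIC_AFTER:
            return "static"
        if frames >= self.DYNAMIC_AFTER:
            return "dynamic"
        return "streaming"


pools = GenerationalPools()
# A background quad that never changes is promoted over time.
background_history = [pools.submit("background", (0, 0, 100, 100)) for _ in range(6)]
# A cursor that moves every frame never leaves the streaming pool.
cursor_history = [pools.submit("cursor", (frame, frame)) for frame in range(6)]
```

Only data in the streaming pool would need uploading every frame; anything that changes again drops straight back to that pool.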

Edit: In general I think giving context opens up a lot more optimisation potential than we have right now, if only as a result of deferring the actual draw stage. Even in the current model, composite DrawNodes guess the number of children actually drawn for their batching, which is why batches can overflow and require careful tuning.
