Sparse strip path rendering

This is the tracking issue for sparse strips (not the design document).

Sparse strips represent the next generation of GPU path rendering, promising better performance and easier integration into other systems. They are also a critical step toward conflation-artifact-free compositing.

The quick summary is to generate tiles (similar to Vello now, but probably smaller), sort them, then do a rendering operation based on “boundary fragment merge” in the Li et al. scanline paper. The result is essentially a run-length-compressed rendered path, containing all the grayscale antialiased pixels at the boundary of the path but representing solid regions sparsely.
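To make the "tiles, sort, merge" summary concrete, here is a minimal CPU sketch. It is not the design from the Sparse strips document: the `Tile`, `Strip`, `build_strips`, and `solid_spans` names are hypothetical, each tile carries a single net winding `delta` (a real implementation tracks coverage per pixel), and the nonzero fill rule is assumed.

```rust
/// A tile touched by the path boundary, in tile coordinates (hypothetical type).
/// `delta` is the net winding change picked up when crossing this tile left to
/// right; this is a simplification made for the sketch.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Tile {
    pub y: u32,
    pub x: u32,
    pub delta: i32,
}

/// A run of horizontally adjacent boundary tiles in one tile row.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct Strip {
    pub y: u32,
    pub x0: u32,      // first tile column covered by the strip
    pub x1: u32,      // one past the last tile column
    pub winding: i32, // winding number at the strip's left edge
}

/// Sort tiles in row-major order, then merge adjacent tiles into strips.
pub fn build_strips(tiles: &mut [Tile]) -> Vec<Strip> {
    tiles.sort_by_key(|t| (t.y, t.x));
    let mut strips: Vec<Strip> = Vec::new();
    let mut winding = 0i32;
    for t in tiles.iter() {
        // A tile extends the current strip if it is in the same row and
        // overlaps or directly follows the strip's last column.
        let extends = matches!(strips.last(), Some(s) if s.y == t.y && t.x <= s.x1);
        if extends {
            let s = strips.last_mut().unwrap();
            s.x1 = s.x1.max(t.x + 1);
        } else {
            // The running winding number restarts at the beginning of each row.
            let new_row = strips.last().map_or(true, |s| s.y != t.y);
            if new_row {
                winding = 0;
            }
            strips.push(Strip { y: t.y, x0: t.x, x1: t.x + 1, winding });
        }
        winding += t.delta;
    }
    strips
}

/// Gaps between strips in the same row are solid when the winding number
/// entering the next strip is nonzero (nonzero fill rule assumed).
/// Returns (row, first column, one past last column) in tile units.
pub fn solid_spans(strips: &[Strip]) -> Vec<(u32, u32, u32)> {
    strips
        .windows(2)
        .filter(|w| w[0].y == w[1].y && w[1].winding != 0 && w[0].x1 < w[1].x0)
        .map(|w| (w[0].y, w[0].x1, w[1].x0))
        .collect()
}
```

Only the strips carry per-pixel alpha; the solid spans are implied by neighboring strips, which is where the run-length compression comes from.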

More details are in the Sparse strips document. Additional background is in the “Scanline meets Vello” doc, and there is a lot of discussion in Zulip threads, the main one being “Sparse strip path rendering”.

The output of the sparse strip step can be consumed in multiple ways; the strips can be rasterized as in Li, they could be adapted to the existing Vello renderer, or a new approach to coarse rasterization could be designed. The approach is considerably more modular than Vello main, as there is a well-defined, efficient representation of the intermediate result (rendered path), rather than having it spread out across multiple stages.

Ideas for conflation free compositing are in the Zulip thread Conflation artifact free compositing. That should also get a tracking issue, but is considerably later on the roadmap.

Milestones

The first milestone is a research result to validate the primary hypothesis: that rendering works and is substantially faster than Vello main. In this first milestone, rendering will be done with a plain draw call that emits two back-to-back triangles for each strip, with the fragment shader reading the alpha values from the strip and outputting a solid color for final blending by the GPU hardware. This prototype will not have clipping or blending capability. Even so, it is a good basis for a research paper.
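As a rough illustration of the two-triangles-per-strip draw, the sketch below emulates on the CPU what a vertex and fragment shader might compute. Everything here is an assumption for illustration: the `StripInstance` layout, the 4 pixel strip height, a 4-vertex triangle strip per instance, and column-major alpha storage are not taken from the prototype.

```rust
/// Strip height in pixels. 4 is assumed here for illustration only.
pub const STRIP_HEIGHT: u32 = 4;

/// Hypothetical per-instance data for one strip.
#[derive(Clone, Copy, Debug)]
pub struct StripInstance {
    pub x: u32,            // left edge in pixels
    pub y: u32,            // top edge in pixels
    pub width: u32,        // width in pixels
    pub alpha_offset: u32, // index of the strip's first alpha value
}

/// Vertex stage: map a vertex index in 0..4 to a corner of the strip's quad.
/// Drawn as a triangle strip, vertices (0, 1, 2) and (1, 2, 3) form the two
/// triangles that cover the quad.
pub fn strip_vertex(strip: &StripInstance, vertex_index: u32) -> [f32; 2] {
    let cx = vertex_index & 1; // 0 = left, 1 = right
    let cy = (vertex_index >> 1) & 1; // 0 = top, 1 = bottom
    [
        (strip.x + cx * strip.width) as f32,
        (strip.y + cy * STRIP_HEIGHT) as f32,
    ]
}

/// Fragment stage: index of the alpha value for pixel (px, py) inside the
/// strip, assuming alphas are stored column by column (STRIP_HEIGHT values
/// per pixel column).
pub fn alpha_index(strip: &StripInstance, px: u32, py: u32) -> u32 {
    strip.alpha_offset + (px - strip.x) * STRIP_HEIGHT + (py - strip.y)
}
```

The fragment shader would multiply the path's solid color by the fetched alpha and leave the final blend to the fixed-function hardware.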

The next milestone is integration into the existing Vello pipeline, with good performance (should be better than Vello main) but not necessarily the end state, and with no regression of capability. A plan is detailed in a Zulip thread.

Another milestone is implementing MSAA. Its importance depends on the application, and it is also needed for conflation-free rendering. The expectation is that the performance penalty for higher MSAA levels will be small (because the algorithm is more efficient, and also because we can adaptively accumulate winding numbers using 4-bit accumulators, see #391).

Farther down the roadmap, it’s tempting to redo coarse rasterization with a second sort. Among other things, that should solve the zoomed-out performance issue (#419), as the steps which are currently serialized and poorly load balanced in that case would become fully parallel.

Things going away

Several things in the existing pipeline go away:

The allocation of the tile array (and thus the tile_alloc stage)
The backdrop stage (currently a performance bottleneck in simple scenes)

Further, line soup could go away, as tile generation could be fused with flattening. Bounding box calculation should probably move from flattening to post strip generation (and can be done with a segmented monoid reduction, see #259).

Estimation should become easier and more reliable as there are fewer data-dependent intermediate data structures. Also, because memory usage is not dependent on bounding boxes, it is less sensitive to rotation transforms (consider a horizontal vs diagonal line).
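The horizontal vs diagonal comparison can be put in numbers. The sketch below compares the tile count of a segment's bounding box with an upper bound on the tiles the segment itself touches. The function names are hypothetical, and the bound (one tile plus one for each grid line crossed) is a standard counting argument rather than anything specific to Vello.

```rust
/// Tile index containing coordinate `v` for square tiles of size `tile`.
fn tile_index(v: f64, tile: f64) -> i64 {
    (v / tile).floor() as i64
}

/// Number of tiles in the bounding box of the segment p0..p1.
/// This is what a bounding-box-driven allocation would scale with.
pub fn bbox_tile_count(p0: (f64, f64), p1: (f64, f64), tile: f64) -> u64 {
    let dx = (tile_index(p1.0, tile) - tile_index(p0.0, tile)).unsigned_abs();
    let dy = (tile_index(p1.1, tile) - tile_index(p0.1, tile)).unsigned_abs();
    (dx + 1) * (dy + 1)
}

/// Upper bound on the tiles the segment touches: the starting tile, plus one
/// new tile for each vertical or horizontal grid line it crosses.
pub fn touched_tile_bound(p0: (f64, f64), p1: (f64, f64), tile: f64) -> u64 {
    let dx = (tile_index(p1.0, tile) - tile_index(p0.0, tile)).unsigned_abs();
    let dy = (tile_index(p1.1, tile) - tile_index(p0.1, tile)).unsigned_abs();
    dx + dy + 1
}
```

With 4x4 tiles, a horizontal line from (0.5, 0.5) to (1023.5, 0.5) has 256 tiles in its bounding box and touches 256. The diagonal from (0.5, 0.5) to (1023.5, 1023.5) has 65536 tiles in its bounding box but touches at most 511, so a representation sized by touched tiles barely changes under rotation.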

Sorting

Sparse strip rendering depends on sorting, so sort performance is critical to overall performance. The initial prototype will be done with WebGPU sort because it’s the simplest option. Better sorting algorithms are definitely possible, and will likely happen organically even if we don’t drive them.
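One way to picture the sort input is a packed integer key per tile, so that a plain integer sort yields row-major tile order. This is a sketch under the assumption that tile coordinates fit in 16 bits each; `pack_key` and `unpack_key` are hypothetical helpers, and a CPU sort stands in for the GPU sort.

```rust
/// Pack tile coordinates into a key whose integer order is row-major:
/// y occupies the high 16 bits, x the low 16 bits.
pub fn pack_key(x: u16, y: u16) -> u32 {
    ((y as u32) << 16) | (x as u32)
}

/// Recover (x, y) from a packed key.
pub fn unpack_key(key: u32) -> (u16, u16) {
    ((key & 0xffff) as u16, (key >> 16) as u16)
}

/// Sort packed keys in place. Stability is not needed for bare keys;
/// a GPU radix sort would play this role in the prototype.
pub fn sort_tiles(keys: &mut [u32]) {
    keys.sort_unstable();
}
```

A real key would likely carry extra payload (for example a path or segment index), which is part of why segmented sort approaches are attractive.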

Thomas Smith has been doing extensive investigation into segmented sort algorithms, and the Zulip thread contains a detailed analysis. It is likely that such a segmented sort approach will be high performance.

Thus, I consider sorting to be fairly low risk even if we find it is a performance concern in the initial prototype.

Performance tuning

In addition to sorting, there are other things that can start out simple and be fine-tuned for performance later. One is the tile size. Initially, 4x4 is simpler (with 1x4 columns internally), but 8x8 (with double sparseness so 2x8 then 2x2) would have approximately half as many tiles to sort, and also half as many strips to render.

Another is the primitive. The prototype will use lines because we have them, but one outcome of the stroke expansion work is that arcs are also viable, and could need considerably fewer segments, thus fewer tiles to sort. One caveat: if the workload is primarily lines, then arcs may not be an improvement.

Subgroups are expected to unlock further performance improvements (particularly the partition-wide prefix sum for column counts for load-balanced work assignment in merge; basically the same operation exists in msaa fine and we have evidence that’s a performance bottleneck).
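The load-balancing pattern mentioned here (prefix sum over column counts, then each work item locating its owner) can be sketched on the CPU as follows. The function names are hypothetical, and a binary search stands in for whatever subgroup-based search the GPU version would use.

```rust
/// Exclusive prefix sum: offsets[i] is the total of counts[..i].
pub fn exclusive_prefix_sum(counts: &[u32]) -> Vec<u32> {
    let mut sum = 0u32;
    counts
        .iter()
        .map(|&c| {
            let start = sum;
            sum += c;
            start
        })
        .collect()
}

/// Given the offsets and a global work item index, find which entry owns it
/// and the item's local index within that entry.
/// Preconditions: `offsets` is non-empty and `work_item` is less than the
/// total of all counts.
pub fn assign_work(offsets: &[u32], work_item: u32) -> (usize, u32) {
    // The owner is the last entry whose offset is <= work_item. Entries with
    // zero count share an offset with their successor and are skipped.
    let owner = offsets.partition_point(|&o| o <= work_item) - 1;
    (owner, work_item - offsets[owner])
}
```

Because every work item does the same amount of work regardless of how columns are distributed, one very wide strip no longer stalls a whole workgroup.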

Retaining rendered paths; glyph caching

One potentially huge advantage of sparse strips is that the rendered path can be retained across scenes, rather than re-rendered from scratch each time. Similarly, if the same rendered path is repeated multiple times, it can be rendered once. This is essentially the same functionality as glyph caching, but potentially more flexible.
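A retained-path cache could look roughly like the sketch below. The `RenderedPath` and `StripCache` types, the `u64` key (imagined as a hash of path geometry plus transform), and the absence of eviction are all assumptions for illustration, not a proposed API.

```rust
use std::collections::HashMap;
use std::rc::Rc;

/// Hypothetical retained result of sparse strip rendering for one path.
#[derive(Debug, Default)]
pub struct RenderedPath {
    pub strips: Vec<(u32, u32, u32)>, // (x, y, width) per strip, in pixels
    pub alphas: Vec<u8>,              // boundary pixel coverage
}

/// Cache of rendered paths, shared across scenes.
#[derive(Default)]
pub struct StripCache {
    entries: HashMap<u64, Rc<RenderedPath>>,
    pub renders: usize, // number of cache misses, for illustration
}

impl StripCache {
    pub fn new() -> Self {
        Self::default()
    }

    /// Return the retained path for `key`, rendering it only on a miss.
    pub fn get_or_render<F>(&mut self, key: u64, render: F) -> Rc<RenderedPath>
    where
        F: FnOnce() -> RenderedPath,
    {
        if let Some(hit) = self.entries.get(&key) {
            return Rc::clone(hit);
        }
        self.renders += 1;
        let rendered = Rc::new(render());
        self.entries.insert(key, Rc::clone(&rendered));
        rendered
    }
}
```

Unlike a glyph atlas, the cached value is a pair of flat buffers, so retaining it needs only linear allocation rather than 2D packing.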

Glyph caching becomes very inefficient as glyphs scale in size, as it’s bound to a dense representation. In addition, as glyphs scale, rasterization can become more efficient as it’s sparser (fewer fragment shaders assigned to zero alpha). The allocation problem also becomes easier because it’s not necessary to do 2d allocation in a texture atlas.

Exploring the performance benefits of glyph caching is a major motivation for moving to sparse strips.

Summary

Sparse strips are difficult and a fair amount of work, but will yield higher performance, unlock new capabilities (especially conflation free compositing), and offer considerably better integration into other systems. I believe they are the future of GPU path rendering.
