linebender / vello

An experimental GPU compute-centric 2D renderer.
http://linebender.org/vello/
Apache License 2.0

Plan for multisampled path rendering #270

Open raphlinus opened 1 year ago

raphlinus commented 1 year ago

This issue outlines a plan for fast rendering of paths, solving conflation artifacts. The results should match MSAA (multisampled antialiasing) closely, but the antialiasing is done in compute rather than relying on GPU hardware for MSAA resolve. Thus, results are consistent across a wide range of physical devices.

Conflation artifacts are described in section 4.1.2 of Kilgard and Bolz, and also discussed in some detail in #49. Draft PR #64 explores a supersampling approach but is not suitable for stroking because it only implements the even-odd winding rule. This proposal focuses on the nonzero winding number case.

This proposal only addresses conflation artifacts within a path, so the result is an alpha mask. It does not address conflation in compositing. Changes to clipping are optional.

Changes to coarse rasterization

For the most part, this proposal is focused on changes to fine rasterization. However, for best performance, coarse rasterization should change to provide path segments inline and contiguous with the per-tile command list, rather than the current linked-list representation. For prototyping, this could be done as a separate stage, but it is also part of the "stroke rework" plan.

Outline of approach

The fundamental problem is computing an integer winding number for each sample of each pixel in the tile. We'll use 8x MSAA for concreteness, but the ideas work for other scaling factors.

The overall approach is sparse and hierarchical. The "backdrop," which is the winding number of the top left corner of the tile, is computed in coarse rasterization and passed in as a parameter in the per-tile command list. The hierarchy continues: a winding number is computed for the top left of each pixel, then within a pixel there are 8 per-sample winding number values. The total winding number of a sample is the sum of these three values.

The accumulation of winding numbers is done using atomics. Because there are so many values to compute (2048 for 8x MSAA in a 16x16 tile), the per-sample winding numbers are packed 4 to a u32 word, allowing 8 bits of range. For pathological input paths which generate a winding number more than 128 in magnitude, artifacts can be expected, but this is also the case with existing stencil-and-cover renderers that use an 8 bit stencil buffer (including Skia's partially GPU-accelerated path rendering), and to my knowledge these artifacts have not been observed in the wild.
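
To make the packing concrete, here is a CPU-side sketch in Rust (the lane layout and names are illustrative, not a committed design):

    // Hypothetical CPU-side sketch of the packed accumulation.
    fn add_winding(word: u32, lane: u32, sign: i32) -> u32 {
        // A +-1 contribution to one 8-bit lane; on the GPU this would be an
        // atomicAdd. A lane overflowing its 8-bit range carries into the
        // neighboring lane, which is the artifact expected for winding
        // numbers of magnitude over 128.
        word.wrapping_add((sign as u32).wrapping_mul(1u32 << (8 * lane)))
    }

    fn main() {
        // 16x16 tile at 8x MSAA: 2048 winding values packed 4 per u32
        // is 512 words of atomic storage per tile.
        assert_eq!(16 * 16 * 8 / 4, 512);
        let mut w = 0u32;
        w = add_winding(w, 2, 1);
        w = add_winding(w, 2, 1);
        w = add_winding(w, 2, -1);
        assert_eq!((w >> 16) & 0xff, 1); // lane 2 accumulated +1
    }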

Unlike the current fine rasterizer which is dense (every pixel computes the same program), the new plan is sparse and relies heavily on cooperation between threads.

The outline of the approach is:

- Loop over the path segments in wg-size chunks
- Count the pixels touched by each line segment
- Loop over the total pixel count in wg-size chunks
- Binary search to find the segment and pixel
- Fetch the half-plane mask from a texture or buffer
- AND with the y coverage mask
- Expand the 8 bit mask to 2x u32s
- Atomically add the expanded bit masks to the sample winding numbers
- Add backdrop + pixel winding number + sample winding number
- Apply the nonzero winding number rule and count the bits in the resulting mask

Subsections below expand on many of the steps in this outline.

Loop over the path segments in wg-size chunks

The loop needs to run (n_path_segments + wg_size - 1) / wg_size times, i.e. ceil(n_path_segments / wg_size). Care needs to be taken that this count is uniform (the prefix sum below has barriers and thus needs to be in uniform control flow), but that might not be a problem, as n_path_segments is fetched from the ptcl buffer, which is bound read-only.

The segment index is loop index * wg_size + local_ix. The segment load needs to be gated on segment index < n_path_segments.
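
A scalar sketch of the loop shape (Rust; WG_SIZE is a placeholder, and the inner loop stands in for the workgroup's threads):

    // The outer loop count is computed from read-only data, so it is the
    // same for every thread and barriers inside stay in uniform control flow.
    const WG_SIZE: usize = 256;

    fn process_segments(n_path_segments: usize) {
        let n_chunks = (n_path_segments + WG_SIZE - 1) / WG_SIZE;
        for chunk in 0..n_chunks {
            // On the GPU, each local_ix is one thread of the workgroup.
            for local_ix in 0..WG_SIZE {
                let segment_ix = chunk * WG_SIZE + local_ix;
                // Gate the load: the last chunk may be partially full.
                if segment_ix < n_path_segments {
                    // load and process segment `segment_ix` here
                }
            }
        }
    }

    fn main() {
        process_segments(1000); // 1000 segments -> 4 chunks of 256 threads
    }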

Count pixels touched by line segment

This is basically $1 + | \lfloor x_1 \rfloor - \lfloor x_0 \rfloor | + | \lfloor y_1 \rfloor - \lfloor y_0 \rfloor |$.

Also set up segment parameters for later steps here.
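
A direct transcription of the count (Rust sketch, assuming the endpoints are already in tile-local pixel coordinates):

    fn pixels_touched(x0: f32, y0: f32, x1: f32, y1: f32) -> u32 {
        let dx = (x1.floor() - x0.floor()).abs();
        let dy = (y1.floor() - y0.floor()).abs();
        1 + dx as u32 + dy as u32
    }

    fn main() {
        // From (0.5, 0.5) to (2.5, 1.5): 1 + 2 + 1 = 4 pixels touched.
        assert_eq!(pixels_touched(0.5, 0.5, 2.5, 1.5), 4);
    }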

Loop over total pixel count in wg-size chunks

Use this pattern:

    for (var i = local_ix; i < total_pixel_count; i += wg_size) {
        ...
    }

Binary search to find segment and pixel

The prefix sum of per-segment pixel counts is in a shared memory array; the exclusive prefix sum represents the start index of each segment. Binary search this array, using the index as a probe. That identifies a segment. The index minus the start index is the pixel number within that segment (pixel_ix).
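
In Rust terms, the search might look like this (a sketch over an inclusive prefix sum; the real version probes the shared memory array):

    // Returns (segment index, pixel index within that segment), where
    // `sums` is the inclusive prefix sum of per-segment pixel counts.
    fn find_segment(sums: &[u32], i: u32) -> (usize, u32) {
        // First segment whose inclusive sum exceeds i; its start index
        // (the exclusive prefix sum) is sums[seg - 1], or 0 for seg == 0.
        let seg = sums.partition_point(|&s| s <= i);
        let start = if seg == 0 { 0 } else { sums[seg - 1] };
        (seg, i - start)
    }

    fn main() {
        // Three segments touching 3, 1, and 4 pixels respectively.
        let sums = [3, 4, 8];
        assert_eq!(find_segment(&sums, 0), (0, 0));
        assert_eq!(find_segment(&sums, 3), (1, 0));
        assert_eq!(find_segment(&sums, 5), (2, 1));
    }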

The x, y pixel location can then be computed:

    let z = i32(floor(segment.a * f32(pixel_ix) + segment.b));
    let x = segment.x0 + pixel_ix - z;
    let y = segment.y0 + segment.y_sign * z;

The segment struct values need to be pre-computed to make this work. I have the math prototyped but not yet written up. Care is needed to ensure numerical robustness.

Fetch half-plane mask from texture/buffer

This is basically as in #64, the getLut() function. However, it might be done as a buffer load, and additionally these may be stored 4 to a u32.

Also note this is very similar to the approach of Li et al.
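
For reference, the quantity the LUT caches can be computed brute force; here is a Rust sketch with a hypothetical sample pattern (the real positions are baked into the LUT):

    // 8-bit mask of which MSAA samples satisfy a*x + b*y < c.
    fn half_plane_mask(a: f32, b: f32, c: f32, samples: &[(f32, f32); 8]) -> u32 {
        let mut mask = 0u32;
        for (i, &(sx, sy)) in samples.iter().enumerate() {
            if a * sx + b * sy < c {
                mask |= 1 << i;
            }
        }
        mask
    }

    fn main() {
        // Samples along a diagonal; the half-plane x < 0.5 covers the first four.
        let samples: [(f32, f32); 8] =
            core::array::from_fn(|i| (i as f32 / 8.0, i as f32 / 8.0));
        assert_eq!(half_plane_mask(1.0, 0.0, 0.5, &samples), 0x0f);
    }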

AND with y coverage mask

This is as in #64, the vertLut() function and the calculation of y_range.

Expand 8 bit mask to 2x u32s

    // input: m: u32 has mask in lower 8 bits
    // spread the bits so each byte lane holds one: a0 gets mask bits
    // 0, 2, 4, 6 and a1 gets bits 1, 3, 5, 7
    let a = (m << 18) | (m << 12) | (m << 6) | m;
    let a0 = a & 0x1010101;
    let a1 = (a >> 1) & 0x1010101;

Multiply by the sign of the winding number. Note: integer multiplication on GPUs can be slow; consider storing the sign as b = 0 for s = +1 and b = -1 for s = -1, so that x * s can be computed as (x ^ b) - b.
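
Putting the expansion and the sign trick together (Rust sketch; note that negating the whole word relies on the 0x80 bias applied at resolve time to make the byte lanes decode correctly):

    fn expand_and_sign(m: u32, sign: i32) -> (u32, u32) {
        // a0 holds mask bits 0, 2, 4, 6 and a1 holds bits 1, 3, 5, 7,
        // one bit per byte lane.
        let a = (m << 18) | (m << 12) | (m << 6) | m;
        let a0 = a & 0x0101_0101;
        let a1 = (a >> 1) & 0x0101_0101;
        // b = 0 for sign = +1, b = -1 (all ones) for sign = -1;
        // (x ^ b) - b negates x without an integer multiply.
        let b = if sign < 0 { u32::MAX } else { 0 };
        ((a0 ^ b).wrapping_sub(b), (a1 ^ b).wrapping_sub(b))
    }

    fn main() {
        // Mask bits 0 and 1 with sign -1: both lane-0 bytes carry -1.
        let (a0, a1) = expand_and_sign(0b0000_0011, -1);
        assert_eq!(a0, 1u32.wrapping_neg());
        assert_eq!(a1, 1u32.wrapping_neg());
    }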

Atomic add expanded bit masks to sample winding number

This is just two atomic adds to the sample winding number array, targeting the (x, y) pixel location.

Add backdrop + pixel winding number + sample winding number

This can be done 4 to a u32. Fetch both atomic words. Expand the backdrop + pixel winding number, effectively multiplying the scalar by 0x1010101.

A good way to deal with sign and overflow is to bias the scalar by 0x80 before expanding. Then each byte is a value 0..255 instead of -128..127.

Apply nonzero winding number rule; Count bits in resulting mask

    // bytes of a0/a1 hold biased winding numbers; undo the 0x80 bias
    // (back to two's complement, so a byte is nonzero iff its winding is)
    a0 ^= 0x80808080;
    a1 ^= 0x80808080;
    // fold each byte's bits together, interleaving a0 into even bit
    // positions and a1 into odd ones
    a0 |= a0 >> 1;
    a1 |= a1 << 1;
    var b = (a0 & 0x55555555) | (a1 & 0xAAAAAAAA);
    b |= b >> 2;
    b |= b >> 4;
    // bits 0 and 1 of each byte now flag one nonzero sample each
    return countOneBits(b & 0x3030303);

There might be a slightly better way to do this exploiting carries, but this is pretty tight already.
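
Pulling the last three steps together as a CPU-side reference (Rust; the names are mine, and the result can be checked against a naive per-sample loop):

    // `a0`/`a1` are the two atomic words for one pixel (one signed
    // per-sample winding delta per byte); `w` is backdrop + pixel winding.
    fn coverage(a0: u32, a1: u32, w: i32) -> u32 {
        // Bias the scalar by 0x80 and expand, so each byte is 0..255.
        let expanded = ((w + 0x80) as u32).wrapping_mul(0x0101_0101);
        let mut a0 = a0.wrapping_add(expanded);
        let mut a1 = a1.wrapping_add(expanded);
        // Remove the bias; a byte is now nonzero iff its winding number is.
        a0 ^= 0x8080_8080;
        a1 ^= 0x8080_8080;
        // Fold each byte's bits together and count the nonzero bytes.
        a0 |= a0 >> 1;
        a1 |= a1 << 1;
        let mut b = (a0 & 0x5555_5555) | (a1 & 0xAAAA_AAAA);
        b |= b >> 2;
        b |= b >> 4;
        (b & 0x0303_0303).count_ones()
    }

    fn main() {
        // Sample 0 carries a -1 delta; with backdrop + pixel winding +1,
        // its total winding is 0 and the other 7 samples are covered.
        assert_eq!(coverage(1u32.wrapping_neg(), 0, 1), 7);
        assert_eq!(coverage(0, 0, 0), 0);
    }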

Clipping

A clip mask could be stored as a mask value, rather than as an alpha value for Porter-Duff compositing as now. Much of the operation is as above, but additionally AND in the clip mask before resolving, to produce a final alpha value.

A clip stack could potentially be stored as one stack depth value per MSAA sample, requiring O(log n) bits rather than O(n) as for the blend stack now. This technique is used in the Contrast renderer, cited below.

References

- Kilgard, M. J., and Bolz, J. "GPU-accelerated Path Rendering." ACM Transactions on Graphics 31(6), 2012.
- Li, R., Hou, Q., and Zhou, K. "Efficient GPU Path Rendering Using Scanline Rasterization." ACM Transactions on Graphics 35(6), 2016.
- Contrast renderer: https://github.com/Lichtso/contrast_renderer

raphlinus commented 1 year ago

A couple more thoughts. First, here's the relevant function from the JavaScript implementation to prove out the parallel line rendering:

    calc_line() {
        // Endpoints in grid (pixel) coordinates.
        let x1 = this.p1.x / this.grid;
        let y1 = this.p1.y / this.grid;
        let x2 = this.p2.x / this.grid;
        let y2 = this.p2.y / this.grid;
        // Ensure the segment runs left to right.
        if (x2 < x1) {
            [x1, y1, x2, y2] = [x2, y2, x1, y1];
        }
        const x1i = Math.floor(x1);
        const y1i = Math.floor(y1);
        const x2i = Math.floor(x2);
        const y2i = Math.floor(y2);
        const dx = x2 - x1;
        const dy = Math.abs(y2 - y1);
        const idxdy = 1 / (dx + dy);
        // Parameters of the pixel walk z = floor(a * i + b).
        const a = dy * idxdy;
        const sign = Math.sign(y2 - y1);
        let c;
        if (y2 > y1) {
            c = y1 - y1i;
        } else {
            c = 1 + y1i - y1;
        }
        const b = (dx * c + dy * (x1i + 1 - x1)) * idxdy;
        // Total number of pixels touched (see the count formula above).
        const n = 1 + Math.abs(x2i - x1i) + Math.abs(y2i - y1i);
        const result = [];
        for (let i = 0; i < n; i++) {
            const z = Math.floor(a * i + b);
            const x = x1i + i - z;
            const y = y1i + sign * z;
            result.push({x, y});
        }
        return result;
    }

The part inside the for (let i) loop is clearly parallelizable, as (in rather sharp contrast to Bresenham-like approaches) there is no dependency from one iteration of the loop to the next. Attached is a screenshot of my interactive testing tool (not yet published).

The second observation is that when the winding number for the tile is known never to exceed ±7, the per-sample winding numbers can be packed 8 to a u32, halving the number of atomics. This holds when the number of line segments in the tile is less than 8, which can be detected readily and uniformly per tile (no divergence concerns), and is probably quite common. This should also make 16x MSAA more practical, and decrease the gap with even-odd rendering (thus reducing the motivation to analyze paths further upstream to determine whether even-odd and nonzero would render the same).
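
A Rust sketch of the corresponding mask expansion, spreading the 8 bits into 4-bit lanes (one possible bit trick; the layout is illustrative):

    // Spread the 8 mask bits into 4-bit lanes so a whole pixel's samples
    // fit one u32, halving the atomic traffic. With fewer than 8 segments
    // in the tile, each lane stays within the signed 4-bit range -8..7.
    fn expand_to_nibbles(m: u32) -> u32 {
        let mut x = m & 0xff;
        x = (x | (x << 12)) & 0x000f_000f;
        x = (x | (x << 6)) & 0x0303_0303;
        (x | (x << 3)) & 0x1111_1111
    }

    fn main() {
        // Mask bits 0, 2, and 7 each land in their own nibble.
        assert_eq!(expand_to_nibbles(0b1000_0101), 0x1000_0101);
    }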

[Screenshot: interactive line-rendering testing tool, 2023-02-03]