stb_truetype enhancement: more elegant curve flattening

this may be helpful on performance when we want more precision (less tolerance when flattening curves).

Flattening quadratic Béziers by Raph Levien

Is performance a problem?

Is performance a problem?

yes and no, depending on what's rendered. for example, a page of news written in traditonal chinese (or even egyptian hieroglyph).

the number of distinct characters used in a page may be larger
the number of strokes (very common to be denser than pixels on low-dpi devices)
the precision desired (in fonts of particular styles, the stroke may be thin that might need tolerance less than 0.35f) these together will likely make some trouble.

(also hinting is usually used for even better quality, but that's another story. still, pixelated chinese fonts (hand crafted for each small sizes) are also widely used solution)

an extreme case is viewing chinese character charts where no duplicate work could be cached, though it's not common usage. i haven't tested on this on any stb_truetype powered application yet, but on windows notepad, chrome and edge, sensible lag could be produced with not-so-large chart at normal text size.

and embedded devices, you know. small code for small devices.

after all, it's an enhancement implemented in tens of lines of code (not much longer than recursive subdivision) that brings predictable saving (usually somewhere between 10% to 40% less segments produced). when the case comes, it just helps.

edit: basically here's almost everything needed

// Compute an approximation to int (1 + 4x^2) ^ -0.25 dx
// This isn't especially good but will do.
function approx_myint(x) {
   const d = 0.67; 
   return x / (1 - d + Math.pow(Math.pow(d, 4) + 0.25 * x * x, 0.25));
}
// Approximate the inverse of the function above.
// This is better.
function approx_inv_myint(x) {
    const b = 0.39;
    return x * (1 - b + Math.sqrt(b * b + 0.25 * x * x));
}
// below in class QuadBez{
    // Determine the x values and scaling to map to y=x^2
    map_to_basic() {
        const ddx = 2 * this.x1 - this.x0 - this.x2;
        const ddy = 2 * this.y1 - this.y0 - this.y2;
        const u0 = (this.x1 - this.x0) * ddx + (this.y1 - this.y0) * ddy;
        const u2 = (this.x2 - this.x1) * ddx + (this.y2 - this.y1) * ddy;
        const cross = (this.x2 - this.x0) * ddy - (this.y2 - this.y0) * ddx;
        const x0 = u0 / cross;
        const x2 = u2 / cross;
        // There's probably a more elegant formulation of this...
        const scale = Math.abs(cross) / (Math.hypot(ddx, ddy) * Math.abs(x2 - x0));
        return {x0: x0, x2: x2, scale: scale, cross: cross};
    }
    // Subdivide using fancy algorithm.
    my_subdiv(tol) {
        const params = this.map_to_basic();
        const a0 = approx_myint(params.x0);
        const a2 = approx_myint(params.x2);
        const count =  0.5 * Math.abs(a2 - a0) * Math.sqrt(params.scale / tol);
        const n = Math.ceil(count);
        const u0 = approx_inv_myint(a0);
        const u2 = approx_inv_myint(a2);
        let result = [0];
        for (let i = 1; i < n; i++) {
            const u = approx_inv_myint(a0 + ((a2 - a0) * i) / n);
            const t = (u - u0) / (u2 - u0);
            result.push(t);
        }
        result.push(1);
        return result;
    }

Is there any evidence that the number of polylines would be the decisive factor in any performance issues?

Like, basically, this is never going to happen because it doesn't seem very important and who wants to create a bunch of work for themselves for no reason. stb libraries don't generally claim to be the most performant solutions around, they're primary goal is ease of use, so it just doesn't sound like it's important enough to be worth the work.

i ported the code and did some test with a widely used font (Microsoft Yahei, a sans-serif UI font, quadratic curves), rendering the Chinese range \u4e00-\u9fa5 each once (stbtt_MakeCodepointBitmap).

ubuntu 18.04, gcc 7.5.0, compiled with with -Ofast.

depending on tolerance (0.01f to 0.35f) and different size (12px to 64px), it ranges from being ~5% slower to ~25% faster. it wins more on smaller tolerance (~20% on0.01f, ~10% on 0.1f) case, indicating the number of polylines actually impacts performance. for larger tolerance, they're roughly on par.

it's worth noting that this font looks quite serious and the majority of contour are straight and axis-aligned. fonts that are more stylistic might enlarge the difference.

have a try?

diff --git a/stb_truetype.h b/stb_truetype.h
index bbf2284..5d66b2d 100644
--- a/stb_truetype.h
+++ b/stb_truetype.h
@@ -3572,6 +3572,59 @@ static int stbtt__tesselate_curve(stbtt__point *points, int *num_points, float x
    return 1;
 }

+// Flattening quadratic Beziers by Raph Levien
+// https://raphlinus.github.io/graphics/curves/2019/12/23/flatten-quadbez.html
+
+static float approx_myint(float x)
+{
+   const float d = 0.67f, dddd = d*d*d*d;
+   return x / (1.0f - d + STBTT_sqrt(STBTT_sqrt(dddd + 0.25f * x * x)));
+}
+
+static float approx_inv_myint(float x)
+{
+   const float b = 0.39f, bb = b*b;
+   return x * (1.0f - b + STBTT_sqrt(bb + 0.25f * x * x));
+}
+
+static int stbtt__tesselate_quad_raph(stbtt__point *points, int *num_points, float x0, float y0, float x1, float y1, float x2, float y2, float objspace_flatness)
+{
+   int i,n=1;
+   float u,t,ct,ex,ey;
+   // map_to_basic
+   float ddx = 2 * x1 - x0 - x2;
+   float ddy = 2 * y1 - y0 - y2;
+   float u0 = (x1 - x0) * ddx + (y1 - y0) * ddy;
+   float u2 = (x2 - x1) * ddx + (y2 - y1) * ddy;
+   float cross = (x2 - x0) * ddy - (y2 - y0) * ddx;
+   float bx0 = u0 / cross;
+   float bx2 = u2 / cross;
+   float scale = STBTT_fabs(cross) / (STBTT_sqrt(ddx*ddx+ddy*ddy) * STBTT_fabs(bx2 - bx0));
+   if (cross>0.001f && scale>0.001f) { // scale*cross?
+      // subdivide
+      float a0 = approx_myint(bx0);
+      float a2 = approx_myint(bx2);
+      n = STBTT_iceil(0.5f * STBTT_fabs(a2 - a0) * STBTT_sqrt(scale / objspace_flatness));
+      u0 = approx_inv_myint(a0); // reuse
+      u2 = approx_inv_myint(a2); // reuse
+      float a2a0 = a2 - a0;
+      float invu2u0 = 1.0f / (u2 - u0);
+      float invn = 1.0f / n;
+      for (i=1; i < n; ++i) {
+         u = approx_inv_myint(a0 + (a2a0 * i) * invn);
+         // Map x parameter back to t parameter for the original segment.
+         t = (u - u0) * invu2u0;
+         ct = 1.0f - t;
+         ex = t*t*x0 + 2.0f*t*ct*x1 + ct*ct*x2;
+         ey = t*t*y0 + 2.0f*t*ct*y1 + ct*ct*y2;
+         stbtt__add_point(points, *num_points + i - 1, ex,ey);
+      }
+   }
+   stbtt__add_point(points, *num_points + n - 1, x2,y2);
+   *num_points = *num_points + n;
+   return 1;
+}
+
 static void stbtt__tesselate_cubic(stbtt__point *points, int *num_points, float x0, float y0, float x1, float y1, float x2, float y2, float x3, float y3, float objspace_flatness_squared, int n)
 {
    // @TODO this "flatness" calculation is just made-up nonsense that seems to work well enough
@@ -3664,10 +3717,17 @@ static stbtt__point *stbtt_FlattenCurves(stbtt_vertex *vertices, int num_verts,
                stbtt__add_point(points, num_points++, x, y);
                break;
             case STBTT_vcurve:
+            #if 0
                stbtt__tesselate_curve(points, &num_points, x,y,
                                         vertices[i].cx, vertices[i].cy,
                                         vertices[i].x,  vertices[i].y,
                                         objspace_flatness_squared, 0);
+            #else
+               stbtt__tesselate_quad_raph(points, &num_points, x,y,
+                                        vertices[i].cx, vertices[i].cy,
+                                        vertices[i].x,  vertices[i].y,
+                                        objspace_flatness);
+            #endif
                x = vertices[i].x, y = vertices[i].y;
                break;
             case STBTT_vcubic:
@@ -3710,6 +3770,8 @@ STBTT_DEF void stbtt_FreeBitmap(unsigned char *bitmap, void *userdata)
    STBTT_free(bitmap, userdata);
 }

+#define STBTT_FLATTEN_TOLERANCE 0.01f
+
 STBTT_DEF unsigned char *stbtt_GetGlyphBitmapSubpixel(const stbtt_fontinfo *info, float scale_x, float scale_y, float shift_x, float shift_y, int glyph, int *width, int *height, int *xoff, int *yoff)
 {
    int ix0,iy0,ix1,iy1;
@@ -3743,7 +3805,7 @@ STBTT_DEF unsigned char *stbtt_GetGlyphBitmapSubpixel(const stbtt_fontinfo *info
       if (gbm.pixels) {
          gbm.stride = gbm.w;

-         stbtt_Rasterize(&gbm, 0.35f, vertices, num_verts, scale_x, scale_y, shift_x, shift_y, ix0, iy0, 1, info->userdata);
+         stbtt_Rasterize(&gbm, STBTT_FLATTEN_TOLERANCE, vertices, num_verts, scale_x, scale_y, shift_x, shift_y, ix0, iy0, 1, info->userdata);
       }
    }
    STBTT_free(vertices, info->userdata);
@@ -3769,7 +3831,7 @@ STBTT_DEF void stbtt_MakeGlyphBitmapSubpixel(const stbtt_fontinfo *info, unsigne
    gbm.stride = out_stride;

    if (gbm.w && gbm.h)
-      stbtt_Rasterize(&gbm, 0.35f, vertices, num_verts, scale_x, scale_y, shift_x, shift_y, ix0,iy0, 1, info->userdata);
+      stbtt_Rasterize(&gbm, STBTT_FLATTEN_TOLERANCE, vertices, num_verts, scale_x, scale_y, shift_x, shift_y, ix0,iy0, 1, info->userdata);

    STBTT_free(vertices, info->userdata);
 }

the numeric guards might need some care.

The thing is, you're asking me to replace code that I know works with code that I don't know works. Maybe it works correctly, but maybe there's some corner case where it performs badly at that we haven't tested. So there needs to be a big benefit for it to be worth that risk.

So, for example, handwaving the guards against divide by zeroes isn't very satisfactory. In the original code, map_to_basic was entirely unguarded! In the new code, you never actually compute the second denominator as a standalone value, you just check that the result is above a threshhold, so I'm not sure why that can't blow up to infinity. Maybe it can't and deeper analysis would show that, but...

But the point is I don't want to spend the time looking at the web page and looking at your code and testing it and evaluating it and doing a deeper analysis if there isn't a significant benefit.

// one more patch
float scale = cross / (ndd * STBTT_fabs(u2 - u0)) * cross; // elegant

yes i admit that numerical guards here might be bit of difficult. solely guarding against cross or scale isn't reliable. my deeper experimental analysis suggests that throughout the algorithm there are at least 3 singularities for the off-curve point (two endpoint and a midpoint) and i can't find a clever way to handle them all yet (tried area/perimeter but the perimeter calculation might be too expensive). will later see if there is one...

nothings / stb

stb_truetype enhancement: more elegant curve flattening #1473