Add support for running BiDi algorithm on text with styling annotations.

ChrisLoer commented 6 years ago

Upgrade to Emscripten 1.38 (requires explicit WASM=0 to build asm.js)
Upgrade to ICU 62.1
Adds processStyledBidirectionalText that takes an array of style indices in parallel to the input text and returns results annotated with the correct (reordered) indices
Add basic test for processStyledBidirectionalText that that splits contiguous input style sections into discontinguous output sections
Regenerate checked-in build outputs.

Note the test strings in arabic.test.js are mind-bending and may break layout in your text viewer (that includes GitHub's preview: those arrays that look broken aren't).

The previous approach (which we still use for unstyled text) was:

Run BiDi on the whole paragaph
Determine line breaks
Run ICU's setLine with chosen line break points to prepare individual lines
Run ICU's writeReordered to correctly transform/reorder all logical runs within the line into visual runs.

The new approach for styled lines is:

Run BiDi on the whole paragaph
Determine line breaks
Run ICU's setLine with chosen line break points to prepare individual lines
Use ICU's getVisualRun to iterate over directional runs within a line in visual order. For each run:
- if it's LTR, just copy it into the output, no transformations are needed
- if it's RTL, iterate over it logically backwards (visually forwards), breaking into segments by style. For each segment, use ICU's writeReverse to reverse the section and apply transformations. This may change the length of the segment, but since the whole segment shares the same style, we can just apply that style to every character in whatever the reversed output is. Note that writeReverse is responsible for keeping combining characters together in the right (aka still "logical") order. If you changed style mid-combining character, this algorithm could move the combining character to the wrong place, and you would deserve whatever happened to you.

One open question in my mind is whether the style annotation interface makes sense (an array of annotations maintained in parallel with a flat string). Alternatively, we could represent the input in terms of sections (aka instead of representing a section "foo" with style 0 as ["foo", [0, 0, 0]], we could represent it as ["foo",0]). Internally, it was easier for me to work with a data structure that made it super-easy to go from "source string logical index" -> "style", but it'd be easy enough to do the conversion. The thing is, I'm not sure an annotated-per-section interface is actually easier to work with, because the caller then has to keep track of iterating over sections too, and it's confusing that a contiguous input section can actually be split up into multiple discontiguous output sections...

/cc @anandthakker

ChrisLoer commented 6 years ago

Oh also this is meant to be backwards compatible with gl-js in both directions. Old versions that load this should just ignore processStyledBidirectionalText. New versions of gl-js that load an old version of the rtl-text plugin will still be able to run BiDi for unstyled text, but won't run it for styled text (see shaping.js logic at https://github.com/mapbox/mapbox-gl-js/pull/6994).

ChrisLoer commented 6 years ago

It would presumably take less space, though, right? Not sure if the difference would be enough to matter...

Yeah, I'm going to start with the assumption that it doesn't matter. We're not holding onto these arrays, so the cost would just show up... I dunno, maybe with the wasm version of mapbox-gl-rtl-text the overhead of transferring the array would be greater? But it seems reasonable to ignore unless it actually shows up as significant in profiling.

ChrisLoer commented 6 years ago

Huh, ICU's download site appears to be down, which breaks the build. I'll try again later.

mapbox / mapbox-gl-rtl-text

Add support for running BiDi algorithm on text with styling annotations. #10