Open Manishearth opened 1 year ago
Are level runs in an isolating run sequence always adjacent? This would be simpler to reason about if we always treated isolating run sequences as a single range instead of breaking them down further.
The answer is "no", and the spec has something to say about this:
When applying a rule to an isolating run sequence, the last character of each level run in the isolating run sequence is treated as if it were immediately followed by the first character in the next level run in the sequence, if any.
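In code, that rule amounts to walking the runs back to back. A minimal sketch, assuming an isolating run sequence is stored as an ordered list of byte ranges (the `IsolatingRunSequence` struct and `indices` helper below are illustrative stand-ins, not necessarily the crate's actual definitions):

```rust
use std::ops::Range;

/// Illustrative stand-in for an isolating run sequence: an ordered list of
/// level runs, each a byte range into the text. Runs need not be adjacent.
struct IsolatingRunSequence {
    runs: Vec<Range<usize>>,
}

impl IsolatingRunSequence {
    /// Visit every index of every run, in order, so that the last index of
    /// one run is immediately followed by the first index of the next run.
    /// Note that these are byte indices, so off-bytes are visited too.
    fn indices(&self) -> impl Iterator<Item = usize> + '_ {
        self.runs.iter().flat_map(Clone::clone)
    }
}
```

With runs `0..2` and `5..7`, this visits 0, 1, 5, 6: index 1 is followed directly by index 5, and the gap between the runs is skipped.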
This thread reminds me a bit of https://github.com/google/diff-match-patch/pull/13 and format_to_parts.md and others. Basically, what is the best way to store character annotations in UTF-8 or UTF-16?
A few general approaches:
This is something I kinda discovered whilst working on https://github.com/servo/unicode-bidi/pull/85
The majority of the bidi algorithm involves taking the set of original bidi classes, going through them, and overriding some of them. To do this, unicode-bidi passes around a `processing_classes: &mut [BidiClass]` that starts out as a clone of `original_classes: &[BidiClass]`, a slice mapping byte indices to the classes of each character.

Of course, non-character-boundary byte indices (I'm going to call these "off-bytes") are kinda meaningless in the context of the bidi algorithm. `original_classes` starts out with copied bidi classes for these, but we have to be careful about both maintaining this property and also not accidentally treating byte indices as property indices.

Further, the code is inconsistent about iterating over characters or bytes, and it's tricky to see if it's updating off-bytes consistently.
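To make the layout concrete, here is a minimal sketch of how a byte-indexed class array with copied off-byte entries could be built. The `BidiClass` enum is cut down to four variants and `toy_class` is a made-up classifier for illustration only; neither reflects the real Unicode data or the crate's actual code.

```rust
/// Cut-down stand-in for the real bidi class enum (illustration only).
#[derive(Clone, Copy, Debug, PartialEq)]
enum BidiClass {
    L,
    R,
    EN,
    ON,
}

/// Toy classifier, NOT the real Unicode bidi class data.
fn toy_class(c: char) -> BidiClass {
    match c {
        '0'..='9' => BidiClass::EN,
        '\u{05D0}'..='\u{05EA}' => BidiClass::R, // Hebrew letters
        c if c.is_alphabetic() => BidiClass::L,
        _ => BidiClass::ON,
    }
}

/// Build one entry per *byte* of `text`. Every byte of a multi-byte
/// character, including its off-bytes, gets a copy of that character's class.
fn byte_indexed_classes(text: &str) -> Vec<BidiClass> {
    let mut classes = Vec::with_capacity(text.len());
    for c in text.chars() {
        let class = toy_class(c);
        classes.extend(std::iter::repeat(class).take(c.len_utf8()));
    }
    classes
}
```

For `"a\u{05D0}1"` (four bytes, three characters) this gives `[L, R, R, EN]`: index 2 is an off-byte that merely mirrors index 1. Any later pass that overwrites index 1 but not index 2 silently breaks the property.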
Analysis
This is a writeup of my process walking through the code to see what breaks due to this. TLDR: the property is maintained (often rather subtly) but iterating over bytes causes at least one bug (https://github.com/servo/unicode-bidi/pull/87), and it also makes stuff annoying to reason about when editing the code. Feel free to take my word for it and skip this section.
Moving forward
I think there are a couple issues here. We have at least one bug, and this property is excruciating to reason about and rather fragile. Furthermore, we're iterating over a lot of unnecessary bytes, which might be extra work?
We can do a couple of things here:

- […] `char_indices()` everywhere, and just letting the off-bytes in the classes arrays be dead space. This would mostly simplify the algorithms, but at the cost of using character indexing, which is a bit more expensive. On the other hand, not needing to iterate over each and every byte might be a win. Worth looking into from a perf level.
- […] `processing_classes` in a way that makes per-byte mutation impossible (while still maintaining easy read indexing), and also pepper the code with comments about this so that additional state in loops is maintained correctly.

Thoughts? @eggrobin @sffc @mbrubeck @behnam
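As a rough illustration of the "per-byte mutation impossible" idea, here is one possible shape for such a wrapper. Everything here (`CharClasses`, `new`, `as_slice`, `set`) is hypothetical and generic over the class type; it is a sketch of the concept, not a proposal for the exact API.

```rust
/// Hypothetical wrapper: byte-indexed classes that can be read freely but
/// only written one whole character at a time.
struct CharClasses<'t, C> {
    text: &'t str,
    classes: Vec<C>,
}

impl<'t, C: Copy> CharClasses<'t, C> {
    /// Build the byte-indexed array, copying each class onto its off-bytes.
    fn new(text: &'t str, classify: impl Fn(char) -> C) -> Self {
        let mut classes = Vec::with_capacity(text.len());
        for c in text.chars() {
            classes.extend(std::iter::repeat(classify(c)).take(c.len_utf8()));
        }
        CharClasses { text, classes }
    }

    /// Read-only view, so existing code can keep indexing by byte.
    fn as_slice(&self) -> &[C] {
        &self.classes
    }

    /// Overwrite the class of the character starting at byte index `i`,
    /// updating all of its bytes together. Panics if `i` is out of range
    /// or is not a character boundary.
    fn set(&mut self, i: usize, class: C) {
        assert!(
            i < self.text.len() && self.text.is_char_boundary(i),
            "index {} is not the start of a character",
            i
        );
        let len = self.text[i..].chars().next().map_or(0, char::len_utf8);
        self.classes[i..i + len].fill(class);
    }
}
```

Because the only mutable access goes through `set`, a loop cannot update a character's first byte and forget its off-bytes, and an attempt to write at an off-byte fails loudly instead of corrupting state. The cost is a boundary check and a small decode on each write, which would need measuring.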
[^1]: Written cleaner as `sequence.runs.iter().flat_map(Clone::clone)`, which we do in `resolve_neutral()`.
[^2]: Are level runs in an isolating run sequence always adjacent? This would be simpler to reason about if we always treated isolating run sequences as a single range instead of breaking them down further.