servo / unicode-bidi

Implementation of the Unicode Bidirectional Algorithm in Rust

Inconsistency between iterating over byte and char indices #86

Open Manishearth opened 1 year ago

Manishearth commented 1 year ago

This is something I kinda discovered whilst working on https://github.com/servo/unicode-bidi/pull/85

The majority of the bidi algorithm involves taking the set of original bidi classes, going through them, and overriding some of them. To do this, unicode-bidi passes around a `processing_classes: &mut [BidiClass]` that starts out as a clone of `original_classes: &[BidiClass]`, a slice mapping byte indices to the classes of each character.

Of course, non-character-boundary byte indices (I'm going to call these "off-bytes") are kinda meaningless in the context of the bidi algorithm. original_classes starts out with copied bidi classes for these, but we have to be careful about both maintaining this property and also not accidentally treating byte indices as character indices.
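To illustrate the layout, here's a minimal sketch with a toy classifier (hypothetical code, not the crate's actual internals): each character's class is repeated across all of its UTF-8 bytes, so the trailing "off-bytes" hold copies.

```rust
// Toy stand-in for unicode_bidi::BidiClass; the real crate derives
// classes from Unicode data tables.
#[derive(Clone, Copy, Debug, PartialEq)]
enum BidiClass {
    L,  // strong left-to-right
    ON, // other neutral
}

// Build a table with one entry per *byte*: every byte of a multibyte
// character receives a copy of that character's class.
fn byte_indexed_classes(text: &str) -> Vec<BidiClass> {
    let mut classes = Vec::with_capacity(text.len());
    for c in text.chars() {
        // Hypothetical classifier, for illustration only.
        let class = if c.is_ascii_alphabetic() {
            BidiClass::L
        } else {
            BidiClass::ON
        };
        for _ in 0..c.len_utf8() {
            classes.push(class);
        }
    }
    classes
}

fn main() {
    // 'é' is two bytes in UTF-8, so its class occupies two slots:
    // index 1 is the lead byte, index 2 is an "off-byte" copy.
    let classes = byte_indexed_classes("aé!");
    assert_eq!(classes.len(), 4);
    assert_eq!(classes[1], classes[2]);
}
```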

Furthermore, the code is inconsistent about whether it iterates over characters or bytes, and it's tricky to see whether it updates off-bytes consistently.

Analysis

This is a writeup of my process walking through the code to see what breaks due to this. TLDR: the property is maintained (often rather subtly) but iterating over bytes causes at least one bug (https://github.com/servo/unicode-bidi/pull/87), and it also makes stuff annoying to reason about when editing the code. Feel free to take my word for it and skip this section.

`explicit::compute()` iterates over `char_indices()` and performs additional fixups later to fill in the copied classes, maintaining the invariant that off-bytes match the class of their character. Huzzah. We could probably do some kind of deferred update thing where that fixup is not run for each character.

However, `resolve_weak()` iterates over `sequence.runs.iter().cloned().flat_map(id as fn(LevelRun) -> LevelRun);`[^1], which is a fancy way of saying "every byte index contained in each level run in the level run sequence"[^2]. This _seems_ great: we can continue maintaining the aforementioned invariant, but uh oh: this loop has some state! It tracks `prev_class` (and some other stuff), which won't necessarily be correct when processing the off-bytes of a multibyte character. This is fine in the following cases, and maintains the invariant:

- W1: off-bytes will end up copying whatever happened for the main byte. Seems fine.
- W2: `last_strong_is_al` is maintained correctly, so we're fine.
- W5 & W6: these rules deal with sequences of ENs, and off-bytes are merely additional members of that sequence. W5 does check `prev_class`, but only at the boundaries of EN sequences.

However, it's broken for W4, which cares about *single* separators sandwiched between numbers: multibyte separators will simply not hit it (see the sketch at the end of this analysis). This is fixable by changing the code for that rule only. The invariant is still maintained, however, mostly because W4 is brokenly skipping the very cases where the invariant would otherwise have problems. I don't yet have a testcase for this, but it ought to be easy to write. I have a fix at https://github.com/servo/unicode-bidi/pull/87.

On to `resolve_neutral()`. The code I just added for N0 (https://github.com/servo/unicode-bidi/pull/85) does maintain the invariant. While we've shown that `resolve_weak()` maintains the invariant, the code for N0 is actually rather resilient to invariant-breaking in `resolve_weak()`, because `resolve_weak()` doesn't affect the *presence* of strong classes: it only introduces new strong classes, and that's all we care about in N0: "which strong classes are in this substring" and "what's the first strong class in this substring, starting from the end". When updating the classes for parentheses, N0 updates all of the classes, and the NSM-sequence code also updates everything since it's dealing with sequences. While N0 is fine here, it took me a couple of passes to get right because I had to pay attention to what everyone else was doing. Great.

What about N1 and N2? These are handled with the sequence-run iteration thing again, and they also have some state. They look for NIs, handling sequences. They do not distinguish between sequences and individual characters, and they treat the entire sequence as a single unit to be updated, so they're also fine as long as the invariant has been maintained so far, which it seems to be.

Finally, `resolve_levels()` iterates over levels, so it has no problems here. After that, `processing_classes` is no longer relevant, thank heavens. `assign_levels_to_removed_chars()` is short and doesn't do anything weird with `original_classes`, so we're fine there too.
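Back to W4: to make the failure mode concrete, here is a stripped-down sketch of a per-byte scan missing a multibyte separator. This is hypothetical illustration code, not the crate's actual `resolve_weak()` (which carries `prev_class` as loop state rather than indexing back into the slice, but the effect is the same):

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum Class {
    EN, // European Number
    CS, // Common Separator
}

// Naive per-byte W4: a *single* separator between two numbers becomes
// EN. `classes` has one entry per byte, trailing bytes holding copies.
fn w4_per_byte(classes: &mut [Class]) {
    for i in 1..classes.len().saturating_sub(1) {
        if classes[i] == Class::CS
            && classes[i - 1] == Class::EN
            && classes[i + 1] == Class::EN
        {
            classes[i] = Class::EN;
        }
    }
}

fn main() {
    // Single-byte separator between digits: the rule fires.
    let mut ok = [Class::EN, Class::CS, Class::EN];
    w4_per_byte(&mut ok);
    assert_eq!(ok[1], Class::EN);

    // A two-byte separator (e.g. U+060C ARABIC COMMA, class CS) fills
    // two byte slots with CS. Neither slot sees EN on *both* sides,
    // so the rule never fires for it.
    let mut broken = [Class::EN, Class::CS, Class::CS, Class::EN];
    w4_per_byte(&mut broken);
    assert_eq!(broken[1], Class::CS); // still CS: the bug
}
```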

Moving forward

I think there are a couple issues here. We have at least one bug, and this property is excruciating to reason about and rather fragile. Furthermore, we're iterating over a lot of unnecessary bytes, which might be extra work?

We can do a couple things here:

Thoughts? @eggrobin @sffc @mbrubeck @behnam

[^1]: Written more cleanly as `sequence.runs.iter().flat_map(Clone::clone)`, which we do in `resolve_neutral()`.

[^2]: Are level runs in an isolating run sequence always adjacent? This would be simpler to reason about if we always treated isolating run sequences as a single range instead of breaking it down further.
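For the curious, the iteration pattern in the first footnote just flattens each level run's byte range. A minimal illustration, with `runs` standing in for `sequence.runs` (in unicode-bidi a `LevelRun` is a byte-index `Range<usize>`):

```rust
use std::ops::Range;

fn main() {
    // Stand-in for `sequence.runs`: two non-adjacent level runs.
    let runs: Vec<Range<usize>> = vec![0..3, 7..9];

    // Flat-mapping the ranges yields every byte index they contain,
    // in order; this is the `flat_map(Clone::clone)` form.
    let indices: Vec<usize> = runs.iter().flat_map(Clone::clone).collect();
    assert_eq!(indices, vec![0, 1, 2, 7, 8]);
}
```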

Manishearth commented 1 year ago

> Are level runs in an isolating run sequence always adjacent? This would be simpler to reason about if we always treated isolating run sequences as a single range instead of breaking it down further.

The answer is "no", and the spec has something to say about this:

> When applying a rule to an isolating run sequence, the last character of each level run in the isolating run sequence is treated as if it were immediately followed by the first character in the next level run in the sequence, if any.
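So rule application must carry its state (like `prev_class`) across level-run boundaries as if the runs were contiguous. One way to express that iteration, as a sketch (a hypothetical helper, not the crate's actual API):

```rust
use std::ops::Range;

// Hypothetical helper: walk the characters of several level runs as
// one logical sequence, yielding absolute byte indices, so that
// "previous character" state naturally flows across run boundaries.
fn chars_of_runs<'a>(
    text: &'a str,
    runs: &'a [Range<usize>],
) -> impl Iterator<Item = (usize, char)> + 'a {
    runs.iter().flat_map(move |run| {
        text[run.clone()]
            .char_indices()
            .map(move |(i, c)| (run.start + i, c))
    })
}

fn main() {
    // Two non-adjacent runs over "abXYcd": "ab" and "cd".
    let text = "abXYcd";
    let runs = [0..2, 4..6];
    let chars: Vec<(usize, char)> = chars_of_runs(text, &runs).collect();
    assert_eq!(chars, vec![(0, 'a'), (1, 'b'), (4, 'c'), (5, 'd')]);
}
```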

sffc commented 1 year ago

This thread reminds me a bit of https://github.com/google/diff-match-patch/pull/13 and format_to_parts.md and others. Basically, what is the best way to store character annotations in UTF-8 or UTF-16?

A few general approaches:

  1. Store byte-for-byte annotations.
    • Pro: Efficient lookup and cross-referencing.
    • Pro: Works well with insertions and deletions.
    • Con: Tricky to keep updated / in sync since trail bytes can get forgotten.
    • Con: Wasted space since annotations may occur multiple times for the same code point.
  2. Store annotations for only the first byte, leaving others empty.
    • Pro: Same as above
    • Con: Same as above, but maybe easier to keep updated?
  3. Store annotations on UTF-32 indices, keeping the original string as UTF-8 or UTF-16
    • Pro: Easy to access annotations by themselves
    • Pro: No bias towards UTF-8 versus UTF-16 (still needed for Java and JavaScript compatibility)
    • Pro: No wasted space
    • Con: May require linear-time lookup to cross-reference the string with its annotations
  4. Operate on UTF-32 strings instead of UTF-8 strings, converting at the boundary if necessary
    • Pro: Much easier to reason about; less likely for errors to crop up
    • Con: Requires allocating more memory
    • Con: Probably not as efficient (but maybe by not as much as one would think, since the algorithms get a lot simpler when you do it this way)
  5. Store annotations in their own special data structure such as a range list (see format_to_parts.md; a minimal sketch follows this list)
    • Pro: Can result in efficient lookup for annotations
    • Con: Trickier when handling insertions and deletions in the middle of the string; may be an expensive operation in the annotation structure
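
As a rough illustration of option 5, here is a sketch assuming a simple sorted, non-overlapping range list over UTF-8 byte indices (hypothetical code, not ICU4X's or unicode-bidi's actual data structure):

```rust
use std::ops::Range;

// Annotations stored as (byte_range, value) pairs over the original
// UTF-8 string. Assumes ranges are pushed in sorted order and do not
// overlap.
struct RangeAnnotations<T> {
    spans: Vec<(Range<usize>, T)>,
}

impl<T: Copy> RangeAnnotations<T> {
    fn new() -> Self {
        Self { spans: Vec::new() }
    }

    fn push(&mut self, range: Range<usize>, value: T) {
        self.spans.push((range, value));
    }

    // Lookup is a binary search over the range starts: the efficient
    // lookup that the "pro" above refers to.
    fn get(&self, byte_index: usize) -> Option<T> {
        let i = self.spans.partition_point(|(r, _)| r.start <= byte_index);
        let (r, v) = self.spans.get(i.checked_sub(1)?)?;
        (byte_index < r.end).then_some(*v)
    }
}

fn main() {
    let mut ann = RangeAnnotations::new();
    ann.push(0..2, "latin");
    ann.push(2..8, "arabic");
    assert_eq!(ann.get(3), Some("arabic"));
    assert_eq!(ann.get(9), None);
}
```

An insertion or deletion in the middle of the string would require shifting every subsequent range, which is the "con" noted above.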