servo / unicode-bidi

Implementation of the Unicode Bidirectional Algorithm in Rust

Native UTF-16 support #108

Closed jfkthame closed 1 year ago

jfkthame commented 1 year ago

To make it easier & more efficient to use unicode-bidi in an environment (such as Gecko) where text is handled as UTF-16, I would like to extend the API here to provide a UTF-16 interface, and do the processing directly on UTF-16 code units as an alternative to UTF-8 code units (bytes).

This would not change the existing API in any way, or affect existing users.

Proposal:

Introduce versions of the BidiInfo and InitialInfo structs where the text field is &[u16] instead of &str. I'm suggesting these could be named BidiInfoU16 and InitialInfoU16. Except for the type of their text, these will be identical to the existing UTF-8-based versions.

We'll also need ParagraphU16, because its info will be a &BidiInfoU16.
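To make the shape of the proposed types concrete, here is a reduced sketch. It is not the crate's actual definition: the `levels` and `paragraphs` fields are simplified placeholders, and `new_placeholder` is a hypothetical constructor that performs no bidi processing. The point it illustrates is that per-position data would be indexed by UTF-16 code unit rather than by byte, and that a paragraph type has to borrow the matching 16-bit info struct.

```rust
use std::ops::Range;

/// Hypothetical, reduced stand-in for the proposed BidiInfoU16.
pub struct BidiInfoU16<'text> {
    /// The text as UTF-16 code units instead of &str.
    pub text: &'text [u16],
    /// Placeholder: one embedding level per UTF-16 code unit.
    pub levels: Vec<u8>,
    /// Placeholder: code-unit ranges of each paragraph.
    pub paragraphs: Vec<Range<usize>>,
}

/// Hypothetical stand-in for ParagraphU16: it borrows the 16-bit info struct.
pub struct ParagraphU16<'a, 'text> {
    pub info: &'a BidiInfoU16<'text>,
    pub range: Range<usize>,
}

impl<'text> BidiInfoU16<'text> {
    /// Placeholder constructor: one paragraph, every level 0.
    /// It only sets up the data layout and does no bidi analysis.
    pub fn new_placeholder(text: &'text [u16]) -> Self {
        BidiInfoU16 {
            text,
            levels: vec![0; text.len()],
            paragraphs: vec![0..text.len()],
        }
    }

    /// Borrow the i-th paragraph together with the info it belongs to.
    pub fn paragraph(&self, i: usize) -> ParagraphU16<'_, 'text> {
        ParagraphU16 { info: self, range: self.paragraphs[i].clone() }
    }
}
```

With this layout, a supplementary-plane character occupies two entries in `levels`, in the same way that a multi-byte character occupies several entries in a byte-indexed layout.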

To allow the actual implementation of the bidi algorithm to be shared between the 8- and 16-bit versions of these structs, I propose a TextSource trait that abstracts access to and iteration over the text, with implementations for str and for [u16]. Only minor adaptation of the InitialInfo, BidiInfo, and Paragraph methods is needed to work with this.
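A minimal sketch of what such an abstraction could look like follows. The method names `len_units` and `char_indices_lossy` are invented for illustration and are not taken from the prototype. The sketch returns a `Vec` rather than an iterator to stay short, and it assumes unpaired surrogates are replaced with U+FFFD, which is only one of the possible policies.

```rust
/// Hypothetical, simplified trait; not the crate's actual API.
trait TextSource {
    /// Length in code units (bytes for str, 16-bit units for [u16]).
    fn len_units(&self) -> usize;
    /// (code-unit offset, decoded char) pairs for the whole text.
    fn char_indices_lossy(&self) -> Vec<(usize, char)>;
}

impl TextSource for str {
    fn len_units(&self) -> usize {
        self.len()
    }
    fn char_indices_lossy(&self) -> Vec<(usize, char)> {
        // str is always well-formed, so no substitution is needed.
        self.char_indices().collect()
    }
}

impl TextSource for [u16] {
    fn len_units(&self) -> usize {
        self.len()
    }
    fn char_indices_lossy(&self) -> Vec<(usize, char)> {
        let mut offset = 0;
        char::decode_utf16(self.iter().copied())
            .map(|r| {
                // Assumption: an unpaired surrogate becomes U+FFFD and
                // accounts for exactly one code unit.
                let (c, units) = match r {
                    Ok(c) => (c, c.len_utf16()),
                    Err(_) => (char::REPLACEMENT_CHARACTER, 1),
                };
                let start = offset;
                offset += units;
                (start, c)
            })
            .collect()
    }
}

/// Shared code can then be written once against the trait.
fn count_chars<T: TextSource + ?Sized>(text: &T) -> usize {
    text.char_indices_lossy().len()
}
```

Because the offsets are expressed in the source's own code units, shared algorithm code never needs to know whether it is indexing bytes or 16-bit units.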

@Manishearth Does this sound like a reasonable way forward? I have a prototype implementation working locally, which I can put up as a PR for review if you think the overall idea is acceptable.

One factor to consider is that while we know, when using the str-based API, that the text must be well-formed Unicode, this will not be the case for a [u16]-based API; there could be unpaired surrogate code units present. There are a few ways we could handle this:

(a) Require the text to be valid UTF-16; panic!() if unpaired surrogates are encountered
(b) Have the 16-bit methods return Result()s everywhere, so that invalid text can return an error
(c) Treat any unpaired surrogate as REPLACEMENT_CHARACTER for all bidi processing

I'm currently leaning toward (c), but happy to listen to arguments for other options.
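The difference between options (b) and (c) can be sketched with the standard library's `char::decode_utf16`, which yields a `Result` per decoded character. The helper names `decode_strict` and `decode_lossy` are hypothetical and only illustrate the two policies; option (a) would amount to calling `unwrap()` on each item instead.

```rust
use std::char::DecodeUtf16Error;

/// Option (b): report the first unpaired surrogate as an error.
fn decode_strict(units: &[u16]) -> Result<String, DecodeUtf16Error> {
    char::decode_utf16(units.iter().copied()).collect()
}

/// Option (c): substitute U+FFFD for each unpaired surrogate.
fn decode_lossy(units: &[u16]) -> String {
    char::decode_utf16(units.iter().copied())
        .map(|r| r.unwrap_or(char::REPLACEMENT_CHARACTER))
        .collect()
}
```

With (c), callers never have to handle an error path, and well-formed input is treated the same under both policies.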

Manishearth commented 1 year ago

Yeah, I'm fine with this, though I may not have time to review it soon.

In general I would like this crate to be encoding-agnostic (and also be able to support e.g. ill-formed UTF-8).

A thing I would like to see solved here is #86: whatever we do to implement this should abstract over indexing well enough that we no longer need to care about it.

jfkthame commented 1 year ago

I think we could easily adapt this to handle ill-formed UTF-8. We'd need to create an alternative API using [u8] instead of str; then we provide a suitable implementation of TextSource for [u8], and it should "just work".

We could then make the existing str API into a trivial shim on top of the [u8] API, provided the additional validity-checking is cheap enough to ignore.
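As a rough illustration of the shim idea, the sketch below decodes possibly ill-formed UTF-8 with the standard library's `String::from_utf8_lossy` and layers a str entry point on top of it. Both function names are hypothetical, and replacing invalid sequences with U+FFFD is an assumption carried over from option (c) above, not something decided in this thread.

```rust
/// Hypothetical byte-oriented entry point: decodes possibly ill-formed
/// UTF-8, substituting U+FFFD for invalid sequences.
fn chars_from_bytes(bytes: &[u8]) -> Vec<char> {
    String::from_utf8_lossy(bytes).chars().collect()
}

/// Hypothetical str entry point as a thin shim over the byte version.
/// The input is already valid, so the only extra cost is re-validation.
fn chars_from_str(s: &str) -> Vec<char> {
    chars_from_bytes(s.as_bytes())
}
```

For valid input, `from_utf8_lossy` borrows rather than allocates, so the shim's overhead is a validation scan over the bytes; whether that is cheap enough to ignore would have to be measured.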