Closed jfkthame closed 1 year ago
Yeah, I'm fine with this, though I may not have time to review it soon.
In general I would like this crate to be encoding-agnostic (and also be able to support e.g. ill-formed UTF-8).
A thing I would like to see solved here is #86: whatever we do to implement this should abstract over indexing well enough that we no longer need to care about it.
I think we could easily adapt this to handle ill-formed UTF-8. We'd need to create an alternative API using `[u8]` instead of `str`; then we provide a suitable implementation of `TextSource` for `[u8]`, and it should "just work".
We could then make the existing `str` API into a trivial shim on top of the `[u8]` API, provided the additional validity-checking is cheap enough to ignore.
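As a rough illustration of the `[u8]`-plus-shim idea, here is a minimal sketch that walks possibly ill-formed UTF-8 and yields `(byte offset, char)` pairs, substituting U+FFFD for each invalid sequence. The function names `indexed_chars_lossy` and `indexed_chars_str` are hypothetical and not part of unicode-bidi; the replacement policy is also an assumption, since the thread does not settle how ill-formed bytes should be treated.

```rust
/// Hypothetical helper (not a unicode-bidi API): decode possibly ill-formed
/// UTF-8, returning each character with the byte offset where it starts.
/// Every invalid sequence is reported as a single U+FFFD (an assumed policy).
pub fn indexed_chars_lossy(mut bytes: &[u8]) -> Vec<(usize, char)> {
    let mut out = Vec::new();
    let mut base = 0; // offset of `bytes` within the original slice
    while !bytes.is_empty() {
        match std::str::from_utf8(bytes) {
            Ok(s) => {
                // The remainder is well-formed; emit it and stop.
                out.extend(s.char_indices().map(|(i, c)| (base + i, c)));
                break;
            }
            Err(e) => {
                let valid = e.valid_up_to();
                // Emit the well-formed prefix that precedes the error.
                let s = std::str::from_utf8(&bytes[..valid]).expect("prefix was validated");
                out.extend(s.char_indices().map(|(i, c)| (base + i, c)));
                out.push((base + valid, char::REPLACEMENT_CHARACTER));
                // `error_len()` is None when the input ends mid-sequence;
                // in that case the rest of the slice is skipped.
                let skip = valid + e.error_len().unwrap_or(bytes.len() - valid);
                base += skip;
                bytes = &bytes[skip..];
            }
        }
    }
    out
}

/// The `str` entry point as a shim over the byte-based one. Already-valid
/// text takes the `Ok` branch on the first pass, so the extra cost is one
/// validity check.
pub fn indexed_chars_str(text: &str) -> Vec<(usize, char)> {
    indexed_chars_lossy(text.as_bytes())
}
```

Whether that single validation pass is "cheap enough to ignore" would need measuring on real text; the sketch only shows that the shim itself can be trivial.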
To make it easier & more efficient to use unicode-bidi in an environment (such as Gecko) where text is handled as UTF-16, I would like to extend the API here to provide a UTF-16 interface, and do the processing directly on UTF-16 code units as an alternative to UTF-8 code units (bytes).
This would not change the existing API in any way, or affect existing users.
Proposal:
- Introduce versions of the `BidiInfo` and `InitialInfo` structs where the `text` field is `&[u16]` instead of `&str`. I'm suggesting these could be named `BidiInfoU16` and `InitialInfoU16`. Except for the type of their `text`, these will be identical to the existing UTF-8-based versions.
- We'll also need `ParagraphU16`, because its `info` will be a `&BidiInfoU16`.
- To allow the actual implementation of the bidi algorithm to be shared between the 8- and 16-bit versions of these structs, I propose a `TextSource` trait that abstracts access to and iteration over the text, with implementations for `str` and for `[u16]`. Only minor adaptation of the `InitialInfo`, `BidiInfo`, and `Paragraph` methods is needed to work with this.

@Manishearth Does this sound like a reasonable way forward? I have a prototype implementation working locally, which I can put up as a PR for review if you think the overall idea is acceptable.
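To make the shape of the proposed abstraction concrete, here is a minimal sketch of what such a trait could look like. Only the name `TextSource` comes from the proposal; the methods `code_unit_len` and `indexed_chars`, and the helper `char_starts`, are assumptions for illustration and may not match the prototype.

```rust
/// Sketch of a text abstraction. The method set is hypothetical.
pub trait TextSource {
    /// Length in code units: bytes for `str`, 16-bit units for `[u16]`.
    fn code_unit_len(&self) -> usize;
    /// Iterate over (code-unit offset, char) pairs.
    fn indexed_chars(&self) -> Box<dyn Iterator<Item = (usize, char)> + '_>;
}

impl TextSource for str {
    fn code_unit_len(&self) -> usize {
        self.len()
    }
    fn indexed_chars(&self) -> Box<dyn Iterator<Item = (usize, char)> + '_> {
        Box::new(self.char_indices())
    }
}

impl TextSource for [u16] {
    fn code_unit_len(&self) -> usize {
        self.len()
    }
    fn indexed_chars(&self) -> Box<dyn Iterator<Item = (usize, char)> + '_> {
        let mut offset = 0;
        Box::new(char::decode_utf16(self.iter().copied()).map(move |r| {
            // An unpaired surrogate occupies exactly one code unit, and
            // U+FFFD also has a UTF-16 length of one, so `len_utf16()`
            // gives the right step in both cases.
            let c = r.unwrap_or(char::REPLACEMENT_CHARACTER);
            let start = offset;
            offset += c.len_utf16();
            (start, c)
        }))
    }
}

/// Stand-in for shared algorithm code: written once against the trait,
/// usable with either encoding. Returns the offset of each character.
pub fn char_starts<T: TextSource + ?Sized>(text: &T) -> Vec<usize> {
    text.indexed_chars().map(|(i, _)| i).collect()
}
```

The boxed iterator keeps the sketch short; an associated iterator type would avoid the allocation and is probably what a real implementation would want.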
One factor to consider is that while we know, when using the `str`-based API, that the text must be well-formed Unicode, this will not be the case for a `[u16]`-based API; there could be unpaired surrogate code units present. There are a few ways we could handle this:

(a) Require the text to be valid UTF-16; `panic!()` if unpaired surrogates are encountered
(b) Have the 16-bit methods return `Result()`s everywhere, so that invalid text can return an error
(c) Treat any unpaired surrogate as `REPLACEMENT_CHARACTER` for all bidi processing

I'm currently leaning toward (c), but happy to listen to arguments for other options.
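For comparison, the options can be sketched with the standard library's `char::decode_utf16`. The functions `decode_strict` and `decode_lossy` are hypothetical names used only here; they illustrate the error-handling choice on its own, not how the crate would integrate it.

```rust
/// Option (b): report the code-unit offset of the first unpaired surrogate.
/// Option (a) amounts to calling `.expect(...)` on this result.
pub fn decode_strict(text: &[u16]) -> Result<Vec<char>, usize> {
    let mut offset = 0;
    let mut out = Vec::new();
    for unit in char::decode_utf16(text.iter().copied()) {
        match unit {
            Ok(c) => {
                offset += c.len_utf16();
                out.push(c);
            }
            Err(_) => return Err(offset),
        }
    }
    Ok(out)
}

/// Option (c): substitute U+FFFD for each unpaired surrogate and carry on.
pub fn decode_lossy(text: &[u16]) -> Vec<char> {
    char::decode_utf16(text.iter().copied())
        .map(|r| r.unwrap_or(char::REPLACEMENT_CHARACTER))
        .collect()
}
```

One practical point in favour of (c): each unpaired surrogate maps to exactly one replacement character, so code-unit offsets into the original `[u16]` stay aligned with the decoded characters and callers need no error path.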