open-i18n / rust-unic

UNIC: Unicode and Internationalization Crates for Rust
https://crates.io/crates/unic
Other
234 stars 24 forks source link

[unic-bidi] API concerns #273

Open raphlinus opened 3 years ago

raphlinus commented 3 years ago

I'm looking at using unic-bidi for a contemplated text layout project. For this discussion, consider the primary goal to be adding BiDi support to a (hypothetical) in-Rust implementation of text layout conforming to Piet's TextLayout trait. I see a number of concerns.

As a higher level process issue, one way to address those concerns is to fork unic-bidi. But I see value in trying to maintain a single source of truth for BiDi in the Rust ecosystem, so it feels like working with upstream is a good way to do that. That said, #272 is pretty good evidence to me, in addition to reverse dependencies (note that the RustPython reverse dep has actually recently been removed) that nobody is actually using unic-bidi, especially within the context of paragraph level text formatting.

Now for my concerns.

A big one is the &'text lifetime requirement on BidiInfo. That precludes retaining BidiInfo as part of our text layout object. One pattern for fixing this is to change the type parameter from <'text> to AsRef<str>, which would let the text be borrowed or owned, at the caller's request. Another way to fix it is to simply clone the string (it's not as if the implementation is particularly clone-averse). But I'm not convinced we actually need to retain the string. If you look at the methods on BidiInfo, I'm not sure reorder_line() is actually still valid in a modern environment; if my understanding is correct, HarfBuzz always wants its input in logical order. The string is also used in reorder_levels_per_char() to find the codepoint boundaries (I'm not sure this function is useful), and in visual_runs(), again mostly to find codepoint boundaries, but I think the "reset" logic can be run entirely from original_classes without making reference to the text.

Other concerns center around performance. I'm not advocating doing a lot of performance work now, but I also believe that the current API locks in certain decisions that make such performance work difficult in the future. That's mostly around the per-byte (technically, per UTF-8 code unit) arrays to represent things like levels. I think these don't represent the way people typically work with text in Rust, and in any case there are about twice as many UTF-8 code units as UTF-16 for most BiDi work.

So what I'd like better is for visual_runs to return an array of runs, where each run is a range in the text and a level (the latter would be after the reset logic is applied). I think the various "reorder" functions should go, as I don't see them as being particularly useful, and if they are* needed, I think their function can be written quite reasonably in terms of this proposed visual_runs.

Lastly, the stuff around ParagraphInfo seems crufty to me. Why is this passed in? It's possible to infer from the text position. Taking a step back, it's not clear to me that segmenting into paragraphs is a useful function for unic-bidi in the first place. It seems extremely likely that its client will have already done segmentation into paragraphs by the time it gets to BiDi analysis. If the goal of the crate is to cover all of UAX#9, then perhaps paragraph segmentation can be provided through a separate API. Then the scope of BidiInfo could be narrowed to a single paragraph, which I think would be a desirable simplification.

I post this in part to provoke discussion from potential other clients of the crate, and also to test whether unic is a good container for the work.

sffc commented 3 years ago

Hi @raphlinus,

It's worth discussing whether BiDi might be in scope for ICU4X. See more info in #274. I encourage you to start an issue on our repository:

https://github.com/unicode-org/icu4x/issues

dhardy commented 3 years ago

Agreed, this would be worth discussing. See also: https://github.com/kas-gui/kas-text/issues/20

Two tools are needed here in my opinion:

  1. A tool to resolve the embedding levels according to TR9; this is provided by unciode-bidi and unic-bidi but the implementation is imperfect (e.g. https://github.com/servo/unicode-bidi/issues/8)
  2. https://crates.io/crates/unicode-bidi-mirroring (which works perfectly according to my limited testing)
raphlinus commented 3 years ago

I'm curious why you need the mirroring. Shouldn't HarfBuzz take care of that for you as part of its RTL shaping? It's entirely possible there's something I'm missing though.

dhardy commented 3 years ago

HarfBuzz does take care of it, but I also maintain a simple (Kerning only) alternative, and that uses the mirroring.

behnam commented 3 years ago

Thanks for the detailed report, @raphlinus; and others for valuable comments.

The state of Bidi Algorithm impl in unic-bidi (and unicode-bidi package, similarly) is indeed far from ideal. This is definitely not because of lack of interest, but because in my exploration on the solution space, I ended up noticing similar challenges around the current API, as explained in this issue.

Sharing some context: I put together the BidiInfo structure together without much API design deliberation, only to solve the specific problem I was dealing with in Servo. (And the main reason there's both unic-bidi and unicode-bidi packages is the fact that I want to change the API of unic-bidi drastically, which Servo may not need for the time being.)

To give you a super-quick update on the my current focus on UNIC (at least for the next few weeks) is to get it back on track for forward-development work, with the goal to get a proper Text API in place, supporting segmentation, bidi and joining control. (From there, would be possible to get to line-break, as well. I haven't plans for myself for that, yet.) One of the changes needed on this road for me, is to split the "explicit bidi run resolution" part of the Bidi Algorithm working under no-std.

And, of course, there's a lot more work to be done for Bidi impl here, in terms of conformance, and performance (specifically, getting to linear or near-linear impl, possibly under no-std conditions).

Now, whether this project is a good place for such work? Well, I have been open about my personal agenda (which is focus on Multilingual Text API), and others have been contributing per their own goals and agenda, sharing what they're comfortable with. I understand well that there's been some confusions because of my lack of resources to get to work areas that others are interested in, and there's not really much I can do about that. My attempt is to push for an initiative where everyone can participate, and every language gets a fair share of support across all the layers of the technology built. I would be more than happy to hear your thoughts on this. :)

dhardy commented 3 years ago

From my point of view, unic-bidi should determine embedding levels, and the rest is irrelevant. Why? We have other tools (like xi-unicode for line-breaking and unicode-bidi-mirroring) which handle most of the other jobs which can be done separately already. Re-ordering can only be done after determining line-wrap points which can only be determined after shaping text (my waffle on this subject); it's also not very difficult once the other jobs have been done.