Getting info of code points after shaping.

wdanilo commented 2 years ago

Hi, I want to use rustybuzz, but I ran into one issue after browsing the API. I want to provide it with a Unicode buffer and get back the glyph buffer. However, I also want to know which indices in the input Unicode buffer each glyph corresponds to. The reason is that I'm also using the xi-rope crate, which allows me to style the text (e.g. set colors) based on the Unicode indices of the input buffer. Unfortunately, I don't see any API that would allow me to do this. Is it possible?

RazrFalcon commented 2 years ago

I don't think it's possible in the general case. Shaping applies multiple transformations to the input "string", and tracking them would be rather hard and expensive.

For example, two codepoints can be replaced by one glyph and vice-versa. What output would you expect in this case?

The closest thing we have is GlyphInfo::cluster, which by default should contain a byte/UTF-8 offset in the original string. For example:

> cargo run --example shape -- --utf8-clusters --no-glyph-names --no-positions Amiri-Regular.ttf "هتاف للترحيب"
2009=21|2065=19|2160=17|3293=15|3265=13|5296=11|5290=9|3=8|414=6|1962=4|2078=2|2198=0

Here the first number is the GlyphID and the second one is the byte offset in the original string. Note that, due to BIDI reordering, the glyphs are reversed.

Or here is another example:

> cargo run --example shape -- --utf8-clusters --no-positions '/System/Library/Fonts/Times.ttc' "final"
fi=0|n=2|a=3|l=4

As you can see, fi was replaced by a single glyph. But "character" at offset 0 is f, not fi. You can technically recover i here, but it's all up to you.

laurmaedje commented 2 years ago

Going into a bit more detail on the clusters: as mentioned, multiple glyphs may merge into a ligature, or one codepoint may result in multiple glyphs, etc. When this happens, the glyphs form a cluster and become an inseparable unit. As a result, all glyphs in a cluster share the index at which the whole thing starts in the source text; this is what is stored in the cluster field. You shouldn't try to infer anything inside a cluster: it is one unit for all intents and purposes (e.g., cursor movement).

To find out which piece of text a cluster spans, you have to look at the next cluster index that is different. Also, if you want to handle RTL text properly, you have to be extra careful, because there the cluster indices run in reverse (they decrease from one glyph to the next). For more details, also check the HarfBuzz documentation.
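A minimal sketch of that lookup, assuming the cluster values are byte offsets as described above and have already been copied out of the shaped glyph infos into a plain slice. The function name `cluster_byte_ranges` is made up for illustration and is not part of rustybuzz. Instead of scanning neighbours, it sorts the distinct cluster starts, so the same code works whether the indices run forwards or backwards:

```rust
use std::ops::Range;

/// For each glyph, returns the byte range of the source text covered by the
/// cluster that the glyph belongs to.
///
/// `clusters` holds one cluster value per glyph, in glyph (visual) order.
/// `text_len` is the byte length of the shaped text.
fn cluster_byte_ranges(clusters: &[u32], text_len: usize) -> Vec<Range<usize>> {
    // Distinct cluster start offsets in ascending (logical) order.
    let mut starts: Vec<usize> = clusters.iter().map(|&c| c as usize).collect();
    starts.sort_unstable();
    starts.dedup();

    clusters
        .iter()
        .map(|&c| {
            let start = c as usize;
            // Index of the first cluster start strictly greater than ours.
            let next = starts.partition_point(|&s| s <= start);
            // The last cluster runs to the end of the text.
            let end = starts.get(next).copied().unwrap_or(text_len);
            start..end
        })
        .collect()
}

fn main() {
    // Cluster values from the "final" example: fi=0|n=2|a=3|l=4
    let text = "final";
    let clusters = [0u32, 2, 3, 4];
    for range in cluster_byte_ranges(&clusters, text.len()) {
        println!("{}..{} = {:?}", range.start, range.end, &text[range.clone()]);
    }
}
```

For the ligature glyph this yields the range 0..2, i.e. the source text "fi". Glyphs that share a cluster value all get the same range, which matches the "one unit" rule.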

wdanilo commented 2 years ago

@RazrFalcon and @laurmaedje I can't express how thankful I am for such an amazing explanation. I believe it should be included in the docs as well. TBH I didn't know that the cluster value is the byte/UTF-8 offset; this is exactly what I need. After shaping, I just want to know the start/end byte offsets of the final clusters so that I can color them etc., so it seems that clusters are exactly what I'm looking for. I read all the docs and couldn't find this info in them. Thank you a thousand times for your help, I really appreciate it ❤️
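For the coloring use case, a rough sketch of looking up a glyph's style from its cluster value. Everything here is hypothetical: the `Color` enum, the `(byte range, color)` span list and `color_for_cluster` merely stand in for whatever the styling layer (e.g. xi-rope spans) actually provides, and the cluster is assumed to be a byte offset:

```rust
use std::ops::Range;

// Hypothetical color type standing in for real style data.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Color {
    Default,
    Red,
    Blue,
}

/// Returns the color of the first style span that contains the byte offset
/// at which the glyph's cluster starts, or `Color::Default` if none does.
fn color_for_cluster(styles: &[(Range<usize>, Color)], cluster: u32) -> Color {
    let offset = cluster as usize;
    styles
        .iter()
        .find(|(range, _)| range.contains(&offset))
        .map(|(_, color)| *color)
        .unwrap_or(Color::Default)
}

fn main() {
    // Style "fi" red and "nal" blue, using byte offsets into "final".
    let styles = [(0..2, Color::Red), (2..5, Color::Blue)];
    // Cluster values from the "final" example: fi=0|n=2|a=3|l=4
    for cluster in [0u32, 2, 3, 4] {
        println!("cluster {} -> {:?}", cluster, color_for_cluster(&styles, cluster));
    }
}
```

Because a cluster is one unit, a style boundary that falls inside it (say, between f and i of the fi ligature) cannot be honored this way; the whole glyph takes the style found at the cluster start.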

RazrFalcon commented 2 years ago

No problem. And yes, the docs need a lot of improvement. But frankly, shaping is such a niche thing that a caller probably already knows what they are doing. Most people don't even know what shaping is.

harfbuzz / rustybuzz

Getting info of code points after shaping. #51