Closed pgundlach closed 2 years ago
The word परमेश्वर (Parmeshwar - God) consists of the following Unicode input sequence:
Now I assume that the highlighted zero width joiner (U+200D, ZWJ) is used to connect the first and the last part of the word in order to keep it together.
From
https://en.wikipedia.org/wiki/Zero-width_joiner
"When placed between two characters that would otherwise not be connected, a ZWJ causes them to be printed in their connected forms."
HarfBuzz, the font shaping library, turns this into the following (font-dependent) glyphs:
The ZWJ gets converted into a space glyph with width 0, which the speedata Publisher incorrectly interprets as an interword space. My fix is to ignore zero-width spaces.
Fonts might also kern the space, or use a different space glyph for different scripts.
Here are a few test fonts:
$ hb-shape --text="ا ا" aref-ruqaa/ArefRuqaa-Regular.otf
[alef-ar=2+248|space=1+146|alef-ar=0+248]
$ hb-shape --text="a a" aref-ruqaa/ArefRuqaa-Regular.otf
[a=0+548|space#1=1+333|a=2+548]
$ hb-shape --text="ا ا" qahiri/Qahiri-Regular.otf
[alef-ar=2+360|space=1+0|alef-ar=0+360]
$ hb-shape --text="a a" qahiri/Qahiri-Regular.otf
[a=0+240|space.latn=1+200|a=2+240]
I suggest:
- Detect word space based on the input character, i.e. using the cluster. This backmap seems to assume there is a unique relationship between glyphs and code points, which is not true in OpenType. Doing this alone would have avoided this issue in the first place, since even though this is a space glyph, the cluster points to an input ZWJ.
- Use the x advance of the shaped glyph for the width of the glue, not a fixed, font-global space width.
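The two suggestions above can be sketched in a few lines of Python that work directly on the hb-shape output shown earlier. This is only an illustration, not the speedata Publisher's code: `parse_hb_shape` and `word_glue` are hypothetical helper names, the parser only handles the simple `name=cluster+advance` items seen in this thread (no offsets), and only U+0020 is treated as a word space.

```python
import re

def parse_hb_shape(line):
    """Parse hb-shape output such as '[a=0+548|space=1+333]'.

    Returns a list of (glyph_name, cluster, x_advance) tuples.
    Assumption: items carry no '@x,y' offsets.
    """
    glyphs = []
    for item in line.strip().strip("[]").split("|"):
        m = re.fullmatch(r"(.+)=(\d+)\+(-?\d+)", item)
        if m is None:
            raise ValueError(f"unexpected glyph item: {item!r}")
        glyphs.append((m.group(1), int(m.group(2)), int(m.group(3))))
    return glyphs

def word_glue(text, glyphs):
    """Return (cluster, width) for glyphs whose cluster points at U+0020.

    The decision is made from the input code point, not the glyph name,
    and the width is the shaped x advance, not a font-global value.
    """
    return [(cluster, adv) for _name, cluster, adv in glyphs
            if text[cluster] == " "]

# Latin text in Aref Ruqaa: the space glyph is named 'space#1'.
latin = parse_hb_shape("[a=0+548|space#1=1+333|a=2+548]")

# The Devanagari word with ZWJ, shaped at cluster level 1.
word = "\u092a\u0930\u092e\u0947\u0936\u094d\u200d\u0935\u0930"
deva = parse_hb_shape(
    "[padeva=0+568|radeva=1+409|madeva=2+598|evowelsigndeva=3+0|"
    "shaprehalfdeva=4+407|space=6+0|vadeva=7+556|radeva=8+409]"
)

print(word_glue("a a", latin))  # the real space, with its shaped advance
print(word_glue(word, deva))    # empty: cluster 6 is a ZWJ, not a space
```

Because the check looks at the code point behind the cluster, the differently named space glyphs (`space#1`, `space.latn`) need no special handling.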
I assume that I have the incorrect cluster level. Setting it to 1 or 2 seems to connect the space to the ZWJ.
Level 0:
hb-shape --cluster-level=0 --show-unicode --text="परमेश्<200d>वर" spielwiese/fonts/NotoSansDevanagari-Regular.ttf
<U+092A=0|U+0930=1|U+092E=2|U+0947=3|U+0936=4|U+094D=5|U+200D=6|U+0935=7|U+0930=8>
[padeva=0+568|radeva=1+409|madeva=2+598|evowelsigndeva=2+0|shaprehalfdeva=4+407|space=4+0|vadeva=7+556|radeva=8+409]
Level 1:
hb-shape --cluster-level=1 --show-unicode --text="परमेश्<200d>वर" spielwiese/fonts/NotoSansDevanagari-Regular.ttf
<U+092A=0|U+0930=1|U+092E=2|U+0947=3|U+0936=4|U+094D=5|U+200D=6|U+0935=7|U+0930=8>
[padeva=0+568|radeva=1+409|madeva=2+598|evowelsigndeva=3+0|shaprehalfdeva=4+407|space=6+0|vadeva=7+556|radeva=8+409]
which has the correct connection between input cluster 6 (ZWJ) and the output space.
@khaledhosny thank you for letting me re-think the shaping.
HarfBuzz clusters are a bit tricky because they are used to convey complex code points to glyphs relationships:
c0,c1,c2,c3 -> g0=0,g1=1,g2=3
This gap in clusters indicates that g1 is a ligature of c1 and c2.
c0,c1,c2 -> g0=0,g1=0,g2=0,g3=1,g4=2
This means glyphs g0, g1, and g2 all correspond to c0.
c0,c1,c2 -> g0=0,g1=0,g2=0,g3=2
This means glyphs g0, g1, and g2 all belong to c0 and c1.
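The three cases above can be reproduced with a small Python sketch that recovers, for each cluster value, which code points and which glyphs belong to it. `cluster_spans` is a hypothetical helper, not a HarfBuzz function. It assumes cluster values are code point indices and that a cluster covers every code point from its own value up to the next larger cluster value, which matches the examples here for cluster levels 0 and 1.

```python
def cluster_spans(num_code_points, glyph_clusters):
    """Map each cluster value to (code point indices, glyph indices).

    Assumption: a cluster spans the code points from its value up to
    (but not including) the next larger cluster value in the buffer.
    """
    starts = sorted(set(glyph_clusters))
    spans = {}
    for i, start in enumerate(starts):
        end = starts[i + 1] if i + 1 < len(starts) else num_code_points
        glyph_ids = [g for g, c in enumerate(glyph_clusters) if c == start]
        spans[start] = (list(range(start, end)), glyph_ids)
    return spans

# c0,c1,c2,c3 -> g0=0,g1=1,g2=3: g1 is a ligature of c1 and c2.
print(cluster_spans(4, [0, 1, 3]))
# c0,c1,c2 -> g0=0,g1=0,g2=0,g3=1,g4=2: g0..g2 all come from c0.
print(cluster_spans(3, [0, 0, 0, 1, 2]))
# c0,c1,c2 -> g0=0,g1=0,g2=0,g3=2: g0..g2 belong to c0 and c1 together.
print(cluster_spans(3, [0, 0, 0, 2]))
```

Sorting the distinct cluster values first means the same function also works for right-to-left runs, where the glyphs arrive in descending cluster order.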
In cluster level 1 this happens mainly when there is reordering: since HarfBuzz promises that clusters will be in ascending order, it merges them in case of reordering to ensure this. In cluster level 0, this also happens with combining marks (and other things it thinks should be one unit) even if there is no reordering, which is the main reason I prefer cluster level 1 myself, as it allows more granularity. Cluster level 2 is a wild beast and I blissfully ignore it.
"I assume that I have the incorrect cluster level. Setting it to 1 or 2 seems to connect the space to the ZWJ."
See above for the explanation. It wouldn't matter for the issue at hand, though. In cluster level 0 the ZWJ is merged with the glyphs before it, so they are now one unit and you wouldn't be adding glue here anyway. Even if an actual space code point got merged like this, you don't want to add glue unless the space is the last code point in the cluster (say a font turns a few words into a ligature: you don't want to insert glue after the ligature for each space code point).
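The rule "add glue only if the space is the last code point in the cluster" could look roughly like this in Python. `glue_clusters` is a hypothetical name, only U+0020 is treated as a space, and the cluster span is again assumed to run from the cluster value to the next larger cluster value.

```python
def glue_clusters(text, glyph_clusters):
    """Return the cluster values after which word glue should be added.

    A cluster qualifies only when its last code point is U+0020.
    Assumption: cluster values are code point indices into `text`.
    """
    starts = sorted(set(glyph_clusters))
    result = []
    for i, start in enumerate(starts):
        end = starts[i + 1] if i + 1 < len(starts) else len(text)
        if text[end - 1] == " ":
            result.append(start)
    return result

# The Devanagari word at cluster level 0: cluster 4 covers SHA, virama
# and ZWJ, so the zero-width 'space' glyph in it never produces glue.
word = "\u092a\u0930\u092e\u0947\u0936\u094d\u200d\u0935\u0930"
print(glue_clusters(word, [0, 1, 2, 2, 4, 4, 7, 8]))

# A made-up font that turns "a b c" into one ligature glyph: the spaces
# sit inside the cluster, so no glue is inserted for them.
print(glue_clusters("a b c", [0]))

# Plain "a a" with one glyph per code point: glue after cluster 1.
print(glue_clusters("a a", [0, 1, 2]))
```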
Thank you for the explanation.
So I think I'll use a combination of the following:
- Use cluster level > 0
Cluster level 1. Cluster level 2 does not guarantee that the values will be in ascending order, so you might see 0, 3, 1, 4, 2, or whatever.
- Ignore ZWJ and similar code points in the output
They get a zero-width space glyph by default, but if you check for space code points using the clusters you shouldn't need any special handling for them (you can also configure which glyph ID HarfBuzz uses for them, or even make HarfBuzz drop them altogether, if you prefer that).
Khaled, thanks again very much for your valuable comments. I have now used the flag HB_BUFFER_FLAG_REMOVE_DEFAULT_IGNORABLES to remove these from the glyphs.
incorrect result: