परमेश्‍वर (parameshvar) incorrectly displayed

pgundlach commented 2 years ago

<Layout xmlns="urn:speedata.de:2009/publisher/en"
    xmlns:sd="urn:speedata:2009/publisher/functions/en">
    <LoadFontfile name="Noto" filename="NotoSansDevanagari-Regular.ttf" />
    <Pageformat height="4cm" width="5cm" />
    <DefineFontfamily fontsize="10" leading="12" name="text">
        <Regular fontface="Noto" />
    </DefineFontfamily>
    <Record element="data">
        <PlaceObject>
            <Textblock>
                <Paragraph language="Other">
                    <Value>परमेश्‍वर</Value>
                </Paragraph>
            </Textblock>
        </PlaceObject>
    </Record>
</Layout>

incorrect result:

pgundlach commented 2 years ago

The word परमेश्वर (Parmeshwar - God) consists of the following Unicode input sequence:

Now I assume that the highlighted zero width joiner (U+200D, ZWJ) is used to connect the first and the last part of the word in order to keep it together.

From

https://en.wikipedia.org/wiki/Zero-width_joiner

"When placed between two characters that would otherwise not be connected, a ZWJ causes them to be printed in their connected forms."

HarfBuzz, the font shaping library, turns this into the following (font-dependent) glyphs:

The ZWJ gets converted into a space glyph with width 0, which the speedata Publisher incorrectly interprets as an interword space. My fix is to ignore zero-width spaces.

khaledhosny commented 2 years ago

Fonts might also kern the space, or use different space glyphs for different scripts. I suggest:

Detect word space based on the input character, i.e. using the cluster. This backmap seems to assume there is a unique relationship between glyphs and code points, which is not true in OpenType. Doing this alone would have avoided this issue in the first place, since even though this is a space glyph, the cluster points to an input ZWJ.
Use the x advance of the shaped glyph for the width of the glue, not a fixed font-global space width.
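The two suggestions above can be sketched roughly as follows. This is only an illustration, not the speedata Publisher code: the `Glyph` record and the `glue_widths` helper are hypothetical names, and the glyph data is copied from the hb-shape output quoted elsewhere in this thread (cluster level 1). It assumes clusters are code point indices into the input text.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    name: str      # glyph name as reported by the shaper
    cluster: int   # index of the input code point the glyph maps back to
    advance: int   # x advance in font units


def glue_widths(text, glyphs):
    """Return (glyph name, glue width or None) for every shaped glyph.

    A glyph becomes glue only if its cluster points at an input U+0020,
    and the glue width is the shaped x advance, not a font-global value.
    """
    result = []
    for g in glyphs:
        if text[g.cluster] == "\u0020":
            result.append((g.name, g.advance))
        else:
            result.append((g.name, None))
    return result


# Latin text: the space glyph maps back to U+0020, so it becomes glue.
latin = glue_widths(
    "a a",
    [Glyph("a", 0, 548), Glyph("space#1", 1, 333), Glyph("a", 2, 548)],
)

# The word from this issue: the "space" glyph maps back to U+200D (ZWJ),
# so no glue is produced even though the glyph is a space glyph.
word = "\u092a\u0930\u092e\u0947\u0936\u094d\u200d\u0935\u0930"
devanagari = glue_widths(
    word,
    [
        Glyph("padeva", 0, 568), Glyph("radeva", 1, 409),
        Glyph("madeva", 2, 598), Glyph("evowelsigndeva", 3, 0),
        Glyph("shaprehalfdeva", 4, 407), Glyph("space", 6, 0),
        Glyph("vadeva", 7, 556), Glyph("radeva", 8, 409),
    ],
)
print(latin)
print(devanagari)
```

The decision never looks at the glyph name or glyph id, which is what makes it independent of how a particular font implements its spaces.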

khaledhosny commented 2 years ago

Here are a few test fonts:

Aref Ruqaa uses different space glyphs with different advance widths for Arabic and Latin:

```
$ hb-shape --text="ا ا" aref-ruqaa/ArefRuqaa-Regular.otf
[alef-ar=2+248|space=1+146|alef-ar=0+248]
$ hb-shape --text="a a" aref-ruqaa/ArefRuqaa-Regular.otf
[a=0+548|space#1=1+333|a=2+548]
```

Qahiri uses different space glyphs like Aref Ruqaa, and additionally the Arabic space (which happens to be the default) has zero advance width. You still want glue here to allow for line breaks:
```
$ hb-shape --text="ا ا" qahiri/Qahiri-Regular.otf
[alef-ar=2+360|space=1+0|alef-ar=0+360]
$ hb-shape --text="a a" qahiri/Qahiri-Regular.otf
[a=0+240|space.latn=1+200|a=2+240]
```
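To experiment with these test fonts, the hb-shape lines can be parsed and checked against the input text. The sketch below is an assumption-laden helper, not part of HarfBuzz: `parse_hb_shape` and `break_opportunities` are made-up names, and the parser only handles the simple `name=cluster+advance` form shown in this thread. It shows that the zero-advance Qahiri space is still reported as a break opportunity, because the check is on the input character and not on the glyph width.

```python
import re

# One item of hb-shape output in the simple form used in this thread.
GLYPH_RE = re.compile(r"^(?P<name>.+)=(?P<cluster>\d+)\+(?P<advance>-?\d+)$")


def parse_hb_shape(line):
    """Parse '[name=cluster+advance|...]' into (name, cluster, advance) tuples."""
    line = line.strip()
    if not (line.startswith("[") and line.endswith("]")):
        raise ValueError("not an hb-shape glyph line: %r" % line)
    glyphs = []
    for item in line[1:-1].split("|"):
        match = GLYPH_RE.match(item)
        if match is None:
            raise ValueError("unsupported glyph item: %r" % item)
        glyphs.append(
            (match["name"], int(match["cluster"]), int(match["advance"]))
        )
    return glyphs


def break_opportunities(text, glyphs):
    """Return (glyph name, advance) for glyphs whose cluster is an input space."""
    return [
        (name, advance)
        for name, cluster, advance in glyphs
        if text[cluster] == " "
    ]


arabic = break_opportunities(
    "\u0627 \u0627",
    parse_hb_shape("[alef-ar=2+360|space=1+0|alef-ar=0+360]"),
)
latin = break_opportunities(
    "a a",
    parse_hb_shape("[a=0+240|space.latn=1+200|a=2+240]"),
)
print(arabic, latin)
```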

pgundlach commented 2 years ago

> Fonts might also kern the space, or use different space glyphs for different scripts. I suggest:
>
> Detect word space based on the input character, i.e. using the cluster. This backmap seems to assume there is a unique relationship between glyphs and code points, which is not true in OpenType. Doing this alone would have avoided this issue in the first place, since even though this is a space glyph, the cluster points to an input ZWJ.
>
> Use the x advance of the shaped glyph for the width of the glue, not a fixed font-global space width.

I assume that I have the incorrect cluster level. Setting it to 1 or 2 seems to connect the space to the ZWJ.

Level 0:

```
hb-shape --cluster-level=0 --show-unicode --text="परमेश्<200d>वर" spielwiese/fonts/NotoSansDevanagari-Regular.ttf
<U+092A=0|U+0930=1|U+092E=2|U+0947=3|U+0936=4|U+094D=5|U+200D=6|U+0935=7|U+0930=8>
[padeva=0+568|radeva=1+409|madeva=2+598|evowelsigndeva=2+0|shaprehalfdeva=4+407|space=4+0|vadeva=7+556|radeva=8+409]
```

Level 1:

```
hb-shape --cluster-level=1 --show-unicode --text="परमेश्<200d>वर" spielwiese/fonts/NotoSansDevanagari-Regular.ttf
<U+092A=0|U+0930=1|U+092E=2|U+0947=3|U+0936=4|U+094D=5|U+200D=6|U+0935=7|U+0930=8>
[padeva=0+568|radeva=1+409|madeva=2+598|evowelsigndeva=3+0|shaprehalfdeva=4+407|space=6+0|vadeva=7+556|radeva=8+409]
```

which has the correct connection between input cluster 6 (ZWJ) and the output space.
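The difference between the two runs can be made explicit by looking up which input code point the `space` glyph's cluster value points to. This is a small illustration using the data from the two hb-shape runs; the helper name `space_origin` is hypothetical.

```python
# Input code points, as printed by --show-unicode.
CODEPOINTS = [0x092A, 0x0930, 0x092E, 0x0947, 0x0936, 0x094D, 0x200D, 0x0935, 0x0930]

# (glyph name, cluster) pairs from the two hb-shape runs.
LEVEL0 = [
    ("padeva", 0), ("radeva", 1), ("madeva", 2), ("evowelsigndeva", 2),
    ("shaprehalfdeva", 4), ("space", 4), ("vadeva", 7), ("radeva", 8),
]
LEVEL1 = [
    ("padeva", 0), ("radeva", 1), ("madeva", 2), ("evowelsigndeva", 3),
    ("shaprehalfdeva", 4), ("space", 6), ("vadeva", 7), ("radeva", 8),
]


def space_origin(glyphs):
    """Describe the code point at the start of the 'space' glyph's cluster."""
    cluster = next(c for name, c in glyphs if name == "space")
    return "space -> cluster %d -> U+%04X" % (cluster, CODEPOINTS[cluster])


print(space_origin(LEVEL0))
print(space_origin(LEVEL1))
```

With level 0 the cluster value only identifies the start of a merged cluster (U+0936), while with level 1 it identifies the ZWJ itself (U+200D).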

pgundlach commented 2 years ago

@khaledhosny thank you for letting me re-think the shaping.

khaledhosny commented 2 years ago

HarfBuzz clusters are a bit tricky because they are used to convey complex code-point-to-glyph relationships:

one-to-one: the simplest and the most common case, one input code point maps to one output glyph.
many-to-one (AKA ligatures): more than one input code point maps to one output glyph. This can be detected by the presence of gaps in the clusters of output glyphs: c0,c1,c2,c3 -> g0=0,g1=1,g2=3. The gap in clusters indicates that g1 is a ligature of c1 and c2.
one-to-many: one input code point maps to more than one glyph. This is detected by a sequence of glyphs all having the same cluster: c0,c1,c2 -> g0=0,g1=0,g2=0,g3=1,g4=2. This means glyphs g0, g1, and g2 all correspond to c0.
many-to-many: more than one input code point maps to more than one glyph. This is detected by a sequence of glyphs all having the same cluster and a gap in the clusters: c0,c1,c2 -> g0=0,g1=0,g2=0,g3=2. This means glyphs g0, g1, and g2 all belong to c0 and c1. In cluster level 1 this happens mainly when there is reordering: since HarfBuzz promises that clusters will be in ascending order, it will merge them in case of reordering to ensure this. In cluster level 0, this also happens with combining marks (and other things it thinks should be one unit) even if there is no reordering, and it is the main reason I prefer cluster level 1 myself as it allows more granularity. Cluster level 2 is a wild beast and I blissfully ignore it.
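The four cases can be told apart mechanically from the cluster values alone. The following is a minimal sketch, assuming ascending cluster values (left-to-right text, cluster level 0 or 1) and clusters counted in code points; `classify_clusters` is a hypothetical helper, not a HarfBuzz function.

```python
def classify_clusters(num_codepoints, clusters):
    """Classify each cluster as one-to-one, many-to-one, one-to-many or many-to-many.

    num_codepoints: length of the input text in code points.
    clusters: the cluster value of every output glyph, in ascending order.
    Returns a list of (cluster value, relationship) pairs.
    """
    # Group consecutive glyphs that share a cluster value.
    groups = []  # each entry is [cluster value, glyph count]
    for cluster in clusters:
        if groups and groups[-1][0] == cluster:
            groups[-1][1] += 1
        else:
            groups.append([cluster, 1])

    result = []
    for i, (start, glyph_count) in enumerate(groups):
        # A cluster covers the code points up to the next cluster value;
        # a gap therefore means more than one code point.
        end = groups[i + 1][0] if i + 1 < len(groups) else num_codepoints
        codepoint_count = end - start
        if glyph_count == 1 and codepoint_count == 1:
            kind = "one-to-one"
        elif glyph_count == 1:
            kind = "many-to-one"
        elif codepoint_count == 1:
            kind = "one-to-many"
        else:
            kind = "many-to-many"
        result.append((start, kind))
    return result


# The three examples from the list above.
print(classify_clusters(4, [0, 1, 3]))
print(classify_clusters(3, [0, 0, 0, 1, 2]))
print(classify_clusters(3, [0, 0, 0, 2]))
```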

> I assume that I have the incorrect cluster level. Setting it to 1 or 2 seems to connect the space to the ZWJ.

See above for explanation. It wouldn’t matter for the issue at hand, though. In cluster level 0 the ZWJ is merged with the glyphs before it, so they are now one unit and you wouldn’t be adding glue here anyway. Even if an actual space code point got merged like this, you don’t want to add glue unless the space is the last code point in the cluster (say a font turns a few words into a ligature, you don’t want to insert glue after the ligature for each space code point).
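The "last code point in the cluster" rule can be sketched as below. This is an illustration under the same assumptions as before (ascending cluster values, clusters counted in code points); `glue_after_clusters` is a hypothetical name, and the ligature in the second example is invented for the sake of the demonstration.

```python
def glue_after_clusters(text, clusters):
    """Return the cluster values after which glue should be inserted.

    Glue follows a cluster only if the last code point covered by that
    cluster is a space (U+0020).
    """
    starts = sorted(set(clusters))
    glue = []
    for i, start in enumerate(starts):
        # The cluster covers text[start:end].
        end = starts[i + 1] if i + 1 < len(starts) else len(text)
        if text[end - 1] == " ":
            glue.append(start)
    return glue


# No ligature: every code point is its own cluster, both spaces give glue.
plain = glue_after_clusters("a b c", [0, 1, 2, 3, 4])

# Hypothetical font that turns "a b" into one ligature glyph (cluster 0
# covers code points 0..2): the space inside the ligature gives no glue.
ligature = glue_after_clusters("a b c", [0, 3, 4])

# The word from this issue at cluster level 0: cluster 4 ends in U+200D,
# not in a space, so nothing is inserted.
word = "\u092a\u0930\u092e\u0947\u0936\u094d\u200d\u0935\u0930"
zwj = glue_after_clusters(word, [0, 1, 2, 2, 4, 4, 7, 8])
print(plain, ligature, zwj)
```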

pgundlach commented 2 years ago

Thank you for the explanation.

So I think I'll use a combination of the following:

Use cluster level > 0
Insert a space only if the cluster contains a space at this point
Ignore ZWJ and similar code points in the output

khaledhosny commented 2 years ago

> Use cluster level > 0

Use cluster level 1. Cluster level 2 does not guarantee that the values will be in ascending order, so you might see 0, 3, 1, 4, 2, or whatever.

> Ignore ZWJ and similar code points in the output

They are mapped to a zero-width space glyph by default, but if you check for space code points using the clusters you shouldn’t need any special handling for them (you can also configure which glyph ID HarfBuzz uses for them, or even make HarfBuzz drop them altogether, if you prefer that).

pgundlach commented 2 years ago

Khaled, thanks again very much for your valuable comments. I have now used the flag HB_BUFFER_FLAG_REMOVE_DEFAULT_IGNORABLES to remove these from the glyphs.

speedata / publisher

परमेश्‍वर (parameshvar) incorrectly displayed #369