speedata / publisher

speedata Publisher - a professional database Publishing system
https://www.speedata.de/
GNU Affero General Public License v3.0

परमेश्‍वर (parameshvar) incorrectly displayed #369

Closed pgundlach closed 2 years ago

pgundlach commented 2 years ago
<Layout xmlns="urn:speedata.de:2009/publisher/en"
    xmlns:sd="urn:speedata:2009/publisher/functions/en">
    <LoadFontfile name="Noto" filename="NotoSansDevanagari-Regular.ttf" />
    <Pageformat height="4cm" width="5cm" />
    <DefineFontfamily fontsize="10" leading="12" name="text">
        <Regular fontface="Noto" />
    </DefineFontfamily>
    <Record element="data">
        <PlaceObject>
            <Textblock>
                <Paragraph language="Other">
                    <Value>परमेश्‍वर</Value>
                </Paragraph>
            </Textblock>
        </PlaceObject>
    </Record>
</Layout>

incorrect result:

[image: god]
pgundlach commented 2 years ago

The word परमेश्वर (Parmeshwar - God) consists of the following Unicode input sequence:

[image: input]

Now I assume that the highlighted zero width joiner (U+200D, ZWJ) is used to connect the first and the last part of the word in order to keep it together.

From

https://en.wikipedia.org/wiki/Zero-width_joiner

"When placed between two characters that would otherwise not be connected, a ZWJ causes them to be printed in their connected forms."

HarfBuzz, the font shaping library, turns this into the following (font-dependent) glyphs:

[image: output]

The ZWJ gets converted into a space glyph with width 0, which the speedata Publisher incorrectly interprets as an interword space. My fix is to ignore zero-width spaces.

khaledhosny commented 2 years ago

Fonts might also kern the space glyph, or use different space glyphs for different scripts. I suggest:

  1. Detect word space based on the input character, i.e. using the cluster. This backmap seems to assume there is a unique relationship between glyphs and code points, which is not true in OpenType. Doing this alone would have avoided this issue in the first place, since even though this is a space glyph, the cluster points to an input ZWJ.
  2. Use the x advance of the shaped glyph for the width of the glue, not a fixed, font-global space width.
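
A minimal Python sketch of both suggestions, not speedata Publisher code. It assumes that clusters are code point indices into the input string and that only U+0020 should produce glue; the function name `glue_positions` and the tuple layout are made up for illustration. The glyph data is the cluster-level-1 `hb-shape` output for this word quoted later in this thread.

```python
# Hypothetical sketch: find interword glue by looking at the input code point
# that a glyph's cluster points to, not at the glyph itself.

WORD_SPACES = {"\u0020"}  # assumption: only U+0020 produces stretchable glue


def glue_positions(text, glyphs):
    """Return (glyph_index, x_advance) for every glyph that is a word space.

    text   -- the input string that was shaped
    glyphs -- (glyph_name, cluster, x_advance) tuples; cluster is assumed to
              be a code point index into text
    """
    result = []
    for index, (_name, cluster, advance) in enumerate(glyphs):
        if text[cluster] in WORD_SPACES:
            # Suggestion 2: the glue width comes from the shaped glyph.
            result.append((index, advance))
    return result


text = "\u092a\u0930\u092e\u0947\u0936\u094d\u200d\u0935\u0930"
glyphs = [
    ("padeva", 0, 568), ("radeva", 1, 409), ("madeva", 2, 598),
    ("evowelsigndeva", 3, 0), ("shaprehalfdeva", 4, 407),
    ("space", 6, 0), ("vadeva", 7, 556), ("radeva", 8, 409),
]

# The "space" glyph has cluster 6, which is U+200D, so no glue is reported.
print(glue_positions(text, glyphs))  # []
```

The check never looks at the glyph name, so a font that reuses its space glyph for ZWJ no longer triggers glue.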
khaledhosny commented 2 years ago

Here are a few test fonts:

pgundlach commented 2 years ago

Fonts might also kern the space glyph, or use different space glyphs for different scripts. I suggest:

  1. Detect word space based on the input character, i.e. using the cluster. This backmap seems to assume there is a unique relationship between glyphs and code points, which is not true in OpenType. Doing this alone would have avoided this issue in the first place, since even though this is a space glyph, the cluster points to an input ZWJ.
  2. Use the x advance of the shaped glyph for the width of the glue, not a fixed, font-global space width.

I assume that I have the incorrect cluster level. Setting it to 1 or 2 seems to connect the space to the ZWJ.

Level 0:

hb-shape --cluster-level=0 --show-unicode --text="परमेश्<200d>वर" spielwiese/fonts/NotoSansDevanagari-Regular.ttf
<U+092A=0|U+0930=1|U+092E=2|U+0947=3|U+0936=4|U+094D=5|U+200D=6|U+0935=7|U+0930=8>
[padeva=0+568|radeva=1+409|madeva=2+598|evowelsigndeva=2+0|shaprehalfdeva=4+407|space=4+0|vadeva=7+556|radeva=8+409]

Level 1:

hb-shape --cluster-level=1 --show-unicode --text="परमेश्<200d>वर" spielwiese/fonts/NotoSansDevanagari-Regular.ttf
<U+092A=0|U+0930=1|U+092E=2|U+0947=3|U+0936=4|U+094D=5|U+200D=6|U+0935=7|U+0930=8>
[padeva=0+568|radeva=1+409|madeva=2+598|evowelsigndeva=3+0|shaprehalfdeva=4+407|space=6+0|vadeva=7+556|radeva=8+409]

which has the correct connection between input cluster 6 (ZWJ) and the output space.
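
The two runs can also be compared mechanically. The following Python sketch parses the `hb-shape` lines quoted above and reports the code point at which the cluster of the `space` glyph starts. The helper names are hypothetical, and the regular expressions only cover the `name=cluster+advance` form shown here, not glyph offsets or vertical advances.

```python
import re

CODEPOINT = re.compile(r"U\+([0-9A-Fa-f]+)=(\d+)")
GLYPH = re.compile(r"([^=|\[\]]+)=(\d+)\+(-?\d+)")

INPUT = "<U+092A=0|U+0930=1|U+092E=2|U+0947=3|U+0936=4|U+094D=5|U+200D=6|U+0935=7|U+0930=8>"
LEVEL0 = "[padeva=0+568|radeva=1+409|madeva=2+598|evowelsigndeva=2+0|shaprehalfdeva=4+407|space=4+0|vadeva=7+556|radeva=8+409]"
LEVEL1 = "[padeva=0+568|radeva=1+409|madeva=2+598|evowelsigndeva=3+0|shaprehalfdeva=4+407|space=6+0|vadeva=7+556|radeva=8+409]"


def parse_input(line):
    """'<U+092A=0|...>' -> {cluster: code point}"""
    return {int(cluster): int(cp, 16) for cp, cluster in CODEPOINT.findall(line)}


def parse_glyphs(line):
    """'[padeva=0+568|...]' -> [(name, cluster, x_advance), ...]"""
    return [(name, int(cluster), int(adv)) for name, cluster, adv in GLYPH.findall(line)]


def cluster_start_of(glyph_name, input_line, glyph_line):
    """Code point at which the cluster of the first matching glyph starts."""
    codepoints = parse_input(input_line)
    for name, cluster, _advance in parse_glyphs(glyph_line):
        if name == glyph_name:
            return "U+%04X" % codepoints[cluster]
    return None


print(cluster_start_of("space", INPUT, LEVEL0))  # U+0936
print(cluster_start_of("space", INPUT, LEVEL1))  # U+200D
```

At level 0 the space glyph sits in a merged cluster that starts at U+0936, so its cluster value alone does not identify the ZWJ; at level 1 it points straight at U+200D.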

pgundlach commented 2 years ago

@khaledhosny thank you for letting me re-think the shaping.

khaledhosny commented 2 years ago

HarfBuzz clusters are a bit tricky because they are used to convey complex relationships between code points and glyphs:

  1. one-to-one: the simplest and the most common, one input code point maps to one output glyph.
  2. many-to-one (AKA ligatures): more than one input code point maps to one output glyph. This can be detected by the presence of gaps in the clusters of output glyphs: c0,c1,c2,c3 -> g0=0,g1=1,g2=3. This gap in the clusters indicates that g1 is a ligature of c1 and c2.
  3. one-to-many: one input code point maps to more than one glyph. This is detected by a sequence of glyphs all having the same cluster: c0,c1,c2 -> g0=0,g1=0,g2=0,g3=1,g4=2. This means glyphs g0, g1, and g2 all correspond to c0.
  4. many-to-many: more than one input code point maps to more than one glyph. This is detected by a sequence of glyphs all having the same cluster and a gap in the clusters: c0,c1,c2 -> g0=0,g1=0,g2=0,g3=2. This means glyphs g0, g1, and g2 all belong to c0 and c1. In cluster level 1 this happens mainly when there is reordering: since HarfBuzz promises that clusters will be in ascending order, it merges them in case of reordering to ensure this. In cluster level 0, this also happens with combining marks (and other things it thinks should be one unit) even if there is no reordering, and that is the main reason I prefer cluster level 1 myself, as it allows more granularity. Cluster level 2 is a wild beast and I blissfully ignore it.
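
The four cases can be told apart with a short routine. This is a Python sketch, not HarfBuzz API: `classify` is a made-up name, and it assumes the cluster values arrive in ascending order (cluster level 0 or 1, left-to-right text) and count code points.

```python
def classify(n_codepoints, glyph_clusters):
    """Group glyphs by cluster value and label each group.

    Returns a list of (codepoint_indices, glyph_indices, kind) tuples.
    Assumes glyph_clusters is in ascending order.
    """
    # Consecutive glyphs with the same cluster value form one group.
    groups = []
    for glyph_index, cluster in enumerate(glyph_clusters):
        if groups and groups[-1][0] == cluster:
            groups[-1][1].append(glyph_index)
        else:
            groups.append((cluster, [glyph_index]))

    kinds = {
        (False, False): "one-to-one",
        (True, False): "many-to-one",
        (False, True): "one-to-many",
        (True, True): "many-to-many",
    }
    result = []
    for k, (start, glyph_indices) in enumerate(groups):
        # A group covers the code points up to the next cluster value.
        end = groups[k + 1][0] if k + 1 < len(groups) else n_codepoints
        codepoints = list(range(start, end))
        kind = kinds[len(codepoints) > 1, len(glyph_indices) > 1]
        result.append((codepoints, glyph_indices, kind))
    return result


# The three examples from the list above.
print(classify(4, [0, 1, 3])[1])      # ([1, 2], [1], 'many-to-one')
print(classify(3, [0, 0, 0, 1, 2])[0])  # ([0], [0, 1, 2], 'one-to-many')
print(classify(3, [0, 0, 0, 2])[0])   # ([0, 1], [0, 1, 2], 'many-to-many')
```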

I assume that I have the incorrect cluster level. Setting it to 1 or 2 seems to connect the space to the ZWJ.

See above for explanation. It wouldn’t matter for the issue at hand, though. In cluster level 0 the ZWJ is merged with the glyphs before it, so they are now one unit and you wouldn’t be adding glue here anyway. Even if an actual space code point got merged like this, you don’t want to add glue unless the space is the last code point in the cluster (say a font turns a few words into a ligature; you don’t want to insert glue after the ligature for each space code point).
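
The "last code point in the cluster" rule can be sketched in a few lines of Python. The function name is hypothetical, and the sketch assumes ascending cluster values that index code points and treats only U+0020 as a word space.

```python
def clusters_ending_in_space(text, glyph_clusters):
    """Return the cluster values whose last input code point is U+0020.

    Glue would go after these clusters only. Assumes the cluster values
    are in ascending order and index code points of text.
    """
    starts = sorted(set(glyph_clusters))
    ends = starts[1:] + [len(text)]  # each cluster ends where the next starts
    return [start for start, end in zip(starts, ends) if text[end - 1] == " "]


# The word from this issue at cluster level 0: the merged cluster 4 ends
# in ZWJ, not in a space, so no glue is added.
word = "\u092a\u0930\u092e\u0947\u0936\u094d\u200d\u0935\u0930"
print(clusters_ending_in_space(word, [0, 1, 2, 2, 4, 4, 7, 8]))  # []

# A made-up font that turns "a b c" into a single ligature glyph: the
# spaces are inside the cluster, so no glue is added either.
print(clusters_ending_in_space("a b c", [0]))  # []

# Ordinary text: the space is its own cluster and gets glue.
print(clusters_ending_in_space("ab cd", [0, 1, 2, 3, 4]))  # [2]
```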

pgundlach commented 2 years ago

Thank you for the explanation.

So I think I'll use a combination of the following:

khaledhosny commented 2 years ago

  • Use cluster level > 0

Use cluster level 1. Cluster level 2 does not guarantee that the values will be in ascending order, so you might see 0, 3, 1, 4, 2, or whatever.

  • Ignore ZWJ and similar code points in the output

They get a zero-width space glyph by default, but if you check for space code points using the clusters you shouldn’t need any special handling for them (you can also configure which glyph ID HarfBuzz uses for them, or even make HarfBuzz drop them altogether, if you prefer that).

pgundlach commented 2 years ago

Khaled, thanks again very much for your valuable comments. I have now used the flag HB_BUFFER_FLAG_REMOVE_DEFAULT_IGNORABLES to remove these from the glyphs.