unidoc / unipdf

Golang PDF library for creating and processing PDF files (pure go)
https://unidoc.io

Render unicode font characters #400

Closed rogchap closed 3 years ago

rogchap commented 3 years ago

Description

I'm trying to render glyphs from the NotoEmoji font (freely available from Google). Attached: NotoEmoji-Regular.ttf.zip

If I pass a string containing a code point in the Basic Multilingual Plane, e.g. \u3299 or \U00003299, it renders successfully. However, a code point above U+FFFF, e.g. \U0001F60E, fails to render.
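
To make the difference between the two cases concrete, here is a minimal sketch using only the Go standard library. It is independent of unipdf; the helper name isSupplementary is made up for illustration. It shows that U+3299 fits in 16 bits, while U+1F60E lies outside the Basic Multilingual Plane and needs a surrogate pair in UTF-16.

```go
package main

import (
	"fmt"
	"unicode/utf16"
	"unicode/utf8"
)

// isSupplementary reports whether r lies outside the Basic Multilingual
// Plane, i.e. whether it cannot be stored in a single 16-bit value.
// (Hypothetical helper, not part of unipdf.)
func isSupplementary(r rune) bool {
	return r > 0xFFFF
}

func main() {
	for _, r := range []rune{'\u3299', '\U0001F60E'} {
		fmt.Printf("U+%04X: %d UTF-8 bytes, supplementary=%t\n",
			r, utf8.RuneLen(r), isSupplementary(r))
		if isSupplementary(r) {
			// Only supplementary code points need a UTF-16 surrogate pair.
			hi, lo := utf16.EncodeRune(r)
			fmt.Printf("  UTF-16 surrogate pair: %04X %04X\n", hi, lo)
		}
	}
}
```

Any font table that stores character codes as 16-bit values cannot address the second kind of code point directly.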

Expected Behavior

All available glyphs in the font to render correctly based on the unicode value.

Actual Behavior

No character is rendered.

Attachments

font, err := model.NewCompositePdfFontFromTTFFile("./NotoEmoji-Regular.ttf")
if err != nil {
    panic(err)
}
p := c.NewStyledParagraph() // c is the creator instance
t := p.Append("\U0001F60E")
t.Style.Font = font
c.Draw(p)

P.S. I'm currently evaluating unipdf before buying a license (including unioffice).

github-actions[bot] commented 3 years ago

Welcome! Thanks for posting your first issue. The way things work here is that while customer issues are prioritized, other issues go into our backlog where they are assessed and fitted into the roadmap when suitable. If you need to get this done, consider buying a license which also enables you to use it in your commercial products. More information can be found on https://unidoc.io/

rogchap commented 3 years ago

Digging into this some more, I can see that the runes are not being parsed by the TTF parser, and that codeToUnicode has only 135 entries (rather than the total of 888):

[DEBUG]  truetype.go:130 Missing rune 127827 ('\U0001f353') from encoding
[DEBUG]  styled_paragraph.go:801 unsupported rune in text encoding: 0x1f353 (🍓)

rogchap commented 3 years ago

A further update:

The NotoEmoji font has two cmap subtables: (3,1), which is format 4, and (3,10), which is format 12. The first subtable contains the first 135 glyphs; the rest are in the (3,10) subtable, which is not parsed.

I updated the code to parse this cmap subtable:
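
As a rough illustration of how the subtables can be enumerated, the sketch below walks the encoding records of a raw cmap table using the layout from the OpenType specification (a 16-bit version, a 16-bit table count, then 8-byte records of platform ID, encoding ID and offset). It is not unipdf code: the type encodingRecord and the functions listCmapSubtables and sampleCmap are hypothetical, and the input is a tiny synthetic table rather than a real font.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// encodingRecord describes one cmap subtable (hypothetical type for this sketch).
type encodingRecord struct {
	PlatformID, EncodingID uint16
	Offset                 uint32 // offset from the start of the cmap table
	Format                 uint16 // first field of the subtable itself
}

// listCmapSubtables reads the cmap header and its encoding records.
func listCmapSubtables(cmap []byte) ([]encodingRecord, error) {
	if len(cmap) < 4 {
		return nil, errors.New("cmap table too short")
	}
	numTables := int(binary.BigEndian.Uint16(cmap[2:]))
	recs := make([]encodingRecord, 0, numTables)
	for i := 0; i < numTables; i++ {
		pos := 4 + 8*i
		if pos+8 > len(cmap) {
			return nil, errors.New("truncated encoding record")
		}
		rec := encodingRecord{
			PlatformID: binary.BigEndian.Uint16(cmap[pos:]),
			EncodingID: binary.BigEndian.Uint16(cmap[pos+2:]),
			Offset:     binary.BigEndian.Uint32(cmap[pos+4:]),
		}
		if int64(rec.Offset)+2 > int64(len(cmap)) {
			return nil, errors.New("subtable offset out of range")
		}
		rec.Format = binary.BigEndian.Uint16(cmap[rec.Offset:])
		recs = append(recs, rec)
	}
	return recs, nil
}

// sampleCmap builds a synthetic table with a (3,1) format 4 record and a
// (3,10) format 12 record. Only the format word of each subtable is present.
func sampleCmap() []byte {
	b := make([]byte, 24)
	binary.BigEndian.PutUint16(b[2:], 2) // numTables
	binary.BigEndian.PutUint16(b[4:], 3) // record 0: platform 3
	binary.BigEndian.PutUint16(b[6:], 1) // encoding 1
	binary.BigEndian.PutUint32(b[8:], 20)
	binary.BigEndian.PutUint16(b[12:], 3)  // record 1: platform 3
	binary.BigEndian.PutUint16(b[14:], 10) // encoding 10
	binary.BigEndian.PutUint32(b[16:], 22)
	binary.BigEndian.PutUint16(b[20:], 4)  // format of subtable (3,1)
	binary.BigEndian.PutUint16(b[22:], 12) // format of subtable (3,10)
	return b
}

func main() {
	recs, err := listCmapSubtables(sampleCmap())
	if err != nil {
		panic(err)
	}
	for _, r := range recs {
		fmt.Printf("(%d,%d) format %d at offset %d\n",
			r.PlatformID, r.EncodingID, r.Format, r.Offset)
	}
}
```

Format 4 stores 16-bit character codes, so code points above U+FFFF can only appear in a subtable with a 32-bit format such as format 12.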

func (t *ttfParser) ParseCmap() error {
// ...
    numTables := int(t.ReadUShort())
    offset10 := int64(0)
    offset31 := int64(0)
    offset310 := int64(0)
    fmt.Printf("numTables = %+v\n", numTables)
    for j := 0; j < numTables; j++ {
        platformID := t.ReadUShort()
        encodingID := t.ReadUShort()
        offset = int64(t.ReadULong())
        fmt.Printf("platformID, encodingID, offset = %+v, %v, %v\n", platformID, encodingID, offset)
        if platformID == 3 && encodingID == 1 {
            // (3,1) subtable. Windows Unicode.
            offset31 = offset
        } else if platformID == 3 && encodingID == 10 {
            offset310 = offset
        } else if platformID == 1 && encodingID == 0 {
            offset10 = offset
        }
    }

    // Many non-Latin fonts (including asian fonts) use subtable (1,0).
    if offset10 != 0 {
        if err := t.parseCmapVersion(offset10); err != nil {
            return err
        }
    }

    // Latin font support based on (3,1) table encoding.
    if offset31 != 0 {
        if err := t.parseCmapSubtable31(offset31); err != nil {
            return err
        }
    }

    // emoji support
    if offset310 != 0 {
        if err := t.parseCmapVersion(offset310); err != nil {
            return err
        }
    }
// ...
}

This added more glyphs to the t.rec.Chars map, but still not all of the available ones.

Digging a bit deeper, I noticed that there were strange runes mapping to the same glyph ID. I ended up changing the loop logic within parseCmapFormat12 and managed to get the full character set:

func (t *ttfParser) parseCmapFormat12() error {

    numGroups := t.ReadULong()

    common.Log.Trace("parseCmapFormat12: %s numGroups=%d", t.rec.String(), numGroups)

    for i := uint32(0); i < numGroups; i++ {
        firstCode := t.ReadULong()
        endCode := t.ReadULong()
        startGlyph := t.ReadULong()

        if firstCode > 0x0010FFFF || (0xD800 <= firstCode && firstCode <= 0xDFFF) {
            return errors.New("invalid characters codes")
        }

        if endCode < firstCode || endCode > 0x0010FFFF || (0xD800 <= endCode && endCode <= 0xDFFF) {
            return errors.New("invalid characters codes")
        }

        for j := firstCode; j <= endCode; j++ {
            if j > 0x10FFFF {
                common.Log.Debug("Format 12 cmap contains character beyond UCS-4")
            }
            t.rec.Chars[rune(j)] = GID(startGlyph)
            startGlyph++
        }

    }

    return nil
}
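
For reference, the mapping rule the loop above relies on can be shown in isolation. In a format 12 subtable, each group covers a contiguous range of character codes, and per the OpenType specification the glyph ID increases by one for each successive code in the range. The sketch below applies that rule to in-memory groups; the names glyphID, group and expandGroups are hypothetical and not part of unipdf, and glyphID is widened to uint32 purely to keep the arithmetic simple.

```go
package main

import (
	"errors"
	"fmt"
)

// glyphID and group are hypothetical types for this sketch.
type glyphID uint32

type group struct {
	StartCode, EndCode, StartGlyph uint32
}

// expandGroups maps every code in each group to StartGlyph plus its
// distance from StartCode, so consecutive codes get consecutive glyphs.
func expandGroups(groups []group) (map[rune]glyphID, error) {
	chars := make(map[rune]glyphID)
	for _, g := range groups {
		if g.EndCode < g.StartCode || g.EndCode > 0x10FFFF {
			return nil, errors.New("invalid character code range")
		}
		for c := g.StartCode; c <= g.EndCode; c++ {
			chars[rune(c)] = glyphID(g.StartGlyph + (c - g.StartCode))
		}
	}
	return chars, nil
}

func main() {
	// Made-up glyph numbers, chosen only to show the sequential mapping.
	chars, err := expandGroups([]group{
		{StartCode: 0x1F600, EndCode: 0x1F602, StartGlyph: 10},
		{StartCode: 0x1F353, EndCode: 0x1F353, StartGlyph: 50},
	})
	if err != nil {
		panic(err)
	}
	for _, r := range []rune{0x1F600, 0x1F601, 0x1F602, 0x1F353} {
		fmt.Printf("U+%04X -> glyph %d\n", r, chars[r])
	}
}
```

If every code in a range were assigned the group's starting glyph instead, many runes would share one glyph ID, which resembles the symptom described above.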

What are the side effects of parsing more sub tables? Is this a viable solution?

I would be happy to submit a PR, but I know you are now working in a closed repo.

Any thoughts @gunnsth on getting this fixed/added?

rogchap commented 3 years ago

FYI: I also tried my "fix" with OpenSansEmoji, which also keeps its emoji glyphs in the (3,10) subtable, and it rendered correctly. Note that font subsetting cannot be used with this change, so a complete fix would also need to support subsetting for (3,10).

gunnsth commented 3 years ago

@rogchap Thanks for the interest and great passion for getting this to work. This seems like something we would like to have working. Are you willing to contribute this to our closed repo? We would have it reviewed and added, and it would be added to an upcoming release.

rogchap commented 3 years ago

@gunnsth Happy to contribute; let me know what the steps are. Would be a great benefit to have this in a release.

gunnsth commented 3 years ago

@rogchap So sorry about the late reply, and thanks a lot for your contribution here. This has been incorporated into the newest releases of unipdf, so emojis should work fine, both when using the creator and when subsetting fonts that include emojis.