mity / md4c

C Markdown parser. Fast. SAX-like interface. Compliant with the CommonMark specification.
MIT License

Let the `entity_lookup()` function return UTF-32, not UTF-8 #12

Closed tin-pot closed 7 years ago

tin-pot commented 7 years ago

The fact that entity_lookup() currently returns the replacement text as a UTF-8 octet sequence is convenient for UTF-8 output (of course), but clumsy otherwise: when generating UTF-16 output, md2html would first have to decode the UTF-8 sequence into a code point, and then write that code point as one or two UTF-16 code units.

While the latter step is trivial, the former is just an unnecessary burden (and in fact, md2html.c won't replace entity references when generating UTF-16 right now).
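To make the two steps concrete, here is a minimal sketch in C of what a UTF-16 renderer would need under the current interface. The function names `utf8_decode` and `utf16_encode` are hypothetical and not part of md4c or md2html; the decoder is deliberately simplified and assumes well-formed input (it does not reject overlong forms or encoded surrogates).

```c
#include <stddef.h>

/* Hypothetical helper (not md4c API): decode one UTF-8 sequence of at most
 * n bytes starting at s. Stores the code point in *cp and returns the number
 * of bytes consumed, or 0 if the sequence is truncated or malformed.
 * Simplified: overlong forms and encoded surrogates are not rejected. */
size_t utf8_decode(const unsigned char *s, size_t n, unsigned long *cp)
{
    size_t len, i;
    unsigned long c;

    if (n == 0)
        return 0;
    if (s[0] < 0x80)                { c = s[0];        len = 1; }
    else if ((s[0] & 0xE0) == 0xC0) { c = s[0] & 0x1F; len = 2; }
    else if ((s[0] & 0xF0) == 0xE0) { c = s[0] & 0x0F; len = 3; }
    else if ((s[0] & 0xF8) == 0xF0) { c = s[0] & 0x07; len = 4; }
    else
        return 0;                   /* stray continuation or invalid lead byte */
    if (len > n)
        return 0;
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;               /* not a continuation byte */
        c = (c << 6) | (unsigned long)(s[i] & 0x3F);
    }
    *cp = c;
    return len;
}

/* Hypothetical helper (not md4c API): encode one code point as UTF-16.
 * Returns the number of code units written (1 or 2), or 0 if cp is a
 * surrogate or lies beyond U+10FFFF. */
size_t utf16_encode(unsigned long cp, unsigned short out[2])
{
    if (cp < 0x10000UL) {
        if (cp >= 0xD800UL && cp <= 0xDFFFUL)
            return 0;               /* surrogates are not scalar values */
        out[0] = (unsigned short)cp;
        return 1;
    }
    if (cp > 0x10FFFFUL)
        return 0;
    cp -= 0x10000UL;
    out[0] = (unsigned short)(0xD800UL + (cp >> 10));    /* high surrogate */
    out[1] = (unsigned short)(0xDC00UL + (cp & 0x3FFUL)); /* low surrogate */
    return 2;
}
```

With a UTF-32 result from the lookup, only the second function would be needed on the UTF-16 path; the decoding step disappears entirely.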

A better approach would return UTF-32 from entity_lookup(): from this, a renderer could easily

mity commented 7 years ago

That should be trivial to implement, as md2html already knows how to render a UTF-32 code point. But I would wait until @craigbarnes finishes #8.

tin-pot commented 7 years ago

«[...] as md2html already knows how to render a UTF-32 code point»

Exactly. The main effort would be to generate the entity_table[] initializer in entity.c again, this time with regular code points. Unfortunately, some replacement texts are longer than a single UCS character (but I think no longer than two), so this should suffice:

struct entity {
    const char     *name;     /* Sneak in a better identifier than "verbatim" ;-) */
    unsigned long   utf32[2]; /* One or two UCS code points. Second is unused if U+0000. */
};
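As a rough illustration of how such a table might be consumed, the sketch below repeats the proposed struct and looks entries up with `bsearch()`. The three-entry table, the `&name;` spelling of the keys, and the names `entity_table_sketch`, `entity_lookup_sketch` and `entity_length` are assumptions made for this example; the real entity.c table is generated and its lookup function has its own signature.

```c
#include <stdlib.h>
#include <string.h>

struct entity {
    const char     *name;
    unsigned long   utf32[2]; /* One or two UCS code points. Second is unused if U+0000. */
};

/* Illustrative excerpt only; must stay sorted by strcmp() order for bsearch(). */
static const struct entity entity_table_sketch[] = {
    { "&NotEqualTilde;", { 0x2242UL, 0x0338UL } }, /* two code points */
    { "&amp;",           { 0x0026UL, 0 } },
    { "&euro;",          { 0x20ACUL, 0 } }
};

/* bsearch() comparator: key is a NUL-terminated name, elem a table entry. */
static int entity_cmp(const void *key, const void *elem)
{
    return strcmp((const char *)key, ((const struct entity *)elem)->name);
}

/* Hypothetical lookup: returns the matching entry, or NULL if unknown. */
const struct entity *entity_lookup_sketch(const char *name)
{
    return bsearch(name, entity_table_sketch,
                   sizeof entity_table_sketch / sizeof entity_table_sketch[0],
                   sizeof entity_table_sketch[0], entity_cmp);
}

/* Number of code points in the replacement text (1 or 2). */
size_t entity_length(const struct entity *e)
{
    return e->utf32[1] != 0 ? 2 : 1;
}
```

A renderer would then loop over `entity_length(e)` code points and pass each one to whatever routine it already uses to emit a single code point in the output encoding.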