Let the `entity_lookup()` function return UTF-32, not UTF-8

tin-pot commented 7 years ago

The fact that entity_lookup() currently returns the replacement text as a UTF-8 octet sequence is convenient for UTF-8 output (of course), but clumsy otherwise: when generating UTF-16 output, md2html would first have to decode the UTF-8 sequence into a code point, and then write that code point as one or two UTF-16 code units.

While the latter step is trivial, the former is just an unnecessary burden (and in fact, md2html.c won't replace entity references when generating UTF-16 right now).
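To make the two steps concrete, here is a minimal sketch of both conversions. The function names `utf8_decode()` and `utf16_encode()` are made up for this illustration and are not part of md4c or md2html; the decoder is simplified and does not reject overlong forms or encoded surrogates.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one UTF-8 sequence from s (at most len bytes)
 * into *cp. Returns the number of bytes consumed, or 0 on malformed input.
 * Simplified: overlong forms and encoded surrogates are not rejected. */
size_t utf8_decode(const unsigned char *s, size_t len, uint32_t *cp)
{
    size_t n, i;

    if (len == 0) return 0;
    if (s[0] < 0x80)                { *cp = s[0];        n = 1; }
    else if ((s[0] & 0xE0) == 0xC0) { *cp = s[0] & 0x1F; n = 2; }
    else if ((s[0] & 0xF0) == 0xE0) { *cp = s[0] & 0x0F; n = 3; }
    else if ((s[0] & 0xF8) == 0xF0) { *cp = s[0] & 0x07; n = 4; }
    else return 0;                  /* stray continuation or invalid lead byte */

    if (n > len) return 0;          /* truncated sequence */
    for (i = 1; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;
        *cp = (*cp << 6) | (s[i] & 0x3F);
    }
    return n;
}

/* Hypothetical helper: encode cp as UTF-16 code units.
 * Returns 1 or 2 (units written), or 0 if cp is not a valid scalar value. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  /* lone surrogate */
        out[0] = (uint16_t)cp;
        return 1;
    }
    if (cp > 0x10FFFF) return 0;
    cp -= 0x10000;
    out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate */
    return 2;
}
```

The encoder is the short part; the decoder is the work a UTF-32 return value would make unnecessary.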

A better approach would be to return UTF-32 from entity_lookup(). From this, a renderer could easily

output the replacement text in UTF-8,
output the replacement text in UTF-16,
output the replacement text in ASCII (using numeric character references for non-ASCII code points), or
output the replacement text in Latin 1 (ditto, for code points beyond U+00FF).
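As a rough sketch of the ASCII and Latin 1 cases, a single routine could take the highest directly representable code point as a parameter and fall back to a numeric character reference above it. The function `render_codepoint_narrow()` and its parameters are hypothetical, not existing md2html code; HTML escaping of characters such as `&` and `<` would still have to be handled separately.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: write cp into buf for an 8-bit output charset.
 * max_direct is the highest code point the charset can represent directly
 * (0x7F for ASCII, 0xFF for Latin 1); anything above it is written as a
 * hexadecimal numeric character reference.
 * Returns the number of chars written (excluding the NUL terminator),
 * or 0 if buf is too small. */
size_t render_codepoint_narrow(unsigned long cp, unsigned long max_direct,
                               char *buf, size_t size)
{
    int n;

    if (cp <= max_direct) {
        if (size < 2) return 0;
        buf[0] = (char)(unsigned char)cp;
        buf[1] = '\0';
        return 1;
    }
    n = snprintf(buf, size, "&#x%lX;", cp);
    return (n > 0 && (size_t)n < size) ? (size_t)n : 0;
}
```

With `max_direct` set to 0x7F, U+00E9 comes out as `&#xE9;`; with 0xFF it is written as the single byte 0xE9.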

"md2html.c - render_ucs_codepoint() for UTF-8 and UTF-16 output"

mity commented 7 years ago

That should be trivial to implement, as md2html already knows how to render a UTF-32 code point. But I would wait until @craigbarnes finishes #8.

tin-pot commented 7 years ago

«[...] as md2html already knows how to render a UTF-32 code point»

Exactly. The main effort would be to generate the entity_table[] initializer in entity.c again, this time with regular code points. Unfortunately, some replacement texts are longer than a single UCS character (but I think no longer than two), so this should suffice:

struct entity {
    const char     *name;     /* Sneak in a better identifier than "verbatim" ;-) */
    unsigned long   utf32[2]; /* One or two UCS code points. Second is unused if U+0000. */
};
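For illustration only, a lookup over such a table might look like the sketch below. It repeats the struct so that the block compiles on its own; the four-entry `sample_table`, `sample_entity_lookup()`, and `entity_length()` are invented for this example and do not reflect the actual contents or interface of entity.c. Storing names with the leading `&` and trailing `;` is likewise just an assumption here.

```c
#include <stdlib.h>
#include <string.h>

struct entity {
    const char     *name;
    unsigned long   utf32[2]; /* One or two UCS code points. Second is unused if U+0000. */
};

/* Tiny illustrative excerpt; must stay sorted by name in strcmp() order
 * (uppercase letters sort before lowercase ones). */
static const struct entity sample_table[] = {
    { "&NotEqualTilde;", { 0x2242, 0x0338 } },
    { "&amp;",           { 0x0026, 0 } },
    { "&copy;",          { 0x00A9, 0 } },
    { "&fjlig;",         { 0x0066, 0x006A } },
};

/* bsearch() comparator: key is a NUL-terminated name, elem a table entry. */
static int entity_cmp(const void *key, const void *elem)
{
    return strcmp((const char *)key, ((const struct entity *)elem)->name);
}

/* Hypothetical lookup: returns the matching entry, or NULL if unknown. */
const struct entity *sample_entity_lookup(const char *name)
{
    return bsearch(name, sample_table,
                   sizeof sample_table / sizeof sample_table[0],
                   sizeof sample_table[0], entity_cmp);
}

/* Number of code points in the replacement text (1 or 2). */
size_t entity_length(const struct entity *e)
{
    return e->utf32[1] != 0 ? 2 : 1;
}
```

A renderer would then loop over `entity_length(e)` code points and pass each one to its existing code point output routine.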
