mity / md4c

C Markdown parser. Fast. SAX-like interface. Compliant to CommonMark specification.
MIT License
779 stars 146 forks source link

Erroneous UTF-16 Surrogate Decoding #3

Closed tin-pot closed 7 years ago

tin-pot commented 7 years ago

The pertinent macros to detect and decode UTF-16 surrogate code units in md4c.c are (7d20152c39dbf094a774bbf34a808bf689dd2b6a):

#define IS_UTF16_SURROGATE_HI(word)     (((WORD)(word) & 0xfc) == 0xd800)
#define IS_UTF16_SURROGATE_LO(word)     (((WORD)(word) & 0xfc) == 0xdc00)
#define UTF16_DECODE_SURROGATE(hi, lo)  ((((unsigned)(hi) & 0x3ff) << 10) | (((unsigned)(lo) & 0x3ff) << 0))

The constant 0xfc in the first two lines should read 0xfc00.

The cast to WORD seems pointless, the actual argument to these macros is always a wchar_t expression, which (in MSC) is promoted to 32-bit int without sign extension. (Furthermore WORD is defined in <windows.h>, and currently the only name used from there ...!). Thus the defining expression could be:

((word) & 0xfc00) == 0xd800

The expression to compose the Unicode code point from the two 10-bit fragments omits the bias value 0x10000; it should read:

0x10000 + ((((hi) & 0x3ff) << 10) | ((lo) & 0x3ff))

or - using only required parentheses -:

0x10000 + ( ((hi) & 0x3ff) << 10  |  (lo) & 0x3ff )
mity commented 7 years ago

Thanks for the report.

Grrr. I will have to make some kind of test suite to cover the UTF-16 soon to avoid this kind of issues. The problem is the CommonMark testsuite I currently reuse is not ready for it at all.

tin-pot commented 7 years ago

I haven't found this one by testing, either. The only reason I've seen it just reading the code was that I made the exact same error once and remember painfully well how much time I wasted staring at operator precedence, sign extension in sub-expressions etc until the 0x10000 shift dawned upon me - if that gives you consolation ;-)