Open darrachequesne opened 8 years ago
@mathiasbynens does that implementation comply with what you had in mind? Could you please review when you have time?
Of course! It might take a while until I get around to it, though.
No problem! Please tell me if I can help in any way.
Hi @mathiasbynens! Do you know when you'll be able to review this PR, please?
@darrachequesne Does this handle the case of missing or extra continuation bytes?
The encoding 1110xxxx 10xxxxxx 10xxxxxx 0xxxxxxx (a 3-sequence followed by a 1-sequence) is well-formed and decodes to two code points. But if one of the “continuation bytes” were lost in transmission, 1110xxxx 10xxxxxx 0xxxxxxx would error. With {strict: false}, we would want the first character to resolve to U+FFFD instead of erroring, and the second character to resolve as normal. Example:
utf8.decode(
'\xE2\xAC\xE2\x82\xAC', // 11100010 10101100 11100010 10000010 10101100
{strict: false},
) === '\uFFFD\u20AC';
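For illustration, here is a minimal sketch of how a decoder could notice that a sequence is short of continuation bytes. The helper names (`expectedLength`, `isContinuation`, `missingContinuations`) are hypothetical and not part of utf8.js, and the input is assumed to be a byte string (every char code at most 0xFF):

```javascript
// Hypothetical helpers, not part of utf8.js.

// Length of the sequence announced by a lead byte, or 0 if `byte` is not a lead byte.
function expectedLength(byte) {
  if ((byte & 0x80) === 0x00) return 1; // 0xxxxxxx
  if ((byte & 0xE0) === 0xC0) return 2; // 110xxxxx
  if ((byte & 0xF0) === 0xE0) return 3; // 1110xxxx
  if ((byte & 0xF8) === 0xF0) return 4; // 11110xxx
  return 0; // 10xxxxxx (continuation byte) or an invalid lead byte
}

// True for bytes of the form 10xxxxxx.
function isContinuation(byte) {
  return (byte & 0xC0) === 0x80;
}

// Number of continuation bytes missing from the sequence that starts at `index`.
// Returns 0 when `index` does not point at a lead byte.
function missingContinuations(byteString, index) {
  const length = expectedLength(byteString.charCodeAt(index));
  if (length === 0) return 0;
  const needed = length - 1;
  let found = 0;
  while (
    found < needed &&
    index + 1 + found < byteString.length &&
    isContinuation(byteString.charCodeAt(index + 1 + found))
  ) {
    found++;
  }
  return needed - found;
}
```

With the byte string from the example, `missingContinuations('\xE2\xAC\xE2\x82\xAC', 0)` reports one missing byte for the first sequence, while the sequence starting at index 2 is complete.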
Likewise, 1110xxxx 10xxxxxx 10xxxxxx 10xxxxxx is not well-formed either. With strict turned off, the first character (the 3-sequence) should resolve as normal, but then U+FFFD should be returned for any remaining continuation bytes until the next “header byte” (that is, a byte starting with 00, 01, or 11) is found. Example:
utf8.decode(
'\xE2\x82\xAC\x82\xAC\xE2\x82\xAC', // 11100010 10000010 10101100 10000010 10101100 11100010 10000010 10101100
{strict: false},
) === '\u20AC\uFFFD\u20AC';
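To make both expectations concrete, here is a rough sketch of a lenient decoder. It is not the implementation in this PR. `decodeLenient` is a hypothetical name, the input is assumed to be a byte string, and a run of stray continuation bytes is assumed to collapse into a single U+FFFD, as in the example above (other decoders may emit one U+FFFD per stray byte):

```javascript
// Hypothetical sketch, not the utf8.js implementation.
const REPLACEMENT = '\uFFFD';
// Smallest code point allowed for 0, 1, 2, and 3 continuation bytes (rejects overlong forms).
const MIN_CODE_POINT = [0x0, 0x80, 0x800, 0x10000];

function decodeLenient(byteString) {
  let out = '';
  let i = 0;
  const n = byteString.length;
  while (i < n) {
    const byte = byteString.charCodeAt(i);

    // Stray continuation bytes: skip the whole run and emit one U+FFFD (assumption).
    if ((byte & 0xC0) === 0x80) {
      while (i < n && (byteString.charCodeAt(i) & 0xC0) === 0x80) i++;
      out += REPLACEMENT;
      continue;
    }

    // Header byte: work out how many continuation bytes it announces.
    let needed;
    let codePoint;
    if (byte < 0x80) { needed = 0; codePoint = byte; }
    else if ((byte & 0xE0) === 0xC0) { needed = 1; codePoint = byte & 0x1F; }
    else if ((byte & 0xF0) === 0xE0) { needed = 2; codePoint = byte & 0x0F; }
    else if ((byte & 0xF8) === 0xF0) { needed = 3; codePoint = byte & 0x07; }
    else { out += REPLACEMENT; i++; continue; } // 0xF8..0xFF are never valid
    i++;

    // Consume continuation bytes, stopping early at the next header byte.
    let found = 0;
    while (found < needed && i < n && (byteString.charCodeAt(i) & 0xC0) === 0x80) {
      codePoint = (codePoint << 6) | (byteString.charCodeAt(i) & 0x3F);
      i++;
      found++;
    }

    const valid =
      found === needed &&
      codePoint >= MIN_CODE_POINT[needed] &&
      codePoint <= 0x10FFFF &&
      !(codePoint >= 0xD800 && codePoint <= 0xDFFF); // lone surrogates are not scalar values
    out += valid ? String.fromCodePoint(codePoint) : REPLACEMENT;
  }
  return out;
}
```

Because a truncated sequence stops consuming at the next header byte, the character that follows it still decodes normally, which is the behaviour both examples ask for.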
Closes #2 and #5