mathiasbynens / utf8.js

A robust JavaScript implementation of a UTF-8 encoder/decoder, as defined by the Encoding Standard.
https://git.io/utf8js
MIT License
556 stars, 115 forks

Add error-tolerant mode #19

Open darrachequesne opened 8 years ago

darrachequesne commented 8 years ago

Closes #2 and #5

coveralls commented 8 years ago

Coverage increased (+0.4%) to 92.958% when pulling c373d19c7f4da64ed4fbb968385d9d8f9a530a95 on darrachequesne:patch-1 into 2fa80fac3fee7ef9a285f0fab45bceb86b59dd78 on mathiasbynens:master.

darrachequesne commented 8 years ago

@mathiasbynens does that implementation comply with what you had in mind? Could you please review when you have time?

mathiasbynens commented 8 years ago

Of course! It might take a while until I get around to it, though.

darrachequesne commented 8 years ago

No problem! Please tell me if I can help in any way.

darrachequesne commented 7 years ago

Hi @mathiasbynens! Do you know when you'll be able to review this PR, please?

coveralls commented 7 years ago

Coverage increased (+0.4%) to 92.958% when pulling 41c4eef0de26d9be4ba968657583bb2d2092db48 on darrachequesne:patch-1 into 5566334e1aa5347ba652c38dc186df08b47d8fb9 on mathiasbynens:master.

chharvey commented 3 years ago

@darrachequesne Does this handle the case of missing or extra continuation bytes?

The encoding 1110xxxx 10xxxxxx 10xxxxxx 0xxxxxxx (a 3-byte sequence followed by a 1-byte sequence) is well-formed and decodes to two code points. But if one of the continuation bytes were lost in transmission, the result, 1110xxxx 10xxxxxx 0xxxxxxx, would throw an error. With {strict: false}, we would want the first character to resolve to U+FFFD instead of erroring, and the second character to resolve as normal. Example (here the second character is itself a 3-byte sequence):

utf8.decode(
    '\xE2\xAC\xE2\x82\xAC', // 11100010 10101100 11100010 10000010 10101100
    {strict: false},
) === '\uFFFD\u20AC';

Likewise, 1110xxxx 10xxxxxx 10xxxxxx 10xxxxxx is not well-formed either. With strict turned off, the first character (the 3-byte sequence) should resolve as normal, but then a single U+FFFD should be returned for the run of remaining continuation bytes, up to the next “header byte” (that is, a byte whose top two bits are 00, 01, or 11). Example:

utf8.decode(
    '\xE2\x82\xAC\x82\xAC\xE2\x82\xAC', // 11100010 10000010 10101100 10000010 10101100 11100010 10000010 10101100
    {strict: false},
) === '\u20AC\uFFFD\u20AC';
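
The two cases above can be sketched as a small standalone decoder. This is only an illustration of the proposed behavior, not the code in this PR: `decodeLenient` is a hypothetical name, and the function assumes its input is a "byte string" in which every character code is in the range 0x00 to 0xFF, as in the examples above.

```javascript
// Hypothetical helper (not part of utf8.js): leniently decodes a byte string,
// replacing malformed input with U+FFFD instead of throwing.
function decodeLenient(byteString) {
  const bytes = Array.from(byteString, (ch) => ch.charCodeAt(0) & 0xff);
  const isContinuation = (b) => (b & 0xc0) === 0x80; // 10xxxxxx
  let out = '';
  let i = 0;

  while (i < bytes.length) {
    const b = bytes[i];

    if (b < 0x80) { // 0xxxxxxx: single-byte sequence
      out += String.fromCharCode(b);
      i++;
      continue;
    }

    if (isContinuation(b)) {
      // Stray continuation bytes: skip the whole run and emit one U+FFFD.
      while (i < bytes.length && isContinuation(bytes[i])) i++;
      out += '\uFFFD';
      continue;
    }

    // Header byte of a multi-byte sequence.
    let needed;
    let codePoint;
    if ((b & 0xe0) === 0xc0) { needed = 1; codePoint = b & 0x1f; }
    else if ((b & 0xf0) === 0xe0) { needed = 2; codePoint = b & 0x0f; }
    else if ((b & 0xf8) === 0xf0) { needed = 3; codePoint = b & 0x07; }
    else { out += '\uFFFD'; i++; continue; } // 0xF8..0xFF are never valid

    i++;
    let got = 0;
    while (got < needed && i < bytes.length && isContinuation(bytes[i])) {
      codePoint = (codePoint << 6) | (bytes[i] & 0x3f);
      i++;
      got++;
    }

    // Truncated sequence: emit U+FFFD and resume at the byte that ended it.
    if (got < needed) { out += '\uFFFD'; continue; }

    // Reject overlong forms, surrogates, and values above U+10FFFF.
    const minimum = [0, 0x80, 0x800, 0x10000][needed];
    const invalid = codePoint < minimum || codePoint > 0x10ffff ||
      (codePoint >= 0xd800 && codePoint <= 0xdfff);
    out += invalid ? '\uFFFD' : String.fromCodePoint(codePoint);
  }
  return out;
}

console.log(decodeLenient('\xE2\xAC\xE2\x82\xAC') === '\uFFFD\u20AC'); // true
console.log(decodeLenient('\xE2\x82\xAC\x82\xAC\xE2\x82\xAC') === '\u20AC\uFFFD\u20AC'); // true
```

Collapsing a run of stray continuation bytes into a single U+FFFD follows the second example above. It is a design choice rather than the only option: the decoder in the Encoding Standard (as exposed by TextDecoder) emits one U+FFFD per stray continuation byte instead.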