MangaD closed this issue 3 years ago
After careful research on UTF-8 and UTF-16, I have found an explanation for this behavior and will therefore close this issue.
Since RFC 3629 (November 2003), the high and low surrogate halves used by UTF-16 (U+D800 through U+DFFF) and code points not encodable by UTF-16 (those after U+10FFFF) are not legal Unicode values, and their UTF-8 encoding must be treated as an invalid byte sequence. Not decoding unpaired surrogate halves makes it impossible to store invalid UTF-16 (such as Windows filenames or UTF-16 that has been split between the surrogates) as UTF-8.
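To make the surrogate range concrete, here is a minimal sketch of a check for unpaired surrogate halves in a JavaScript string. The helper name hasLoneSurrogate is hypothetical and is not part of any library discussed in this issue.

```javascript
// Hypothetical helper: returns true if `str` contains a surrogate code unit
// (U+D800 through U+DFFF) that is not part of a valid high+low pair.
function hasLoneSurrogate(str) {
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    if (unit >= 0xd800 && unit <= 0xdbff) {
      // High surrogate: it must be followed by a low surrogate.
      const next = str.charCodeAt(i + 1); // NaN when past the end
      if (next >= 0xdc00 && next <= 0xdfff) {
        i++; // skip the second half of a valid pair
        continue;
      }
      return true;
    }
    if (unit >= 0xdc00 && unit <= 0xdfff) {
      return true; // low surrogate without a preceding high surrogate
    }
  }
  return false;
}

console.log(hasLoneSurrogate("\ud800"));       // true
console.log(hasLoneSurrogate("\ud83d\ude00")); // false (valid pair)
console.log(hasLoneSurrogate("abc"));          // false
```

A string for which this returns true has no valid UTF-8 representation, so any conversion to UTF-8 has to either fail or substitute something.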
Some implementations of decoders throw exceptions on errors.[26] This has the disadvantage that it can turn what would otherwise be harmless errors (such as a "no such file" error) into a denial of service.
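As an illustration of a decoder that throws, the standard TextDecoder can be put in strict mode with { fatal: true }. The sketch below assumes a runtime that provides TextDecoder globally (recent Node.js or a browser); the wrapper tryDecode is a hypothetical name used only here.

```javascript
// Would-be UTF-8 encoding of the lone surrogate U+D800,
// which RFC 3629 treats as an invalid byte sequence.
const surrogateBytes = new Uint8Array([0xed, 0xa0, 0x80]);

// Strict decoder: decode() throws a TypeError on invalid input.
const strict = new TextDecoder("utf-8", { fatal: true });

// Hypothetical wrapper that converts the exception into a null result,
// so a single bad sequence does not abort the caller.
function tryDecode(bytes) {
  try {
    return strict.decode(bytes);
  } catch (err) {
    if (err instanceof TypeError) return null;
    throw err; // anything else is unexpected
  }
}

console.log(tryDecode(surrogateBytes));         // null
console.log(tryDecode(new Uint8Array([0x61]))); // a
```

Without a wrapper like this, the exception propagates to the caller, which is the failure mode the quoted text warns about.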
An alternative practice is to replace errors with a replacement character.
The standard also recommends replacing each error with the replacement character "�" (U+FFFD).
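The replacement behavior can be observed directly in JavaScript. This is a small sketch using the standard TextEncoder and TextDecoder, assuming a runtime that exposes them globally. It is only meant to show why a lone surrogate turns into U+FFFD once the string is converted to UTF-8, not to describe what any particular tool does internally.

```javascript
// In memory, the string still holds the lone high surrogate.
const lone = "\ud800";
console.log(lone.charCodeAt(0).toString(16)); // d800

// Encoding to UTF-8 cannot represent it, so TextEncoder substitutes
// U+FFFD, whose UTF-8 encoding is EF BF BD.
const bytes = new TextEncoder().encode(lone);
const hex = Array.from(bytes, (b) => b.toString(16).padStart(2, "0")).join(" ");
console.log(hex); // ef bf bd

// Decoding those bytes yields the replacement character, not U+D800.
const roundTripped = new TextDecoder().decode(bytes);
console.log(roundTripped === "\ufffd"); // true
```

The original code unit is lost after the round trip, which matches the U+FFFD seen when the value is copied out as text.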
High Surrogate Unicode value is wrong.
Reproduce:
Run parseScript on "\ud800" (in node).
Copy value in the Tree to https://r12a.github.io/app-conversion/ (may be invisible, but copying with the surrounding text works).
Result: U+FFFD