jquery / esprima

ECMAScript parsing infrastructure for multipurpose analysis
http://esprima.org
BSD 2-Clause "Simplified" License

High Surrogate Unicode value is wrong. #2098

Closed. MangaD closed this issue 3 years ago

MangaD commented 3 years ago

High Surrogate Unicode value is wrong.

Reproduce:

  1. Go to: https://esprima.org/demo/parse.html# (or use parseScript in node)
  2. Paste: "\ud800"
  3. Copy the value shown in the Tree into https://r12a.github.io/app-conversion/ (the character may be invisible, but copying it together with the surrounding text works)
  4. Notice that the escape the converter reports is U+FFFD, not U+D800
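
The copy/paste step above can be avoided by reading the UTF-16 code units of the string directly. The following is a minimal sketch in plain JavaScript; `codeUnits` is a hypothetical helper written for this illustration, not part of Esprima, and it assumes the value you want to check is available as an ordinary JavaScript string (for example the literal's value taken from the parsed tree).

```javascript
// Hypothetical helper: list the UTF-16 code units of a string as 4-digit hex.
// This reads the string in memory, so no clipboard or UTF-8 conversion is involved.
function codeUnits(str) {
  const units = [];
  for (let i = 0; i < str.length; i++) {
    units.push(str.charCodeAt(i).toString(16).toUpperCase().padStart(4, "0"));
  }
  return units;
}

// A lone high surrogate is a legal JavaScript string value with a single code unit.
console.log(codeUnits("\ud800")); // → ["D800"]

// A proper surrogate pair, for comparison.
console.log(codeUnits("\ud83d\ude00")); // → ["D83D", "DE00"]
```

If this prints D800 for the value in question, the U+FFFD is most likely introduced later, when the string is converted for the clipboard or for another tool.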

MangaD commented 3 years ago

After researching UTF-8 and UTF-16, I found an explanation for this behavior, so I am closing this issue. The relevant passages from the Wikipedia article on UTF-8:

Since RFC 3629 (November 2003), the high and low surrogate halves used by UTF-16 (U+D800 through U+DFFF) and code points not encodable by UTF-16 (those after U+10FFFF) are not legal Unicode values, and their UTF-8 encoding must be treated as an invalid byte sequence. Not decoding unpaired surrogate halves makes it impossible to store invalid UTF-16 (such as Windows filenames or UTF-16 that has been split between the surrogates) as UTF-8.

Some implementations of decoders throw exceptions on errors.[26] This has the disadvantage that it can turn what would otherwise be harmless errors (such as a "no such file" error) into a denial of service.

An alternative practice is to replace errors with a replacement character.

The standard also recommends replacing each error with the replacement character "�" (U+FFFD).

https://en.wikipedia.org/wiki/UTF-8
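
The behavior described in the quoted passages can be observed with the standard `TextEncoder` and `TextDecoder` globals (available in browsers and recent Node versions). This is only a sketch of the general mechanism; whether the demo page or the conversion tool goes through exactly these APIs is an assumption, not something verified here.

```javascript
// Encoding: a lone surrogate cannot be represented in UTF-8,
// so the encoder substitutes U+FFFD, whose UTF-8 bytes are EF BF BD.
const bytes = new TextEncoder().encode("\ud800");
const hex = Array.from(bytes, (b) => b.toString(16).padStart(2, "0")).join(" ");
console.log(hex); // → "ef bf bd"

// Decoding: ED A0 80 is what U+D800 would look like if surrogates were
// encodable in UTF-8. A conforming decoder treats it as invalid input.
const surrogateBytes = new Uint8Array([0xed, 0xa0, 0x80]);

// Default mode: errors are replaced with U+FFFD instead of being reported.
const lenient = new TextDecoder("utf-8").decode(surrogateBytes);
const allReplacement = [...lenient].every((ch) => ch === "\ufffd");
console.log(allReplacement); // → true

// Fatal mode: the same input raises an exception instead.
let strictThrew = false;
try {
  new TextDecoder("utf-8", { fatal: true }).decode(surrogateBytes);
} catch (err) {
  strictThrew = true;
}
console.log(strictThrew); // → true
```

The two decoder modes correspond to the two practices quoted above: throwing on errors versus substituting the replacement character.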