Closed mathiasbynens closed 10 years ago
Test document containing a raw U+0080 character + a character reference for U+0080: data:text/html;charset=utf-8,foo%C2%80bar&#x80;baz
The raw character is not really ignored – copy-pasting the text reveals that U+0080 is still part of the text (at least in Opera/Chromium). If this is the standard behavior (?) then the only way to preserve the character in the input is to return it as-is, and not escape it. Since `he` is supposed to emulate browsers, we should probably do it this way.
Note that https://github.com/mathiasbynens/he#strict needs to be updated after this change is made.
http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html#table-charref-overrides
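
For illustration, here is a minimal sketch of how the overrides table affects a numeric character reference. It is not `he`'s actual implementation: the names `CHARREF_OVERRIDES` and `resolveNumericReference` are hypothetical, and only three entries of the table are reproduced.

```javascript
// Excerpt of the character reference overrides table (hypothetical name).
// The full table in the HTML spec has more entries; these three are examples.
const CHARREF_OVERRIDES = new Map([
  [0x80, 0x20AC], // EURO SIGN
  [0x85, 0x2026], // HORIZONTAL ELLIPSIS
  [0x99, 0x2122], // TRADE MARK SIGN
]);

// Hypothetical helper: turn the number parsed from a numeric character
// reference into the string a browser would produce for it.
function resolveNumericReference(codePoint) {
  const override = CHARREF_OVERRIDES.get(codePoint);
  return String.fromCodePoint(override !== undefined ? override : codePoint);
}

console.log(resolveNumericReference(0x80) === '\u20AC'); // true: U+0080 is overridden
console.log(resolveNumericReference(0x41) === 'A');      // true: no override applies
```

This is why escaping a raw U+0080 as a numeric reference does not round-trip: decoding the reference goes through the table and yields a different character.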
Current behavior:
`&#x80;` is an invalid character reference (parse error), but then again, using the raw U+0080 character is just as invalid. The difference is that U+0080 in HTML source gets ignored, while `&#x80;` becomes `€` due to the overrides table.
Should we continue to return invalid entities, knowing they might map to a completely different symbol? Or should we not escape any invalid code points in the input? Or should we strip invalid characters from the input? Should there be a `strict` option for `encode` as well (just like there is for `decode`) which errors in case an invalid character is part of the source?
cc @zcorpan
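
To make the options being weighed concrete, here is a rough sketch of the four candidate behaviors side by side. It does not use `he`'s API: `encodeInvalidCodePoints` and its `policy` values are hypothetical names, and as a simplifying assumption only the C1 range U+0080 to U+009F is treated as invalid.

```javascript
// Simplifying assumption: only U+0080..U+009F count as "invalid" here.
const INVALID_C1 = /[\x80-\x9F]/g;

// Hypothetical helper comparing the policies discussed in this issue.
function encodeInvalidCodePoints(input, policy) {
  return input.replace(INVALID_C1, (ch) => {
    const hex = ch.charCodeAt(0).toString(16).toUpperCase();
    switch (policy) {
      case 'escape': // emit a numeric reference, even though it is invalid
        return '&#x' + hex + ';';
      case 'raw': // return the character as-is, unescaped
        return ch;
      case 'strip': // drop the character from the output
        return '';
      case 'strict': // refuse to encode invalid input
        throw new Error('Invalid code point U+' + hex.padStart(4, '0'));
      default:
        throw new TypeError('Unknown policy: ' + policy);
    }
  });
}

console.log(encodeInvalidCodePoints('foo\x80bar', 'escape')); // foo&#x80;bar
console.log(encodeInvalidCodePoints('foo\x80bar', 'strip'));  // foobar
```

Only the `raw` policy preserves the input character through a browser-style decode; `escape` yields a reference that the overrides table turns into a different character.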