mathiasbynens / he

A robust HTML entity encoder/decoder written in JavaScript.
https://mths.be/he
MIT License
What should happen when code points from the overrides table are encoded? #19

Closed mathiasbynens closed 10 years ago

mathiasbynens commented 10 years ago

http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html#table-charref-overrides

Current behavior:

> he.encode('\x80')
'&#x80;'

&#x80; is an invalid character reference (parse error), but then again, using the raw U+0080 character is just as invalid. The difference is that U+0080 in HTML source gets ignored, while &#x80; becomes € due to the overrides table.
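To make the override step concrete, here is a minimal sketch of how a numeric character reference is remapped. The `OVERRIDES` map and the `resolveNumericReference` helper are hypothetical names, not part of he, and only three rows of the table are copied in; the tokenizer's other checks (NULL, surrogates, out-of-range values) are left out.

```javascript
// Partial copy of the overrides table, for illustration only.
const OVERRIDES = new Map([
  [0x80, 0x20AC], // EURO SIGN
  [0x82, 0x201A], // SINGLE LOW-9 QUOTATION MARK
  [0x85, 0x2026], // HORIZONTAL ELLIPSIS
]);

// Hypothetical helper: turn the number from a numeric character reference
// into the character it produces once the override step has been applied.
function resolveNumericReference(codePoint) {
  const mapped = OVERRIDES.has(codePoint) ? OVERRIDES.get(codePoint) : codePoint;
  return String.fromCodePoint(mapped);
}

console.log(resolveNumericReference(0x80) === '\u20AC'); // true: the reference yields the euro sign
console.log(resolveNumericReference(0x41)); // 'A': not in the table, so unchanged
```

So an encoder that emits a reference for U+0080 does not round-trip: the decoded output is U+20AC, not the original character.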

Should we continue to return invalid entities, knowing they might map to a completely different symbol? Or should we not escape any invalid code points in the input? Or should we strip invalid characters from the input? Should there be a strict option for encode as well (just like there is for decode) which errors in case an invalid character is part of the source?
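The four options above could be sketched roughly as follows. Everything here is an assumption made for illustration: `encodeCodePoint`, the policy names, and the partial `OVERRIDDEN` set are hypothetical, and the fallback for ordinary code points is simplified rather than a copy of he's actual encoding rules.

```javascript
// Partial list of code points that appear in the overrides table (illustration only).
const OVERRIDDEN = new Set([0x80, 0x82, 0x85]);

// Format a code point as a hexadecimal character reference.
function toReference(codePoint) {
  return '&#x' + codePoint.toString(16).toUpperCase() + ';';
}

// Hypothetical encoder for a single code point under one of four policies.
function encodeCodePoint(codePoint, policy) {
  const raw = String.fromCodePoint(codePoint);
  if (!OVERRIDDEN.has(codePoint)) {
    // Simplified fallback: keep ASCII as-is, escape everything else.
    return codePoint < 0x80 ? raw : toReference(codePoint);
  }
  switch (policy) {
    case 'escape': // current behavior: emit the invalid reference
      return toReference(codePoint);
    case 'keep': // leave the invalid code point unescaped
      return raw;
    case 'strip': // drop the invalid code point from the output
      return '';
    case 'strict': // refuse to encode invalid input
      throw new RangeError('Invalid code point: U+' +
        codePoint.toString(16).toUpperCase().padStart(4, '0'));
    default:
      throw new TypeError('Unknown policy: ' + policy);
  }
}

console.log(encodeCodePoint(0x80, 'escape')); // '&#x80;'
console.log(encodeCodePoint(0x80, 'keep') === '\x80'); // true
console.log(encodeCodePoint(0x80, 'strip') === ''); // true
```

Only the 'keep' policy returns output that still contains U+0080 itself; 'escape' yields a reference that browsers remap, and 'strip' loses the character.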

cc @zcorpan

mathiasbynens commented 10 years ago

Test document containing a raw U+0080 character + a character reference for U+0080: data:text/html;charset=utf-8,foo%C2%80bar&#x80;baz

The raw character is not really ignored – copy-pasting the text reveals that U+0080 is still part of the text (at least in Opera/Chromium). If this is the standard behavior (?) then the only way to preserve the character in the input is to return it as-is, and not escape it. Since he is supposed to emulate browsers, we should probably do it this way.

Note that https://github.com/mathiasbynens/he#strict needs to be updated after this change is made.