What should happen when code points from the overrides table are encoded?

mathiasbynens / he

A robust HTML entity encoder/decoder written in JavaScript.

MIT License

http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html#table-charref-overrides

Current behavior:

> he.encode('\x80')
'&#x80;'

&#x80; is an invalid character reference (a parse error), but then again, using the raw U+0080 character is just as invalid. The difference is that a raw U+0080 in HTML source gets ignored, while &#x80; becomes € due to the overrides table.
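To make the mismatch concrete, here is a rough sketch of what a parser does with the encoded output. The table is only a handful of rows copied from the overrides table linked above, and `resolveNumericReference` is a hypothetical helper written for this illustration, not something he exposes.

```javascript
// A few rows from the spec's overrides table (not the full list).
const OVERRIDES = new Map([
  [0x80, 0x20AC], // EURO SIGN
  [0x82, 0x201A], // SINGLE LOW-9 QUOTATION MARK
  [0x83, 0x0192], // LATIN SMALL LETTER F WITH HOOK
  [0x85, 0x2026], // HORIZONTAL ELLIPSIS
  [0x99, 0x2122], // TRADE MARK SIGN
]);

// Hypothetical helper: resolve the code point of a numeric character
// reference roughly the way an HTML parser would, applying the override
// when the table has one.
function resolveNumericReference(codePoint) {
  const resolved = OVERRIDES.has(codePoint)
    ? OVERRIDES.get(codePoint)
    : codePoint;
  return String.fromCodePoint(resolved);
}

const original = '\x80';
const encoded = '&#x80;'; // what he.encode('\x80') returns, per the output above

// Pull the hex digits out of '&#x80;' and resolve them.
const roundTripped = resolveNumericReference(parseInt(encoded.slice(3, -1), 16));

console.log(roundTripped === original); // false
console.log(roundTripped); // €
```

So the encoded form does not survive a round trip through a parser: U+0080 goes in, U+20AC comes out.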

Should we continue to return invalid entities, knowing they might map to a completely different symbol? Or should we not escape any invalid code points in the input? Or should we strip invalid characters from the input? Should there be a strict option for encode as well (just like there is for decode) which errors in case an invalid character is part of the source?
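For the sake of discussion, the four options could look roughly like this. Everything here is hypothetical: `encodeC1` and the policy names are made up for this sketch and are not part of he's API, and treating the whole U+0080 to U+009F range as affected is a simplifying assumption (the real table has its own exact list of code points).

```javascript
// Assumption: every code point in U+0080..U+009F counts as "invalid" here.
const isProblematic = (codePoint) => codePoint >= 0x80 && codePoint <= 0x9F;

// Hypothetical function illustrating the four behaviors asked about above.
// It only touches the problematic range; all other symbols pass through.
function encodeC1(input, policy) {
  let output = '';
  for (const symbol of input) {
    const codePoint = symbol.codePointAt(0);
    if (!isProblematic(codePoint)) {
      output += symbol;
      continue;
    }
    const hex = codePoint.toString(16).toUpperCase();
    switch (policy) {
      case 'escape': // current behavior: emit the (invalid) reference
        output += '&#x' + hex + ';';
        break;
      case 'passthrough': // leave the raw character unescaped
        output += symbol;
        break;
      case 'strip': // drop the character entirely
        break;
      case 'strict': // refuse to encode, like a strict mode would
        throw new Error('Invalid code point in input: U+' + hex.padStart(4, '0'));
      default:
        throw new TypeError('Unknown policy: ' + policy);
    }
  }
  return output;
}

console.log(encodeC1('a\x80b', 'escape')); // a&#x80;b
console.log(encodeC1('a\x80b', 'strip')); // ab
```

Only 'strip' and 'strict' guarantee output that cannot be reinterpreted as a different symbol; 'escape' and 'passthrough' both leave something invalid in the result.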

cc @zcorpan

What should happen when code points from the overrides table are encoded? #19