Accidentally parsing HTML entities

isaacs / sax-js

A sax style parser for JS

Other

1.09k stars 325 forks source link

Accidentally parsing HTML entities #97

Open Pomax opened 11 years ago

Pomax commented 11 years ago

I was using your sax package because it claims to be be XML only, but it turns out it's also parsing entities that are decidedly not XML entities. XML only has five (quot, amp, apos, lt and gt), but to my surprise it also happily replaced input that contained ∫ with output that contained the integral symbol, rather than leaving it alone. I was using non-strict parsing, since the XML file I'm parsing has a huge list of entities that I'd rather have ignored and passed literally than errored on, but it seems to be doing more than simply passing them on =)

Demonstrator code:

var sax = require('sax'), parser = sax.parser(false);
parser.ontext = function(t) { console.log(t); }
parser.write("<xml>&int;</xml>").close();

This produces the output ∫

isaacs commented 11 years ago

I'd accept a patch to make this a configurable option. You already have the test mostly written :)

Pomax commented 11 years ago

unfortunately, the number of projects I'm already working on doesn't afford me much time for writing that in the near future, so I'm going to have to leave it at only signalling the problem for now =(

isaacs commented 11 years ago

I can feel that.

joeyhoer commented 10 years ago

I've been tracking down a bug in SVGO (which uses sax for SVG parsing), and I believe it to be directly related to this issue. See https://github.com/svg/svgo/issues/140 for a short history.

Performing a test similar to the one above using parser.write("<svg>🔥</svg>").close(); will incorrectly encode the hex entity. The expected result is :fire:, but the result is  (which is equivalent to ).

Update: Upon further investigation, this appears to be an issue in using the String.fromCharCode method. Apparently, it doesn't work so well on higher values since the higher code point characters use two (lower value) "surrogate" numbers to form a single character. Looks like we should be using a shim for the ES6 String.fromCodePoint method (either the one included on the MDN article or something more robust).

isaacs commented 10 years ago

Astral plane chars are going to fail because it's calling String.fromCharCode on the parsed int, which only looks at the two right-most bytes. What you want is for the string to be "\ud83d\udd25", with the surrogate pair split up.

I'd accept a patch that properly split up surrogate pairs before calling String.fromCharCode().

For example:

> fire = String.fromCharCode(0xd83d) + String.fromCharCode(0xdd25)
////// Github won't let me post this.  But it prints out the actual fire glyph on my terminal.
> fire.length
2
> fire.charCodeAt(0)
55357
> fire.charCodeAt(1)
56613
> mojibake = String.fromCharCode(0x1f525)
''
> mojibake.length
1
> mojibake.charCodeAt(0)
62757
> mojibake.charCodeAt(0).toString(16)
'f525'

Pomax commented 10 years ago

that seems a different issue than the one this issue was opened for, though.

isaacs commented 10 years ago

Oh, right, it is different, that's right. Reopening