mathiasbynens / he

A robust HTML entity encoder/decoder written in JavaScript.
https://mths.be/he
MIT License

For strictly browser-side code, is there any reason to use this library in favour of hacks involving DOM elements' innerHTML and innerText properties? #18

Closed ExplodingCabbage closed 10 years ago

ExplodingCabbage commented 10 years ago

Consider the following Stack Overflow answer to the question "How to decode HTML entities using jQuery?":

Just do:

var decoded = $('<textarea/>').html(encoded).val();

where encoded is your string containing HTML entities that you wish to decode.

This works similarly to the accepted answer, but is safe to use with untrusted user input.

As noted by Mike Samuel, doing this with a <div> instead of a <textarea> with untrusted user input is an XSS vulnerability, even if the <div> is never added to the DOM:

// Shows the alert in Firefox and Safari (and returns an empty string)
$("<div/>").html(
    '<img src="//www.google.com/images/logos/ps_logo2.png" onload=alert(1337)>'
).text()

However, this attack is not possible against a <textarea> because there are no HTML elements that are permitted content of a <textarea>. Consequently, any HTML tags still present in the 'encoded' string will be automatically entity-encoded by the browser.

// This is safe (and returns the right answer)
$("<textarea/>").html(
    '<img src="//www.google.com/images/logos/ps_logo2.png" onload=alert(1337)>'
).text()

Previously, the answer just included the first code snippet. I recently edited the answer to note the rationale behind using a textarea instead of a div. However, I'm a little uneasy, because I know that your library exists and is not (as far as I can tell) strictly targeting node users. I find myself wondering why.

I'll probably post a link to this library as an answer to that question regardless (unless you'd like to do so yourself), since I figure that people who are using node may benefit from having a single solution that is usable both clientside and serverside. But how about everyone else? What reason is there for anyone to ship a 300-line script for a purpose that can - it seems to my naive eyes - be achieved in 50 characters with a clever hack?

Are there any situations at all in which the textarea hack fails (or at least is not guaranteed by spec to succeed)? I confess to being slightly uneasy about it since I don't know where (or for that matter, if) the spec determines the behaviour of browsers when presented with HTML elements containing disallowed children, like

<textarea>
    <p>I'm not really supposed to be here.</p>
</textarea>

but from the testing I've done, it seems to work.

Sorry to offload a question like this onto you, but it seems to be right in your area of expertise and is relevant when figuring out to whom this library is useful. (Indeed, if there is something profoundly wrong with the textarea hack, it almost seems worth noting that in this library's README - otherwise, the case for using a library for this purpose at all is unclear).

mathiasbynens commented 10 years ago

Good question!

The main goal of he is to encode non-ASCII symbols into HTML entities, and to be able to decode these in all their forms, i.e. he.encode() and he.decode().

The encoding part is probably most useful as part of a build script, or as part of a Node.js application that outputs the encoded data in a response. The decoding part is the hardest (and probably the main reason why one would use he), as there are so many different ways to encode each character, and there are a lot of weird exceptions and edge cases. If you want to decode HTML entities according to the spec, in any environment, then you definitely need he.
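To illustrate the "many different ways to encode each character" point, here is a minimal sketch that decodes only numeric character references. It is not he's implementation: the function name `decodeNumericReferences` is hypothetical, and the sketch deliberately ignores named references (such as `&copy;`), references without a trailing semicolon, and the spec's special cases (for example, the remapping of code points 0x80 to 0x9F), which are the kind of details a spec-compliant decoder has to handle.

```javascript
// Hypothetical helper (an assumption for illustration, not part of he):
// decodes only well-formed numeric references such as &#169; and &#xA9;.
function decodeNumericReferences(input) {
  return input.replace(/&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));/g, (match, hex, dec) => {
    const codePoint = hex !== undefined ? parseInt(hex, 16) : parseInt(dec, 10);
    // NUL, surrogates, and out-of-range values have dedicated rules in the
    // HTML spec; this sketch simply leaves such references untouched.
    if (
      codePoint === 0 ||
      codePoint > 0x10FFFF ||
      (codePoint >= 0xD800 && codePoint <= 0xDFFF)
    ) {
      return match;
    }
    return String.fromCodePoint(codePoint);
  });
}

// Four spellings of the same character, U+00A9 (the copyright sign).
const forms = ['&#169;', '&#0169;', '&#xA9;', '&#Xa9;'];
console.log(forms.map(decodeNumericReferences)); // each entry decodes to '©'
```

Even this narrow slice needs a guard for invalid code points, and it still leaves `&copy;` and `&copy` undecoded, which hints at why decoding is the harder half of the problem.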

On the client side, at run-time, escaping non-ASCII symbols (like he.encode() does) before assigning a string to .innerHTML won’t really make a difference – only escaping the unsafe characters would matter in that case.

If your only goal is to escape HTML like the he.escape() helper method does, then he is probably overkill.

While the <textarea> hack works, it feels very hacky to me, and it won’t work in non-browser environments (like you mentioned). Even in browser environments it might give results that are in violation of the spec. Yep, some browsers have buggy implementations of named character references — try http://mathias.html5.org/tests/html/named-character-references/ in IE, for example. Try older browser versions too.

Just .replace()ing the characters as needed (like he.escape() or _.escape() do) seems much simpler, less hacky, ensures the output is predictable/deterministic, and it’s probably faster, too.
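The `.replace()` approach described above can be sketched as follows. This is an illustrative sketch in the spirit of he.escape() and _.escape(), not a copy of either: the names `ESCAPE_MAP` and `escapeHtml` are hypothetical, and the exact set of characters each library escapes may differ from the five chosen here.

```javascript
// Hypothetical lookup table (an assumption for illustration): maps each
// character that is unsafe in HTML text or attribute values to a reference.
const ESCAPE_MAP = {
  '&': '&amp;',
  '<': '&lt;',
  '>': '&gt;',
  '"': '&quot;',
  "'": '&#x27;',
};

// Escapes a string with a single regex pass and no DOM access, so the
// result is the same in every browser and in Node.js.
function escapeHtml(text) {
  return String(text).replace(/[&<>"']/g, (ch) => ESCAPE_MAP[ch]);
}

console.log(escapeHtml('<img src=x onerror="alert(1)">'));
// &lt;img src=x onerror=&quot;alert(1)&quot;&gt;
```

Because the function never touches the DOM, no markup is ever parsed and the output depends only on the input string.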

ExplodingCabbage commented 10 years ago

Thanks for the reply - I think it resolves my question fully. BTW, I went ahead and posted an answer on SO about your library. Naturally, feel free to tweak it if you reckon I've missed anything important or said anything dumb. :)

mathiasbynens commented 10 years ago

:+1:

ExplodingCabbage commented 9 years ago

The punchline to all this, which might interest you: you were right to be turned off by the <textarea> hack. It turns out that in jQuery 1.8 and below, the code given in http://stackoverflow.com/a/1395954/1709587 is XSS-vulnerable, because .html() in those versions of jQuery would explicitly and deliberately run scripts in the given HTML string. A commenter gives the example of $("<textarea/>").html('<script>alert("lol")</script>').text(), which will show an alert on jQuery 1.7.

I am glad to have offered up your library as an alternative answer, but sad to have polished up the insecure <textarea> answer and edited in reassurances about it being secure. :( Fixing now.

msikma commented 9 years ago

Good update to the question. :+1:

Very nice and thoughtful reply too. Indeed jQuery 1.8 and below runs scripts in HTML strings, and this is deliberate. It's useful in some situations—I remember once making a Tumblr theme with infinite scrolling that needed to execute <script> tags to enable dynamic content, because of how limited Tumblr's theming interface is. It allows only entire pieces of HTML to be inserted into the page (that is, if you want non-JS compatibility).

licaomeng commented 4 years ago

Nice discussion