jswebtools / language-ecmascript

Haskell library: ECMAScript parser, pretty-printer and additional tools
Other
46 stars 26 forks source link

`PrettyPrint.jsEscape` doesn't escape tag characters (`<`,`>`): #75

Closed JoeyEremondi closed 8 years ago

JoeyEremondi commented 8 years ago

If you have something like StringLit "<script>alert(\"a\")</script>@emaildomain.con", it generate JS that looks like this:

var testVar = "<script>alert(\"a\")</script>@emaildomain.con";

When read in a browser, this causes a syntax error, because it reads the </script> as the end of the HTML <script> body.

As described in this StackOverflow answer, the following characters should be escaped to avoid interfering with JS or HTML parsers:

<
>
"
 '
\
&
achudnov commented 8 years ago

Strictly speaking, this issue is invalid, for two reasons:

  1. The ECMAScript specification does not concern itself with parseability of a program in an inline script tag.
  2. What you are suggesting seems to be an overkill. The HTML5 spec, section 4.11.1.2 (http://www.w3.org/TR/html5/scripting-1.html#restrictions-for-contents-of-script-elements) talks about additional restrictions on program text to ensure that it is parseable in an inline script tag. It also offers a more measured way of escaping the tricky bits of literals: "The easiest and safest way to avoid the rather strange restrictions described in this section is to always escape "<!--" as "<!--", "<script" as "<\script", and "</script" as "<\/script" when these sequences appear in literals in scripts (e.g. in strings, regular expressions, or comments)".

I'd support adding the approach suggested in the HTML5 spec to jsEscape --- but only if that doesn't change the semantics: i.e. if parsing "<!--" would yield the same program as parsing "<!--"; same for "<script" as "<\script", and "</script" as "<\/script". It looks like that according to the spec that is, indeed, the case for string and regexp literals (it is permitted to have an escape sequences !, \s and \/ that would evaluate to the respective characters !, s and /), but not for comments. However, since language-ecmascript does not output comments in any version, this should be fine.

michaelficarra commented 8 years ago

:-1: It is your responsibility to safely encode the JS output for inclusion in whichever context you choose.

Also, it is not true that

it is permitted to have an escape sequences !, \s and \/ that would evaluate to the respective characters !, s and /

achudnov commented 8 years ago

Michael, focusing on the substantial and informative part of your message: here are my grammar reductions for string and regexp literals. What am I missing?

SingleStringCharacter -> \ EscapeSequence

DoubleStringCharacter -> \ EscapeSequence

RegularExpressionChar ->RegularExpressionBackslashSequence -> \ RegularExpressionNonTerminator

EscapeSequence -> CharacterEscapeSequence -> NonEscapeCharacter -> SourceCharacter but not EscapeCharacter or LineTerminator ('!', 's', '/' are neither in EscapeCharacter nor in LineTerminator)

RegularExpressionNonTerminator -> SourceCharacter but not LineTerminator ('!', 's', '/' are not in LineTerminator)

Paraphrasing ECMA-262 v5, section 7.8.4: EscapeCharacter :: SingleEscapeCharacter | DecimalDigit | x | u SingleEscapeCharacter :: ' | " | \ | b | f | n | r | t | v

michaelficarra commented 8 years ago

I was not refuting your claim that it is possible to have these escape sequences. I was refuting your claim that the semantics would not change in any of the listed contexts if you precede one of the listed characters with a backslash. In a regex context, \s matches any whitespace character.

michaelficarra commented 8 years ago

An actually safe replacement would be to use the \x## or \u#### escape sequences. But I still feel it is a consumer's responsibility.

If you do want to go this route, though, note that null bytes are allowed literally in JavaScript strings and comments, but are not allowed in HTML at all, so must be escaped. For this, you can use \0.

achudnov commented 8 years ago

"In a regex context, \s matches any whitespace character." --- true, I missed that. Thank you, @michaelficarra .

It does look like escaping s in <script with a unicode escape sequence, while using a character escape in the rest would do the job.

achudnov commented 8 years ago

Regarding the consumer responsibility for encoding, I've just had an interesting thought. In fact, <!-- and <script can appear outside of literals and comments:

So, it appears that not only the HTML5 recommendation does not guarantee that a script is safe to appear in an inline script tag. Adjusting jsEscape the way @JoeyEremondi suggests wouldn't either.

Now, I'm actually thinking that this should be done as a post-processing on the whole printed source, unicode escaping the problematic characters !, s and / in substrings <!--, <script and </script. But this is actually independent of the pretty-printer itself since it could ---and, I believe, should--- be done as a String -> String transformation. I could include a function that does this in language-ecmascript, but it would be the user's responsibility to call it on the printed source that is going to be included in an inline script.

I think the same reasoning applies to source found in inline event handlers and the javascript: pseudo-protocol URLs --- though I'm fuzzy on the syntactic restrictions there.

Thoughts?

achudnov commented 8 years ago

Since I haven't seen any further arguments for including this functionality in language-ecmascript, I'm closing the issue.