rspivak / slimit

SlimIt - a JavaScript minifier/parser in Python
MIT License

Parse function breaks when there’s a line ending in a string #100

Open ariasuni opened 6 years ago

ariasuni commented 6 years ago
In: tree = parser.parse('var i = "test\nvalue"')
Illegal character '"' at 1:8 after LexToken(EQ,'=',1,6)
Illegal character '"' at 1:19 after LexToken(ID,'value',1,14)

In: tree.to_ecma()
'var i = test;\nvalue;'

The behavior is the same with \r.

metatoaster commented 6 years ago

No, never mind: if an actual newline character occurs inside a string token, Node.js doesn't like it either.

$ cat | node
var i = "test
value"
[stdin]:1
var i = "test
        ^^^^^

SyntaxError: Invalid or unexpected token

ES5 (which is what slimit supports) doesn't have multiline strings like Python does, so fortunately for the parsers, this is a genuine syntax error in the provided ES5 script, which the parser correctly reported.

However, if you meant an escape sequence representing the newline, this will work (note the raw string prefix r):

>>> from slimit.parser import Parser
>>> print(Parser().parse(r'var i = "test\nvalue"').to_ecma())
var i = "test\nvalue";
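The difference between the two inputs comes down to what Python passes to the parser, and it can be inspected without slimit at all. This is a minimal sketch using only built-in Python; the variable names `plain` and `raw` are made up for illustration.

```python
# Without the r prefix, Python converts \n into a real newline character
# before the JavaScript parser ever sees the text.
plain = 'var i = "test\nvalue"'

# With the r prefix, the backslash and the letter n stay as two separate
# characters, which is a valid ES5 escape sequence inside the string literal.
raw = r'var i = "test\nvalue"'

print(repr(plain))
print(repr(raw))

# The non-raw version spans two lines of JavaScript source.
print(plain.splitlines())

# The raw version is a single line of JavaScript source.
print(raw.splitlines())
```

The non-raw input is one character shorter than the raw one, because the two-character escape collapses into a single newline.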

ariasuni commented 6 years ago

Well, I had this problem when trying to scrape information out of working JavaScript code on a high-traffic website.

metatoaster commented 6 years ago

Can you please provide the link to the example that choked?

metatoaster commented 6 years ago

Anyway, I do see what you mean - I had mistakenly used my patched version of slimit, which correctly reports that input as a parsing error. The correct behavior for that input is to throw a SyntaxError exception, which my patched version (and calmjs.parse) does. The ECMA-262 specification defines this as invalid syntax in section 7.8.4 (specifically "A line terminator character cannot appear in a string literal" at the bottom of that section, where a "line terminator" includes newline characters).
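The rule quoted from section 7.8.4 can be illustrated with a small standalone check. This is only a rough sketch, not how slimit or calmjs.parse implement it: `find_line_terminator_in_string` is a hypothetical helper, it knows nothing about comments or regex literals, and it assumes the ES5 set of line terminators (LF, CR, U+2028, U+2029) along with the ES5 line continuation (a backslash directly followed by a line terminator).

```python
# Line terminators as listed in ES5 section 7.3.
LINE_TERMINATORS = "\n\r\u2028\u2029"


def find_line_terminator_in_string(source):
    """Return the 0-based offset of the first bare line terminator found
    inside a quoted string literal, or None if there is none.

    Hypothetical helper for illustration: comments and regex literals
    are not handled.
    """
    quote = None  # quote character of the literal we are inside, if any
    i = 0
    while i < len(source):
        ch = source[i]
        if quote is None:
            if ch in "'\"":
                quote = ch
        elif ch == "\\":
            # Skip whatever is escaped. A backslash followed by a line
            # terminator is a line continuation, which ES5 permits.
            if source[i + 1:i + 3] == "\r\n":
                i += 1  # treat CRLF as one terminator sequence
            i += 1
        elif ch == quote:
            quote = None
        elif ch in LINE_TERMINATORS:
            return i
        i += 1
    return None


print(find_line_terminator_in_string('var i = "test\nvalue"'))
print(find_line_terminator_in_string(r'var i = "test\nvalue"'))
```

The first call reports an offset inside the literal, while the second returns `None` because the raw string contains an escape sequence rather than an actual newline.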

To be completely clear, this is the input JavaScript with the invalid syntax:

var i = "test
value"

Assume that input is assigned to program in the following Python code:

>>> from slimit.parser import Parser
>>> parser = Parser()
>>> node = parser.parse(program)
Illegal character '"' at 1:8 after LexToken(EQ,'=',1,6)
Illegal character '"' at 1:19 after LexToken(ID,'value',1,14)
>>> print(node.to_ecma())
var i = test;
value;

This changed the program entirely: slimit erroneously parsed the input in full without raising an error and produced an incorrect AST. This is where my initial confusion lay (I used the output I saw as input, and only then noticed the quotes in the original input). The correct behavior is implemented in calmjs.parse, which correctly processes this as a syntax error:

>>> from calmjs.parse import es5
>>> es5(program)
Traceback (most recent call last):
...
calmjs.parse.exceptions.ECMASyntaxError: Illegal character '"' at 1:9 after '=' at 1:7

ariasuni commented 6 years ago

Well, I’m probably mistaken: this should have been a non-working JavaScript extract among the working ones, because I have the same kind of error with newlines inside strings in my web browser’s console.