rspivak / slimit

SlimIt - a JavaScript minifier/parser in Python
MIT License

Lexer does not conform to ECMA-262's definition of whitespace #84

Open jtbraun opened 8 years ago

jtbraun commented 8 years ago

ECMA-262 lists the allowed whitespace code points in Table 32, but slimit reports some of them as invalid characters. The spec says:

ECMAScript implementations must recognize as WhiteSpace code points listed in the “Separator, space” (Zs) category by Unicode 5.1. ECMAScript implementations may also recognize as WhiteSpace additional category Zs code points from subsequent editions of the Unicode Standard.

Here's a small test that exhibits some of the problems. Other characters in the Unicode Zs category may also need to be included; I haven't looked for those here.

from __future__ import print_function  # keeps the script runnable on Python 2 and 3

import unicodedata
from itertools import product

from slimit.parser import Parser as sParser


def replace_spaces(s, wschar):
    """Yield (label, source) pairs: s unchanged, then s with each space in turn replaced by wschar."""
    yield "WITHOUT REPLACEMENT", s
    offsets = [i for i, c in enumerate(s) if c == ' ']

    # Control characters such as TAB have no Unicode name, so fall back to repr().
    try:
        name = unicodedata.name(wschar[0])
    except ValueError:
        name = repr(wschar)

    for i in offsets:
        yield "WITH REPLACEMENT OF " + name, s[:i] + wschar + s[i+1:]


jsparser = sParser()
for src, wschar in product(
        [u" function_name( 'arg' ) "],
        [u"\x09", u"\x0b", u"\x0c",   # TAB, VT, FF
         u"\x20", u"\xa0",            # SP, NBSP
         u"\uFEFF"]):                 # ZWNBSP (BOM)
    for prefix, js in replace_spaces(src, wschar):
        print(prefix, "=>", js)
        try:
            tree = jsparser.parse(js)
        except SyntaxError as e:
            print("Syntax error", e)
    print()
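
For the remaining Zs code points that the test above does not cover, a rough sketch like the following could build the full candidate list. It is written for Python 3 and has not been run against slimit. The names `EXPLICIT_WHITESPACE`, `zs_code_points` and `whitespace_candidates` are made up for illustration. It also assumes that the Unicode database bundled with the running Python is an acceptable stand-in for Unicode 5.1; the two can differ, so the result may not match the 5.1 Zs set exactly.

```python
import sys
import unicodedata

# Code points the test above already exercises: TAB, VT, FF, SP, NBSP, ZWNBSP.
# Only SP and NBSP among them are themselves in category Zs.
EXPLICIT_WHITESPACE = ["\x09", "\x0b", "\x0c", "\x20", "\xa0", "\ufeff"]


def zs_code_points():
    """Return every character whose general category is Zs in this Python's Unicode data."""
    return [chr(cp) for cp in range(sys.maxunicode + 1)
            if unicodedata.category(chr(cp)) == "Zs"]


def whitespace_candidates():
    """Explicitly listed characters first, then any Zs characters not already listed."""
    extra = [ch for ch in zs_code_points() if ch not in EXPLICIT_WHITESPACE]
    return EXPLICIT_WHITESPACE + extra


if __name__ == "__main__":
    for ch in whitespace_candidates():
        # Control characters have no Unicode name, hence the default value.
        print("U+%04X %s" % (ord(ch), unicodedata.name(ch, "<unnamed control character>")))
```

The list returned by `whitespace_candidates()` could replace the hard-coded second argument to `product(...)` in the test, so that every candidate character is substituted for each space in turn.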