Lexer does not handle Javascript regular expression literals

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
1. Quotes inside a JavaScript regular expression literal break syntax highlighting.

(Please include HTML, not just your source code)
It actually breaks on Douglas Crockford's parseJSON code:
http://www.json.org/json.js

Here's the actual break point:

  s.parseJSON = function (filter) {
    // Parsing happens in three stages. In the first stage, we run the text against
    // a regular expression which looks for non-JSON characters. We are especially
    // concerned with '()' and 'new' because they can cause invocation, and '='
    // because it can cause mutation. But just to be safe, we will reject all
    // unexpected characters.

    try {
      if (/^("(\\.|[^"\\\n\r])*?"|[,:{}\[\]0-9.\-+Eaeflnr-u \n\r\t])+?$/.
        test(this)) {

          // In the second stage we use the eval function to compile the text into a
          // JavaScript structure. The '{' operator is subject to a syntactic ambiguity
          // in JavaScript: it can begin a block or an object literal. We wrap the text
          // in parens to eliminate the ambiguity.

What is the expected output?  What do you see instead?

Expected the prettifier to treat / as the delimiters of the regular expression
literal and to properly parse the single quotation mark inside it.

What version are you using?  On what browser?
Apr 02 version.

Please provide any additional information below.

Original issue reported on code.google.com by phunl...@gmail.com on 22 Apr 2007 at 3:30

GoogleCodeExporter commented 9 years ago

Thanks for the bug report.

From http://www.mozilla.org/js/language/js20/rationale/syntax.html
     "To support error recovery, JavaScript 2.0's lexical grammar must be
      made independent of its syntactic grammar. To make the lexical grammar
      independent of the syntactic grammar, JavaScript 2.0 determines whether
      a / starts a regular expression or is a division (or /=) operator solely
      based on the previous token."

That page then lists the tokens that can precede a regex literal, and says:
     "Regardless of the previous token, // is interpreted as the beginning
      of a comment."
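
A minimal sketch of that rule, not the prettifier's actual code. The names
`REGEX_PRECEDERS` and `classifySlash` are hypothetical, the set is only a small
subset of the tokens the page lists, and treating the start of input as a place
where a regex may begin is an assumption:

```javascript
// Illustrative subset of the tokens that may precede a regex literal.
// The set name and function name are hypothetical.
const REGEX_PRECEDERS = new Set([
  '(', ',', '=', ':', '[', '!', '&&', '||', '?', '{', ';', 'return', 'typeof',
]);

// Classify the "/" at source[index] using only the previous significant token.
// prevToken is the token text, or null at the start of input (assumed here to
// allow a regex).
function classifySlash(prevToken, source, index) {
  // "//" always begins a comment, whatever the previous token was.
  if (source.startsWith('//', index)) {
    return 'comment';
  }
  if (prevToken === null || REGEX_PRECEDERS.has(prevToken)) {
    return 'regex';
  }
  return 'division';
}
```

For example, `classifySlash('(', 'if (/^a/.test(s))', 4)` returns `'regex'`,
while `classifySlash('x', 'x / 2', 2)` returns `'division'`. Block comments
(`/* ... */`) are left out of this sketch.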

Original comment by mikesamuel@gmail.com on 8 May 2007 at 6:19


GoogleCodeExporter commented 9 years ago

I'm going to assume that "the previous token" does not include comment or
whitespace tokens.
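
A sketch of that assumption, where the token shape (`{type, text}`) and the
function name `previousSignificantToken` are hypothetical and not taken from the
prettifier source:

```javascript
// Walk backwards from tokens[index] and return the nearest token that is
// neither whitespace nor a comment, or null if there is none.
function previousSignificantToken(tokens, index) {
  for (let i = index - 1; i >= 0; i--) {
    const token = tokens[i];
    if (token.type !== 'whitespace' && token.type !== 'comment') {
      return token;
    }
  }
  return null;
}
```

With this, `x = /* note */ /re/` would still see `=` as the token preceding the
slash that opens `/re/`.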

I'm further going to assume that the list of preceding tokens is:
        "!", "!=", "!==", "#", "%", "%=", "&", "&&", "&&=", "&=", "(", "*",
        "*=", "+", "+=", ",", "-", "-=", "->", ".", "..", "...", "/", "/=", ":",
        "::", ";", "<", "<<", "<<=", "<=", "=", "==", "===", ">", ">=", ">>",
        ">>=", ">>>", ">>>=", "?", "@", "[", "^", "^=", "^^", "^^=", "{", "|",
        "|=", "||", "||=", "~", "abstract", "break", "case", "catch", "class",
        "const", "continue", "debugger", "default", "delete", "do", "else",
        "enum", "export", "extends", "field", "final", "finally", "for",
        "function", "goto", "if", "implements", "import", "in", "instanceof",
        "is", "namespace", "native", "new", "package", "return", "static",
        "switch", "synchronized", "throw", "throws", "transient", "try",
        "typeof", "use", "var", "volatile", "while", "with",

So I'll need to check that the '.' is not the tail of a number.
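
One possible way to make that check, as a rough sketch only. The function name
`dotEndsNumber` is hypothetical, and the test is a simple heuristic: the '.' is
treated as the tail of a number when it directly follows a run of digits that
is not itself part of an identifier or a longer number:

```javascript
// Return true if the '.' at source[dotIndex] looks like the end of a numeric
// literal such as "1." rather than a member-access dot.
function dotEndsNumber(source, dotIndex) {
  const prefix = source.slice(0, dotIndex + 1);
  // Digits immediately before the dot, preceded by the start of input or by a
  // character that cannot be part of an identifier or number.
  return /(?:^|[^\w$.])\d+\.$/.test(prefix);
}
```

Under this heuristic the '/' in `x = 1./2` follows a number and would be read
as division, while the '.' in `obj.` does not end a number.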

Also, since I'm trying to come up with a lexical scheme that supports reasonably
readable code in a variety of languages, I think I'll skip the keywords in this
list that are not keywords in most languages: `debugger`, `function`, and `field`
come to mind, and `in` and `with` might cause problems as well.  `with` in JS has
to be followed by an open paren, but `in` might present problems.

Removing the keywords that, according to the JavaScript grammar, cannot legally
be followed by a regexp literal yields
        "!", "!=", "!==", "#", "%", "%=", "&", "&&", "&&=", "&=", "(", "*",
        "*=", "+", "+=", ",", "-", "-=", "->", ".", "..", "...", "/", "/=", ":",
        "::", ";", "<", "<<", "<<=", "<=", "=", "==", "===", ">", ">=", ">>",
        ">>=", ">>>", ">>>=", "?", "@", "[", "^", "^=", "^^", "^^=", "{", "|",
        "|=", "||", "||=", "~", "break", "case", "catch",
        "continue", "delete", "do", "else", "finally",
        "in", "instanceof", "is", "return", "throw", "try", "typeof",

"is" and "in" are problematic, and are recently introduced language features.  
I'm inclined to skip them too.

Original comment by mikesamuel@gmail.com on 8 May 2007 at 6:35

Changed title: _Lexer does not handle Javascript regular expression literals _
Changed state: Accepted

GoogleCodeExporter commented 9 years ago

Fixed: implemented lexing of regular expression literals using
an approach based on JavaScript's lexical grammar to decide when a /
begins a regexp literal.

See testcase at
http://google-code-prettify.googlecode.com/svn/trunk/tests/prettify_test.html#issue12

This is more conservative than javascript since I don't attempt to handle
lexically valid but syntactically invalid javascript.

There is one case where a regexp literal in syntactically valid JavaScript
will not be recognized:
    for (var fieldName in /foo/) {
      ...
    }
I have never seen this in practice.  Someone might iterate over a regexp to
iterate out parenthetical matches, but they would have to assign the regexp
to a variable first, since javascript does not allow pooling of regexp
literals.
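
To illustrate why that case is missed, here is a small sketch. The names
`PRECEDERS_WITHOUT_IN` and `mayStartRegex` are hypothetical, and the set holds
only a few entries from the token list earlier in this thread, with `in`
deliberately left out:

```javascript
// Hypothetical reduced set of tokens that may precede a regex literal.
// `in` is intentionally absent, which is what causes the missed case.
const PRECEDERS_WITHOUT_IN = new Set(['(', ',', '=', ';', 'return', 'typeof']);

// Decide from the previous significant token whether a "/" may open a regex.
function mayStartRegex(prevToken) {
  return PRECEDERS_WITHOUT_IN.has(prevToken);
}

// for (var fieldName in /foo/) { ... }
console.log(mayStartRegex('in'));     // false: /foo/ is not seen as a regex
console.log(mayStartRegex('return')); // true
```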

Original comment by mikesamuel@gmail.com on 23 May 2007 at 4:09

Changed state: Fixed

tylerlong / google-code-prettify

Lexer does not handle Javascript regular expression literals #12