tdewolff / parse

Go parsers for web formats
MIT License

JS: Tokenization bug in regular expression literals #117

Closed GraphR00t closed 8 months ago

GraphR00t commented 8 months ago

The '#' character causes issues when it is inside a regular expression literal.

input: /(^#)/
error: unexpected ) on line 1 and column 5

input: /(^|#)/
error: unexpected ) on line 1 and column 6

input: /(^|#|a)/
error: unexpected | on line 1 and column 6

I am using version v2.7.12.

package bug

import (
    "errors"
    "io"
    "testing"

    "github.com/tdewolff/parse/v2"
    "github.com/tdewolff/parse/v2/js"
)

func TestBug(t *testing.T) {

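    // check drains the lexer and fails the test on any error other than io.EOF.
    check := func(t *testing.T, lexer *js.Lexer) {
        t.Helper()
        for lexer.Err() == nil {
            lexer.Next()
        }

        if err := lexer.Err(); err != nil && !errors.Is(err, io.EOF) {
            t.Errorf("unexpected lexer error: %v", err)
        }
    }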

    t.Run("ok", func(t *testing.T) {
        lexer := js.NewLexer(parse.NewInputString("/(^|a)/"))
        check(t, lexer)
    })

    t.Run("# followed by ')'", func(t *testing.T) {
        // unexpected ) on line 1 and column 6
        lexer := js.NewLexer(parse.NewInputString("/(^|#)/"))
        check(t, lexer)
    })

    t.Run("# followed by ')' in assignment", func(t *testing.T) {
        // unexpected ) on line 1 and column 15
        lexer := js.NewLexer(parse.NewInputString("var re = /(^|#)/"))
        check(t, lexer)
    })

    t.Run("# followed by '|'", func(t *testing.T) {
        // unexpected | on line 1 and column 6
        lexer := js.NewLexer(parse.NewInputString("/(^|#|a)/"))
        check(t, lexer)
    })

    t.Run("# followed by '|' in assignment", func(t *testing.T) {
        // unexpected | on line 1 and column 15
        lexer := js.NewLexer(parse.NewInputString("var re = /(^|#|a)/"))
        check(t, lexer)
    })
}

tdewolff commented 8 months ago

Unfortunately, this is not a bug but a peculiarity of the ECMAScript specification. A lexer alone cannot tokenize a JS file correctly; only a parser can. The problem is that whether a RegExp literal is expected at a given position depends on the parsing context. There is some information about this here: https://github.com/tdewolff/parse/tree/master/js. Basically, you need to know whether a / or /= operator is allowed at that position; if it is not, you need to re-lex the same token as a RegExp.
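
To make the context dependence concrete, here is a minimal, standard-library-only sketch of the kind of decision a caller has to make before treating a `/` as the start of a RegExp. The function `regExpAllowed` and its rules are hypothetical and written only for illustration; this is not how the tdewolff/parse lexer or parser works internally. It is a rough previous-token heuristic, and it is known to be wrong in some cases (for example after `)` in `if (x) /re/.test(y)`, or after a postfix `++`), which is exactly why a real parser is needed.

```go
package main

import (
	"fmt"
	"strings"
)

// regExpAllowed reports whether a '/' that directly follows the token prev
// should be read as the start of a regular expression literal instead of a
// division operator. Hypothetical heuristic for illustration only: it looks
// at the previous token and nothing else.
func regExpAllowed(prev string) bool {
	if prev == "" {
		return true // start of input: an expression is expected
	}
	switch prev {
	case "return", "typeof", "case", "throw", "void", "delete", "new", "in", "instanceof":
		return true // keywords that are followed by an expression
	}
	// After an operator or an opening punctuator an expression is expected.
	// After an identifier, a number, ')' or ']' we assume division.
	last := prev[len(prev)-1]
	return strings.IndexByte("(,=:[!&|?{};+-*%<>~^", last) >= 0
}

func main() {
	// In `var re = /(^|#)/` the token before '/' is '=', so a RegExp is allowed.
	// In `a / b` the token before '/' is the identifier 'a', so it is division.
	for _, prev := range []string{"", "=", "(", "a", ")", "return"} {
		fmt.Printf("%q -> regexp allowed: %v\n", prev, regExpAllowed(prev))
	}
}
```

Under this sketch, the inputs from the report (`/(^|#)/` at the start of input, or after `var re =`) would be classified as RegExp literals. A standalone lexer that has no such context presumably reads the leading `/` as a division operator and then fails on the `#`, which matches the errors reported above.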

GraphR00t commented 8 months ago

Thank you for the clarification.