tree-sitter / tree-sitter

An incremental parsing system for programming tools
https://tree-sitter.github.io
MIT License

Regex consuming too much #2225

Open gjvnq opened 1 year ago

gjvnq commented 1 year ago

I think I found a bug in how tree-sitter 0.20.8 processes regular expressions. When the regex match ends, tree-sitter seems to assume the token ends where the lexer stopped consuming characters rather than where the regex stopped matching. It seems like it can backtrack when needed, but only by up to one character.

Grammar:

module.exports = grammar({
    name: 'testlang',
    rules: {
        source_file: $ => repeat($.comment),
        _non_comment_text: $ => choice(/([^{}%]+|[^{]%|%[^}])+/),
        comment: $ => seq('{%', $._non_comment_text, '%}'),
    }
})

Input: {%012%3456%}

Command: tree-sitter generate && tree-sitter parse example-file2 -d

Output:

new_parse
process version:0, version_count:1, state:1, row:0, col:0
lex_internal state:0, row:0, column:0
  consume character:'{'
  consume character:'%'
lexed_lookahead sym:{%, size:2
shift state:4
process version:0, version_count:1, state:4, row:0, col:2
lex_internal state:2, row:0, column:2
  consume character:'0'
  consume character:'1'
  consume character:'2'
  consume character:'%'
  consume character:'3'
  consume character:'4'
  consume character:'5'
  consume character:'6'
  consume character:'%' <-----
  consume character:'}' <-----
lexed_lookahead sym:_non_comment_text_token1, size:9
shift state:7
process version:0, version_count:1, state:7, row:0, col:11
lex_internal state:0, row:0, column:11
lex_internal state:0, row:0, column:11
skip_unrecognized_character
  consume character:'}' <-----
lex_internal state:0, row:0, column:12
lexed_lookahead sym:ERROR, size:1
detect_error
resume version:0
skip_token symbol:ERROR
process version:0, version_count:1, state:0, row:0, col:12
lex_internal state:0, row:0, column:12
lexed_lookahead sym:end, size:0
recover_to_previous state:1, depth:4
recover_eof
process version:1, version_count:2, state:1, row:0, col:12
reduce sym:source_file, child_count:0
accept
done
(source_file [0, 0] - [0, 12]
  (ERROR [0, 0] - [0, 12]
    (ERROR [0, 11] - [0, 12])))
example-file2   0 ms    (ERROR [0, 0] - [0, 12])

According to regex.com, /([^{}%]+|[^{]%|%[^}])+/ applied to 012%3456%} should match 012%3456, but tree-sitter seems to think it matched 012%3456%, as it immediately proceeds to analyzing the character } instead of %.
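
A quick way to cross-check the expected leftmost/backtracking behavior in Node.js (V8); this is just a sanity check, not tree-sitter's lexer:

const re = /([^{}%]+|[^{]%|%[^}])+/;
console.log(re.exec("012%3456%}")[0]);
// => "012%3456" (8 characters), while the parse trace above reports
// `lexed_lookahead sym:_non_comment_text_token1, size:9`, i.e. "012%3456%".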

gjvnq commented 1 year ago

I ran a second experiment on the same input but changed the end tag from %} to just }, and this results in no errors.

Grammar:

module.exports = grammar({
    name: 'testlang',
    rules: {
        source_file: $ => repeat($.comment),
        _non_comment_text: $ => choice(/([^{}%]+|[^{]%|%[^}])+/),
        comment: $ => seq('{%', $._non_comment_text, '}'),
    }
})

Output:

new_parse
process version:0, version_count:1, state:1, row:0, col:0
lex_internal state:0, row:0, column:0
  consume character:'{'
  consume character:'%'
lexed_lookahead sym:{%, size:2
shift state:4
process version:0, version_count:1, state:4, row:0, col:2
lex_internal state:2, row:0, column:2
  consume character:'0'
  consume character:'1'
  consume character:'2'
  consume character:'%'
  consume character:'3'
  consume character:'4'
  consume character:'5'
  consume character:'6'
  consume character:'%'
  consume character:'}'
lexed_lookahead sym:_non_comment_text_token1, size:9
shift state:7
process version:0, version_count:1, state:7, row:0, col:11
lex_internal state:0, row:0, column:11
  consume character:'}'
lexed_lookahead sym:}, size:1
shift state:5
process version:0, version_count:1, state:5, row:0, col:12
lex_internal state:0, row:0, column:12
lexed_lookahead sym:end, size:0
reduce sym:comment, child_count:3
reduce sym:source_file, child_count:1
accept
done
(source_file [0, 0] - [0, 12]
  (comment [0, 0] - [0, 12]))
ahlinc commented 1 year ago

I agree that, for now, Tree-sitter doesn't take alternative order and precedence into account. Instead, it "merges" the alternatives into one common set and does a maximally greedy match.
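
As a rough sketch of what longest-match (maximal-munch) lexing does, and not how tree-sitter's generated lexer is actually implemented, the behavior can be approximated in JS by anchoring the regex and keeping the longest accepted prefix of the input (the longestAcceptedPrefix helper below is just for illustration):

// Sketch only: test every prefix of the input against the anchored regex
// and keep the longest one that matches in full.
function longestAcceptedPrefix(re, input) {
    const anchored = new RegExp(`^(?:${re.source})$`);
    let best = "";
    for (let end = 1; end <= input.length; end++) {
        const prefix = input.slice(0, end);
        if (anchored.test(prefix)) best = prefix;
    }
    return best;
}

longestAcceptedPrefix(/([^{}%]+|[^{]%|%[^}])+/, "012%3456%}");
// => "012%3456%" (9 chars, e.g. "012" + "%3" + "45" + "6%"),
// matching the 9-character token in the parse trace above,
// whereas a leftmost/backtracking engine stops at "012%3456".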

Let's experiment with the V8 engine in a Chrome browser using a helper function:

// Print a regex, an input string, and every capture group of the first match.
function retest(re, input) {
    const result = re.exec(input);
    console.log(`#retest(${re}, "${input}");`);
    console.log("regex:", re.source);
    console.log("input:", input);
    if (result === null) {
        console.log("no match");
        return;
    }
    for (let i = 0; i < result.length; i++) {
        console.log(`match[${i}]: ${result[i]}`);
    }
}

And with Rust's regex crate, using the following helper program:

use anyhow::Result;
use regex::Regex;

#[fncli::cli]
fn main(regex: String, input: String) -> Result<()> {
    let re = Regex::new(regex.as_str())?;
    println!("#retest '{regex}' '{input}'");
    println!("regex: {}", re.as_str());
    println!("input: {input}");
    // Print every capture group of the first match, then stop.
    for (_gi, group) in re.captures_iter(input.as_str()).enumerate() {
        for (ci, capture) in group.iter().enumerate() {
            if let Some(r#match) = capture {
                let s = r#match.as_str();
                println!("match[{ci}]: {s}");
            }
        }
        break;
    }
    println!();
    Ok(())
}

Regexes in real regex engines:

# JS (V8 Chrome)                                   # Rust (regex)

#retest(/([^{}%]+|[^{]%|%[^}])+/, "012%3456%}");   #retest '([^{}%]+|[^{]%|%[^}])+' '012%3456%}'
regex: ([^{}%]+|[^{]%|%[^}])+                      regex: ([^{}%]+|[^{]%|%[^}])+
input: 012%3456%}                                  input: 012%3456%}
match[0]: 012%3456                                 match[0]: 012%3456
match[1]: 456                                      match[1]: 456
# Comment: the above regex can produce the longest match thanks to the `(...)+` wrapper,
#          which allows trying all the alternatives from different starting points.

#retest(/[^{}%]+|[^{]%|%[^}]/, "{%012%3456%}");    #retest '[^{}%]+|[^{]%|%[^}]' '{%012%3456%}'
regex: [^{}%]+|[^{]%|%[^}]                         regex: [^{}%]+|[^{]%|%[^}]
input: {%012%3456%}                                input: {%012%3456%}
match[0]: %0                                       match[0]: %0

#retest(/[^{]%|%[^}]|[^{}%]+/, "{%012%3456%}");    #retest '[^{]%|%[^}]|[^{}%]+' '{%012%3456%}'
regex: [^{]%|%[^}]|[^{}%]+                         regex: [^{]%|%[^}]|[^{}%]+
input: {%012%3456%}                                input: {%012%3456%}
match[0]: %0                                       match[0]: %0
# Comment: once the regex engine commits to an alternative based on the leading character,
#          it can't jump to a later alternative even if that would provide a longer match.

#retest(/[^{}%]+|[^{]%|%[^}]/, "012%3456%}");      #retest '[^{}%]+|[^{]%|%[^}]' '012%3456%}'
regex: [^{}%]+|[^{]%|%[^}]                         regex: [^{}%]+|[^{]%|%[^}]
input: 012%3456%}                                  input: 012%3456%}
match[0]: 012                                      match[0]: 012

#retest(/[^{}%]+|[^{]%|%[^}]/, "2%3456%}");        #retest '[^{}%]+|[^{]%|%[^}]' '2%3456%}'
regex: [^{}%]+|[^{]%|%[^}]                         regex: [^{}%]+|[^{]%|%[^}]
input: 2%3456%}                                    input: 2%3456%}
match[0]: 2                                        match[0]: 2

#retest(/[^{}%]+|[^{]%|%[^}]/, "%3456%}");         #retest '[^{}%]+|[^{]%|%[^}]' '%3456%}'
regex: [^{}%]+|[^{]%|%[^}]                         regex: [^{}%]+|[^{]%|%[^}]
input: %3456%}                                     input: %3456%}
match[0]: %3                                       match[0]: %3

#retest(/[^{}%]+|[^{]%|%[^}]/, "0123456%}");       #retest '[^{}%]+|[^{]%|%[^}]' '0123456%}'
regex: [^{}%]+|[^{]%|%[^}]                         regex: [^{}%]+|[^{]%|%[^}]
input: 0123456%}                                   input: 0123456%}
match[0]: 0123456                                  match[0]: 0123456

#retest(/[^{}%]+|[^{]%|%[^}]/, "3456%}");          #retest '[^{}%]+|[^{]%|%[^}]' '3456%}'
regex: [^{}%]+|[^{]%|%[^}]                         regex: [^{}%]+|[^{]%|%[^}]
input: 3456%}                                      input: 3456%}
match[0]: 3456                                     match[0]: 3456

#retest(/[^{}%]+|[^{]%|%[^}]/, "6%}");             #retest '[^{}%]+|[^{]%|%[^}]' '6%}'
regex: [^{}%]+|[^{]%|%[^}]                         regex: [^{}%]+|[^{]%|%[^}]
input: 6%}                                         input: 6%}
match[0]: 6                                        match[0]: 6

#retest(/[^{]%|%[^}|[^{}%]+]/, "6%}");             #retest '[^{]%|%[^}|[^{}%]+]' '6%}'
regex: [^{]%|%[^}|[^{}%]+]                         regex: [^{]%|%[^}|[^{}%]+]
input: 6%}                                         input: 6%}
match[0]: 6%                                       match[0]: 6%
# Comment: comparing with the previous run, reordering has an effect:
#          alternatives have an ordering precedence.
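
The last run above appears to have a stray `]` in its second alternative; assuming the intent was simply to reorder the same three alternatives, the precedence effect shows up just as clearly in JS:

// Same input, same three alternatives, two orderings (leftmost-first alternation):
/[^{}%]+|[^{]%|%[^}]/.exec("6%}")[0];  // => "6"   ([^{}%]+ is tried first and wins)
/[^{]%|%[^}]|[^{}%]+/.exec("6%}")[0];  // => "6%"  ([^{]% is tried first and wins)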
ahlinc commented 1 year ago

@gjvnq Thank you for the report and catching this!

ahlinc commented 1 year ago

But I'm not sure whether this Tree-sitter behavior should definitely be fixed so that Tree-sitter starts to support real regex alternative semantics, or whether, for some reason (performance?), it should stay with the current behavior, where all regex alternatives merely serve to select a combined subset of the whole Unicode symbol set for greedy matching. If we stay with the current behavior, then it needs to be documented.

ahlinc commented 1 year ago

/cc: @maxbrunsfeld could you tell us what you think about the current behavior and add more context, please?

gjvnq commented 1 year ago

I did some extra experimentation and discovered two working workarounds:

module.exports = grammar({
    name: 'testlang',
    rules: {
        source_file: $ => repeat($.comment),
        _non_comment_text: $ => choice(/[^{}%]+/, /%[^}]/, /[^{]%/),
        comment: $ => seq('{%', repeat($._non_comment_text), '%}'),
    }
})

module.exports = grammar({
    name: 'testlang',
    rules: {
        source_file: $ => repeat($.comment),
        _non_comment_text: $ => choice(/[^{}%]+/, seq('%', /[^}]/), seq(/[^{]/, '%')),
        comment: $ => seq('{%', repeat($._non_comment_text), '%}'),
    }
})

This doesn't seem useful for fixing the bug itself, but it does seem useful for tree-sitter users who might be struggling with the same issue I did.


This is a better version:

module.exports = grammar({
    name: 'testlang',
    rules: {
        source_file: $ => repeat($.comment),
        comment_text: $ => choice(
            token.immediate(/[^{}%]+/),
            token.immediate(/%[^}]/),
            token.immediate(/[^{]%/)),
        comment: $ => seq('{%', repeat(choice($.comment_text, $.comment)), '%}'),
    }
})
ahlinc commented 1 year ago

@gjvnq thanks, I also thought of comparing with the choice function. Using token.immediate doesn't make sense if you don't exclude spaces from the regexes.

The following is what you were trying to express with the initial grammar (in the issue description):

module.exports = grammar({
    name: 'testlang',
    rules: {
        source_file: $ => repeat($.comment),
        comment_text: $ => repeat1(choice(
            /[^{}% ]+/,
            token.immediate(/[^{ ]%/),
            token.immediate(/%[^} ]/),
        )),
        comment: $ => seq('{%', $.comment_text, '%}'),
    }
})

[Screenshot from 2023-04-24 09-55-06]