Open gjvnq opened 1 year ago
I ran a second experiment on the same input but changing the the end tag from %}
to just }
and this results in no errors.
Grammar:
module.exports = grammar({
name: 'testlang',
rules: {
source_file: $ => repeat($.comment),
_non_comment_text: $ => choice(/([^{}%]+|[^{]%|%[^}])+/),
comment: $ => seq('{%', $._non_comment_text, '}'),
}
})
Output:
new_parse
process version:0, version_count:1, state:1, row:0, col:0
lex_internal state:0, row:0, column:0
consume character:'{'
consume character:'%'
lexed_lookahead sym:{%, size:2
shift state:4
process version:0, version_count:1, state:4, row:0, col:2
lex_internal state:2, row:0, column:2
consume character:'0'
consume character:'1'
consume character:'2'
consume character:'%'
consume character:'3'
consume character:'4'
consume character:'5'
consume character:'6'
consume character:'%'
consume character:'}'
lexed_lookahead sym:_non_comment_text_token1, size:9
shift state:7
process version:0, version_count:1, state:7, row:0, col:11
lex_internal state:0, row:0, column:11
consume character:'}'
lexed_lookahead sym:}, size:1
shift state:5
process version:0, version_count:1, state:5, row:0, col:12
lex_internal state:0, row:0, column:12
lexed_lookahead sym:end, size:0
reduce sym:comment, child_count:3
reduce sym:source_file, child_count:1
accept
done
(source_file [0, 0] - [0, 12]
(comment [0, 0] - [0, 12]))
I agree that for now Tree-sitter doesn't take into account an alternatives order and precedence. And instead of that builds a one common set for a maximally greedy match by "merging" alternatives.
Lets experiment with v8
engine in a Chrome browser by using a helper function:
function retest(re, input) {
const result = re.exec(input);
console.log(`#retest(${re}, "${input}");`);
console.log("regex:", re.source);
console.log("input:", input);
for (let i = 0; i < result.length; i++) {
console.log(`match[${i}]: ${result[i]}`);
}
}
And with Rust (regex) with a following helper program:
use anyhow::Result;
use regex::Regex;
#[fncli::cli]
fn main(regex: String, input: String) -> Result<()> {
let re = Regex::new(regex.as_str())?;
println!("#retest '{regex}' '{input}'");
println!("regex: {}", re.as_str());
println!("input: {input}");
for (_gi, group) in re.captures_iter(input.as_str()).enumerate() {
for (ci, capture) in group.iter().enumerate() {
if let Some(r#match) = capture {
let s = r#match.as_str();
println!("match[{ci}]: {s}");
}
}
break;
}
println!();
Ok(())
}
Regexes in real regex engines:
# JS (V8 Chrome) # Rust (regex)
#retest(/([^{}%]+|[^{]%|%[^}])+/, "012%3456%}"); #retest '([^{}%]+|[^{]%|%[^}])+' '012%3456%}'
regex: ([^{}%]+|[^{]%|%[^}])+ regex: ([^{}%]+|[^{]%|%[^}])+
input: 012%3456%} input: 012%3456%}
match[0]: 012%3456 match[0]: 012%3456
match[1]: 456 match[1]: 456
# Comment: the above regex can match a longest match due to `(...)+` wrap
# that allows to try all alternatives from different start points.
#retest(/[^{}%]+|[^{]%|%[^}]/, "{%012%3456%}"); #retest '[^{}%]+|[^{]%|%[^}]' '{%012%3456%}'
regex: [^{}%]+|[^{]%|%[^}] regex: [^{}%]+|[^{]%|%[^}]
input: {%012%3456%} input: {%012%3456%}
match[0]: %0 match[0]: %0
#retest(/[^{]%|%[^}]|[^{}%]+/, "{%012%3456%}"); #retest '[^{]%|%[^}]|[^{}%]+' '{%012%3456%}'
regex: [^{]%|%[^}]|[^{}%]+ regex: [^{]%|%[^}]|[^{}%]+
input: {%012%3456%} input: {%012%3456%}
match[0]: %0 match[0]: %0
# Comment: if regex engine falls into some alternative by some beginning char
# it can't jump to a later alternative even if it would provide longer match.
#retest(/[^{}%]+|[^{]%|%[^}]/, "012%3456%}"); #retest '[^{}%]+|[^{]%|%[^}]' '012%3456%}'
regex: [^{}%]+|[^{]%|%[^}] regex: [^{}%]+|[^{]%|%[^}]
input: 012%3456%} input: 012%3456%}
match[0]: 012 match[0]: 012
#retest(/[^{}%]+|[^{]%|%[^}]/, "2%3456%}"); #retest '[^{}%]+|[^{]%|%[^}]' '2%3456%}'
regex: [^{}%]+|[^{]%|%[^}] regex: [^{}%]+|[^{]%|%[^}]
input: 2%3456%} input: 2%3456%}
match[0]: 2 match[0]: 2
#retest(/[^{}%]+|[^{]%|%[^}]/, "%3456%}"); #retest '[^{}%]+|[^{]%|%[^}]' '%3456%}'
regex: [^{}%]+|[^{]%|%[^}] regex: [^{}%]+|[^{]%|%[^}]
input: %3456%} input: %3456%}
match[0]: %3 match[0]: %3
#retest(/[^{}%]+|[^{]%|%[^}]/, "0123456%}"); #retest '[^{}%]+|[^{]%|%[^}]' '0123456%}'
regex: [^{}%]+|[^{]%|%[^}] regex: [^{}%]+|[^{]%|%[^}]
input: 0123456%} input: 0123456%}
match[0]: 0123456 match[0]: 0123456
#retest(/[^{}%]+|[^{]%|%[^}]/, "3456%}"); #retest '[^{}%]+|[^{]%|%[^}]' '3456%}'
regex: [^{}%]+|[^{]%|%[^}] regex: [^{}%]+|[^{]%|%[^}]
input: 3456%} input: 3456%}
match[0]: 3456 match[0]: 3456
#retest(/[^{}%]+|[^{]%|%[^}]/, "6%}"); #retest '[^{}%]+|[^{]%|%[^}]' '6%}'
regex: [^{}%]+|[^{]%|%[^}] regex: [^{}%]+|[^{]%|%[^}]
input: 6%} input: 6%}
match[0]: 6 match[0]: 6
#retest(/[^{]%|%[^}|[^{}%]+]/, "6%}"); #retest '[^{]%|%[^}|[^{}%]+]' '6%}'
regex: [^{]%|%[^}|[^{}%]+] regex: [^{]%|%[^}|[^{}%]+]
input: 6%} input: 6%}
match[0]: 6% match[0]: 6%
# Comment: reordering has an effect by comparing with the previous run,
# alternatives has an ordering precedence.
@gjvnq Thank you for the report and catching this!
But I'm not sure that such Tree-sitter behavior should be fixed for sure and Tree-sitter will start to support real regex alternatives behavior or for some reasons (performance?) stay on the current behavior where all regex alternatives serve for a purpose to select a united subset of Unicode symbols from the whole Unicode symbols set that would do greedy matching. If to stay on the current then such behavior needs to be documented.
/cc: @maxbrunsfeld could you tell what do you think about the current behavior and add more context, please?
I did some extra experimentation and I discovered two working workarounds:
module.exports = grammar({
name: 'testlang',
rules: {
source_file: $ => repeat($.comment),
_non_comment_text: $ => choice(/[^{}%]+/, /%[^}]/, /[^{]%/),
comment: $ => seq('{%', repeat($._non_comment_text), '%}'),
}
})
module.exports = grammar({
name: 'testlang',
rules: {
source_file: $ => repeat($.comment),
_non_comment_text: $ => choice(/[^{}%]+/, seq('%', /[^}]/), seq(/[^{]/, '%')),
comment: $ => seq('{%', repeat($._non_comment_text), '%}'),
}
})
This doesn't seem useful for fixing this bug but does seem useful for tree-sitter users who might be struggling with the same issue I did.
This is a better version:
module.exports = grammar({
name: 'testlang',
rules: {
source_file: $ => repeat($.comment),
comment_text: $ => choice(
token.immediate(/[^{}%]+/),
token.immediate(/%[^}]/),
token.immediate(/[^{]%/)),
comment: $ => seq('{%', repeat(choice($.comment_text, $.comment)), '%}'),
}
})
@gjvnq thanks, I also thought to compare with the choice
func.
The use of the token.immediate
doesn't make sense if you don't exclude spaces from regexes.
The next is what you've tried to express by the initial grammar (in the issue description):
module.exports = grammar({
name: 'testlang',
rules: {
source_file: $ => repeat($.comment),
comment_text: $ => repeat1(choice(
/[^{}% ]+/,
token.immediate(/[^{ ]%/),
token.immediate(/%[^} ]/),
)),
comment: $ => seq('{%', $.comment_text, '%}'),
}
})
I think I found a bug with how tree-sitter 0.20.8 processes regular expressions. When the regex ends, tree-sitter seems to assume it stopped at where it stopped processing characters instead of where it stopped matching characters. It seems like it can backtrack when needed but only up to one character.
Grammar:
Input:
{%012%3456%}
Command:
tree-sitter generate && tree-sitter parse example-file2 -d
Output:
According to regex.com,
/([^{}%]+|[^{]%|%[^}])+/
on012%3456%}
should match012%3456
but it seems like tree-sitter thinks it matched012%3456%
as it immediately proceeds to analysing the character}
instead of%
.