sublimehq / sublime_text

Issue tracker for Sublime Text
https://www.sublimetext.com
809 stars 39 forks source link

Allow to apply scopes to capture groups inside look ahead/behind expressions #1796

Closed evandrocoan closed 7 years ago

evandrocoan commented 7 years ago

Summary

This problem got months until I figure out it was not working as I expected, because today I decided to perform a syntax text I considered redundant at the first time. Then I find now out despite the captures clause was being ignored. The context would still being pushed on the stack, therefore as I was not completely thoroughly testing the syntax, I was not catching this error.

This issue with the captures clause only happens if the match clause is entirely inside a positive look ahead. Example '(?=(test))' versus '(test)'. The syntax I was developing and had this issue was:

  1. https://github.com/evandrocoan/SublimeAmxxPawn/commit/a095ec7562707a72cc6eb558dc426b86a6a1bb72

Related issues:

  1. https://github.com/SublimeTextIssues/Core/issues/1693 Syntax system enhancement - lookahead past line

Expected behavior

When you do:

%YAML 1.2
---
name: test
scope: source.syntax_name

contexts:
  main:
    - match: '(?=(test))'
      captures:
        1: i.am.capturing.it.wow
      push:
        - meta_scope: pushing.it.yeah
        - match: 'it'
          scope: redundant.scope
          pop: true

It is expected to the word text to have the scope i.am.capturing.it.wow and to the meta_scope pushing.it.yeah be pushed to the stack.

Actual behavior

When you do:

%YAML 1.2
---
name: test
scope: source.syntax_name

contexts:
  main:
    - match: '(?=(test))'
      captures:
        1: i.am.capturing.it.wow
      push:
        - meta_scope: pushing.it.yeah
        - match: 'it'
          scope: redundant.scope
          pop: true

The word text does not have the scope i.am.capturing.it.wow and the meta_scope pushing.it.yeah is pushed to the stack.

Steps to reproduce

1) Create a syntax file named test.sublime-syntax on the User folder:

%YAML 1.2
---
name: test
scope: source.syntax_name

contexts:
  main:
    - match: '(?=(test))'
      captures:
        1: i.am.capturing.it.wow
      push:
        - meta_scope: pushing.it.yeah
        - match: 'it'
          scope: redundant.scope
          pop: true

2) Create a new file and set its syntax to User/test

 test it

3) Check the scopes names for the symbols text and it. You will see:

source.syntax_name pushing.it.yeah 
source.syntax_name pushing.it.yeah redundant.scope 

Instead of:

source.syntax_name pushing.it.yeah i.am.capturing.it.wow 
source.syntax_name pushing.it.yeah redundant.scope 

Environment

borela commented 7 years ago

The issue I mentioned is different, I want to check multiple lines before choosing the right context that will apply the scopes. Your example can be solved by normal matches:

%YAML 1.2
---
name: test
scope: source.syntax_name

contexts:
  main:
    - match: '(test)'
      captures:
        1: i.am.capturing.it.wow
      push:
        - meta_scope: pushing.it.yeah
        - match: 'it'
          scope: redundant.scope
          pop: true

Or if you really need to the forward check:


%YAML 1.2
---
name: test
scope: source.syntax_name

contexts:
  main:
    - match: '(?=(test))'
      captures:
        1: i.am.capturing.it.wow
      push:
        - meta_scope: pushing.it.yeah
        - match: '(test)'
        - captures:
            1: i.am.capturing.it.wow
        - match: 'it'
          scope: redundant.scope
          pop: true
evandrocoan commented 7 years ago

The issue I mentioned is different, I want to check multiple lines before choosing the right context that will apply the scopes.

Sorry. Thought, then your issue should be related to this other:

  1. https://github.com/SublimeTextIssues/Core/issues/1412 A native to apply a scope give a chunk of text
  2. https://github.com/SublimeTextIssues/Core/issues/1430 Support multiple lines in build system's file_regex
    - match: '(?=(test))'
      captures:
        1: i.am.capturing.it.wow
      push:
        - meta_scope: pushing.it.yeah
        - match: '(test)'
        - captures:
            1: i.am.capturing.it.wow

This is not a solution, as would make no sense to use the look ahead, if I am going to immediately to a full match on the same pattern. On this case I should just to right away:

    - match: '(test)'
      captures:
        1: i.am.capturing.it.wow
      push:
        - meta_scope: pushing.it.yeah

The example presented is just a simplification of the problem. The captures is not working when it is inside a look ahead. But even worse, it is accepting the pattern inside the look ahead and doing the push.

The original expression where I found this problem is:

    # Handle preprocessor declaration ending a double slash commentary //
    - match: '(?=^\s*\#(define\s+(\w+)).*(\/\/)(.*)\n)'
      captures:
        0: meta.preprocessor.AmxxPawn
        2: support.function.definition.pawn function.definition.AmxxPawn
        3: punctuation.definition.comment.AmxxPawn comment.line.double-slash.AmxxPawn
        4: comment.line.double-slash.AmxxPawn
      push:
        - meta_scope: meta.preprocessor.AmxxPawn
        - include: pawn_string
        - include: preprocessor_numbers
        - match: '(\/\/)(.*)\n'
          captures:
            1: punctuation.definition.comment.AmxxPawn comment.line.double-slash.AmxxPawn
            2: comment.line.double-slash.AmxxPawn
          pop: true

    # From here forward I need to rematch them, that is why I am using the look ahead above
    - match: '[^\\]\n'
      pop: true

I am using look ahead on the first match because I cannot eat everything otherwise I will not be able to apply the others scopes later. As we may see bellow, if I remove the look ahead from the first match, the scoping is completely wrong: image

But if I add it, the it works correctly: image

Now what is not working is the capture associated with the first look ahead: image

![image](https://user-images.githubusercontent.com/5332158/27774100-cd022328-5f60-11e7-9a27-6f3f58b9125e.png)

This highlighted/circled in red scope is the second capture group inside the look ahead, which is being ignored: Packages/Amxx Pawn/syntax_test_amxx-pawn.sma:626:9: [meta.preprocessor.AmxxPawn support.function.definition.pawn function.definition.AmxxPawn] does not match scope [source.AmxxPawn source.sma meta.preprocessor.AmxxPawn]

Packages/Amxx Pawn/syntax_test_amxx-pawn.sma:626:9:  [meta.preprocessor.AmxxPawn support.function.definition.pawn function.definition.AmxxPawn] does not match scope [source.AmxxPawn source.sma meta.preprocessor.AmxxPawn]
Packages/Amxx Pawn/syntax_test_amxx-pawn.sma:626:10: [meta.preprocessor.AmxxPawn support.function.definition.pawn function.definition.AmxxPawn] does not match scope [source.AmxxPawn source.sma meta.preprocessor.AmxxPawn]
Packages/Amxx Pawn/syntax_test_amxx-pawn.sma:626:11: [meta.preprocessor.AmxxPawn support.function.definition.pawn function.definition.AmxxPawn] does not match scope [source.AmxxPawn source.sma meta.preprocessor.AmxxPawn]
Packages/Amxx Pawn/syntax_test_amxx-pawn.sma:626:12: [meta.preprocessor.AmxxPawn support.function.definition.pawn function.definition.AmxxPawn] does not match scope [source.AmxxPawn source.sma meta.preprocessor.AmxxPawn]
Packages/Amxx Pawn/syntax_test_amxx-pawn.sma:626:13: [meta.preprocessor.AmxxPawn support.function.definition.pawn function.definition.AmxxPawn] does not match scope [source.AmxxPawn source.sma meta.preprocessor.AmxxPawn]
Packages/Amxx Pawn/syntax_test_amxx-pawn.sma:626:14: [meta.preprocessor.AmxxPawn support.function.definition.pawn function.definition.AmxxPawn] does not match scope [source.AmxxPawn source.sma meta.preprocessor.AmxxPawn]
Packages/Amxx Pawn/syntax_test_amxx-pawn.sma:626:15: [meta.preprocessor.AmxxPawn support.function.definition.pawn function.definition.AmxxPawn] does not match scope [source.AmxxPawn source.sma meta.preprocessor.AmxxPawn]
Packages/Amxx Pawn/syntax_test_amxx-pawn.sma:626:16: [meta.preprocessor.AmxxPawn support.function.definition.pawn function.definition.AmxxPawn] does not match scope [source.AmxxPawn source.sma meta.preprocessor.AmxxPawn]
Packages/Amxx Pawn/syntax_test_amxx-pawn.sma:626:17: [meta.preprocessor.AmxxPawn support.function.definition.pawn function.definition.AmxxPawn] does not match scope [source.AmxxPawn source.sma meta.preprocessor.AmxxPawn]
Packages/Amxx Pawn/syntax_test_amxx-pawn.sma:626:18: [meta.preprocessor.AmxxPawn support.function.definition.pawn function.definition.AmxxPawn] does not match scope [source.AmxxPawn source.sma meta.preprocessor.AmxxPawn]
Packages/Amxx Pawn/syntax_test_amxx-pawn.sma:626:19: [meta.preprocessor.AmxxPawn support.function.definition.pawn function.definition.AmxxPawn] does not match scope [source.AmxxPawn source.sma meta.preprocessor.AmxxPawn]
borela commented 7 years ago

I know, this is expected, the look ahead is used just to check ahead not apply scopes, I simplified the test example just to show exactly that. You can see more complex examples here where I used the lookahead to distinguish the function types and later I match the parts in a loop set: declaration.

evandrocoan commented 7 years ago

the look ahead is used just to check ahead not apply scopes

Thanks for explaining. Then it would be a enhancement. I would like the capture groups defined inside a look ahead, to be accepted.

You can see more complex examples here where I used the lookahead to distinguish the function types and later I match the parts in a loop set: declaration.

I do not have problems with look ahead. The problem is syntax I developed is too complex and I am forced to not eat the characters. Hence it is necessary to capture scopes inside the look ahead.

The feature implemented on the syntax it a C preprocessor macro syntax enhanced as Notepad++ implements it.

evandrocoan commented 7 years ago

Adding this enhancement is same as allow to set scopes without eating characters, as I can just put what I do not want to eat and apply scopes inside the look ahead sentence as in:

    - match: '(?=^\s*\#(define\s+(\w+)).*(\/\/)(.*)\n)'
      captures:
        0: meta.preprocessor.AmxxPawn
        2: support.function.definition.pawn function.definition.AmxxPawn
        3: punctuation.definition.comment.AmxxPawn comment.line.double-slash.AmxxPawn
        4: comment.line.double-slash.AmxxPawn

I would be setting scopes, but not eating the characters as I am inside a look ahead. I cannot eat them where I use the look ahead, because I need to eat the new line and its preceding character later by the - match: '[^\\]\n' rule on line:

  1. https://github.com/evandrocoan/SublimeAmxxPawn/blob/a095ec7562707a72cc6eb558dc426b86a6a1bb72/AmxxPawn.sublime-syntax#L268

So I can pop the preprocessor from the stack. This creates this visual behavior as Notepad++ editor does:

Sublime Screenshot: image

Notead++ Screenshot: image

On Sublime I added support to apply colors to numbers and strings, while Notepad++ does not. On the last line the preprocessor is popped out because it does not continues with \.

borela commented 7 years ago

I had similar issues when I developed the syntax for MQL4(which is similar to C++), I solved most of them through loops where I would have many lookaheads and then in the loop ignoring the stuff I didn't want, for example:

variables:
  commonIdentifier: |
    [_[:alpha:]]
    [_[:alnum:]]*

main:
  - match: |
      (?xi)^
      \s*\#(define)
      \s+({{commonIdentifier}})
      \s*(\()
    captures:
      1: keyword.other.preprocessor...
      2: entity.name.function...
      3: punctuation.delimiter.function.parameter.begin...
    set: [
      preprocessor-function-body,
      preprocessor-parameters
    ]
  # Other simpler definitions.
  - match: ^s*\#(define)
    set: some-other-definitions

preprocessor-parameters:
  - match: \s*(\))
    captures:
      1: punctuation.delimiter.function.parameter.end...
    pop: true
  # Match without poping.
  - include: percentage

preprocessor-function-body:
  - meta_scope: meta.preprocessor...
  # Backslash before the end of the line, consume it and continue here.
  - match: \s*\\$
    set: preprocessor-function-body
  # End of the line without backslash, finished.
  - match: \s*$
    pop: true
  # Match these ones without poping.
  - include: string
  - include: numbers
  - include: percentage

In the preprocessor-function-body Everything that is not a string, number or the percentage parameter will be ignored but still get the meta scope applied.

evandrocoan commented 7 years ago

Thanks for sharing it! This was very difficult for me to work with. For now I figured it out a workaround on this which helped. I removed the capturing groups from the look ahead, and rematched them inside the stack. To accomplish it, I pushed to the stack each one of the capture groups I would use, and then pop them off, soon as I apply a group to them.

Basically I changed this, which was not working as it inside the look ahead:

    - match: '(?=^\s*\#(define\s+(\w+)).*(\/\/\/)(.*)\n)'
      captures:
        0: meta.preprocessor.AmxxPawn
        2: support.function.definition.pawn function.definition.AmxxPawn
        3: punctuation.definition.doccomment.AmxxPawn comment.line.triple-slash.AmxxPawn
        4: comment.line.triple-slash.AmxxPawn
      push:
        - meta_scope: meta.preprocessor.AmxxPawn
        ...

Into this with several look aheads and look behinds:

    - match: '(?=^\s*\#(define\s+(\w+))(?:\w+).*(\/\/\/)(.*)\n)'
      push:
        - meta_scope: meta.preprocessor.AmxxPawn
        - match: '(?=^\s*\#(define\s+(\w+)))'
          push:
            - match: '(?<=define)\s+(\w+)'
              captures:
                1: support.function.definition.pawn function.definition.AmxxPawn
              pop: true
        ...

I think your way to popping off the stack is much better than the one I use with match: '[^\\]\n'. At the time I wrote this with match: '[^\\]\n' I was ~on Sublime Text build 3114 which does not had the set function.~ For now I am going to keep it compatible with the old build, but when they release some newer stable build other 3126, I can look into removing the match: '[^\\]\n' stuff and implement it with set.

borela commented 7 years ago

I didn't know the 3114 didn't support set, I'll have to fix my package's readme, thank you :D

evandrocoan commented 7 years ago

Sorry, I was mistaken. I just opened Sublime Text build 3114 and found uses of set on the default packages. Probably I confused it with the clear_scopes was not not available on build 3114.

  1. https://forum.sublimetext.com/t/solved-clear-scopes-not-working-on-sublime-build-3114/22931
deathaxe commented 7 years ago

Sorry, but I find the idea totally weird to have a none consuming pattern which is meant to capture something. This is the reason why it's called lookahead. This means: please regex just have a short look onto what's coming next without doing something with what you find.

evandrocoan commented 7 years ago

Thanks for looking into it. However Sublime Text already do it if I use scope or capture 0. For example:

    - match: '(?=test)
      scope: I.am.matching.but.not.consuming

With capture 0, it also works:

    - match: '(?=test)
      captures:
        0: I.am.matching.but.not.consuming

Also, if you open an online engine and to capture things insode a look ahead, they work correctly:

2017-07-03-15-15-50-1-1

2017-07-03-15-19-26-1

https://regex101.com/r/rBY3Wa/1

FichteFoll commented 7 years ago

Capture 0 doesn't work for me, and I wouldn't expect it to.

If also find the idea of "look-aheads" to assign capture groups bad, because:

  1. look-aheads aren't capturing anything, so nothing should happen with what they captured
  2. you can look-ahead within a match and later have actual matches on the characters you looked at, in which case you could have two capture matches for the same character(s)
  3. even without the above, you could push a context later and then have the same character(s) matched twice
  4. I don't know any regex engine that returns captures for groups within look-aheads, but it appears your online engine does - which did you use?
  5. (you can work around it by just encapsulating the region you want to assign a scope to in another capture group)
evandrocoan commented 7 years ago
  1. look-aheads aren't capturing anything, so nothing should happen with what they captured

When applied they are not eating anything. Which is why I need to use them sometimes. The regex engine itself does not eat character as Sublime Text does. For performance increase Sublime Text mark the matched characters as eaten, so it on the next matches they are ignored. This way the performance is greatly increased as only the not scoped characters are sent to the regex engine.

This feature, is to allow the programmer/syntax writer to tell Sublime Text to not eat some characters, therefore to send them to the regex engine, for more one time.


  1. you can look-ahead within a match and later have actual matches on the characters you looked at, in which case you could have two capture matches for the same character(s)

This is certain a problem which comes with the misuse of this feature. So the main guide line is to not use this, unless you are certain you need it. On which case do you need it? For example:

When I was writing the syntax created the rule match: '[^\\]\n', which is used to pop the context from the stack. The problem with this is that I need the end line \n preceding character not eaten.

Then you can say: Ok, not eat the last character of the line. However I cannot do that otherwise the last character will not have its scope applied. The solution I used for this problem is the following:

    - match: '(?=^\s*\#(define\s+(\w+))(?:\w+).*(\/\/)(.*)\n)'
      push:
        - meta_scope: meta.preprocessor.AmxxPawn
        - match: '(?=^\s*\#(define\s+(\w+)))'
          push:
            - match: '(?<=define)\s+(\w+)'
              captures:
                1: support.function.definition.pawn function.definition.AmxxPawn
              pop: true

I still using look ahead, however now I am recapturing everything with look ahead again, and applying the scopes with meta_scope.


  1. I don't know any regex engine that returns captures for groups within look-aheads, but it appears your online engine does - which did you use?

Well, there is a link at the bottom of the image. If you follow it you will see it:

image

  1. https://regex101.com/r/rBY3Wa/1

  1. (you can work around it by just encapsulating the region you want to assign a scope to in another capture group)

Like that, but is much harder to catch everything again, and may be impossible to catch some other things. But to pop the context from the stack, best is not depend on the rule match: '[^\\]\n' which requires the last line character not consumed by Sublime Text. This was initially pointed out by @keith-hall on the issue:

  1. https://github.com/SublimeTextIssues/Core/issues/1377 The Sublime Text's Syntax Regex is jumping/not catching/ignoring the current line end
FichteFoll commented 7 years ago

Another thing to consider here: Starting to suddenly treat groups in look-aheads as capturing groups will break backwards compatibility for a lot of syntaxes and is most likely not something you want to do.

deathaxe commented 7 years ago

Thanks for looking into it. However Sublime Text already do it if I use scope or capture 0. For example:

Capture 0 is no capture. It is just everything being searched on.

evandrocoan commented 7 years ago

@deathaxe Capture 0 is no capture. It is just everything being searched on.

You are correct, Capture 0 is no capture. But saying It is just everything being searched on makes reference to the whole regex input buffer. I think you meant to say It is just everything being matched by the whole regex expression, including all capturing and non-capturing/captured groups.

The capture 0 is the same thing as the scope operator, it applies the scope to the whole match, not the everything which was searched on. For example, these two examples are equivalents:

    - match: '(looking)'
      captures:
        0: first.looking.assignment

    - match: '(looking)'
      scope: first.looking.assignment

Using any of them, will apply the scope first.looking.assignment to the looking regex:

image

@FichteFoll Starting to suddenly treat groups in look-aheads as capturing groups will break backwards compatibility for a lot of syntaxes and is most likely not something you want to do.

It should be bug fix for the old syntaxes which were defining capture groups inside look head. For example, I have begin doing this for several month, until yesterday when I decide to create a syntax test, to assure the capture groups I was assigning inside the look ahead were actually being scoped. And for my big surprise they were not.

I was thinking they were being assigned because right after the look ahead I as pushing the scope meta.preprocessor with meta_scope , so I though the capture groups as well the meta_scope were being pushed. As we may see on the example bellow:

%YAML 1.2
---
name: test
scope: source.syntax_name

contexts:
  main:
    - match: '(?=looking)'
      push:
        - meta_scope: first.looking.assignment
        - match: '(?=looking)'
          push:
            - meta_scope: second.looking.assignment
            - match: '(?=looking)'
              push:
                - meta_scope: third.looking.assignment
                - match: '(?=looking)'
                  push:
                    - match: 'looking'
                      scope: finally.i.am.capturing.it
                      pop: true
                - match: ''
                  pop: true
            - match: ''
              pop: true
        - match: ''
          pop: true

image

Hence if someone else was also using capture groups inside look ahead, they as well would expect they to be captured. Otherwise they would have being using/used (?:non-capture) (non-capturing groups) instead of (capture) (capturing groups).

In case they used capturing groups inside look aheads, when if fact they meant to be using non-capturing groups, then they misused regular expressions. Even if they used capture groups inside look ahead, therefore they will probably not be using the capture clause. For example,

Let us suppose I used capture groups inside a look head, while in fact I was meaning to use non-capturing groups:

%YAML 1.2
---
name: test
scope: source.syntax_name

contexts:
  main:
    - match: '(look)(?=(ing))'
      captures:
        1: first.looking.assignment
      push:
        - meta_scope: second.looking.assignment
        - match: ''
          pop: true

image

But now when Sublime Text starts accepting the capture groups inside the look aheads, their matches would still be correctly targeted, because as we may see, the look ahead capture groups do not interfere with the remaining capturing groups, as they need to correctly follow the whole regex sentence sequencing of capture groups, i.e., regex engines to do apply different sequences for capture groups within and outside the look ahead.

So, if you want to refer to a capture group inside or outside a look ahead, you must to follow their sequence, not caring whether they are inside or outside a look ahead. For example:

sublime_shell

  1. https://regex101.com/r/UlNaHg/1

~Also, you need to notice, the regex engine does not accept look ahead inside expressions, only ahead of them. Hence the example just above will never match because the look ahead is on the middle of the regular expression.~


The only thing Sublime Text is doing right now is to treat the capture groups inside the look ahead as non-capture groups. Then, it is only prejudicing the syntaxes which want to assign scopes to the capture groups defined within the look ahead. For example:

%YAML 1.2
---
name: test
scope: source.syntax_name

contexts:
  main:
    - match: '(look)(?=(ing))'
      captures:
        1: first.looking.assignment
        2: second.looking.assignment
      push:
        - meta_scope: third.looking.assignment
        - match: ''
          pop: true

sublime_shell

Thought, even PackageDev syntax helper is accepting the capture groups inside the look ahead, as we may see on the image just bellow:

sublime_shell

Now only lasts for Sublime Text do the same.

FichteFoll commented 7 years ago

Thought, even PackageDev syntax helper is accepting the capture groups inside the look ahead, as we may see on the image just bellow:

Thanks for that report, I created an issue at https://github.com/SublimeText/PackageDev/issues/114 for it. This is obviously wrong behavior and just the result of a short-sighted implementation.

I'm not going to respond to the rest of your comment, however, because I believe I've said anything I need to. There are a couple contradicting statements and you generally seem to have a weird understanding of look-ahead groups, so I suggest you look into those a little more closer. Particularly, "Hence the example just above will never match because the look ahead is on the middle of the regular expression." is just wrong. It will never match because the regex itself is contradicting.

evandrocoan commented 7 years ago

is just wrong. It will never match because the regex itself is contradicting

It is wrong because as you said:

the regex itself is contradicting

seem to have a weird understanding of look-ahead groups

Thanks, I looked into the regex definition for look ahead:

Lookaheads are zero-width -- they don't match any characters, they just assert that certain conditions are true at that point in the string. So if the lookahead matches, then the characters after it will be CFU / ML or whatever else your lookahead would match.

https://stackoverflow.com/questions/26935128/regex-match-pattern-after-positive-look-ahead

So, I fixed the example:

sublime_shell

  1. https://regex101.com/r/UlNaHg/2

This is what I am trying to say all along. Capture groups inside look ahead should be respected by Sublime Text. It would be a feature to allow the programmer to not each some characters of the parsed text.

I opened a question on Stackoverflow asking about this:

  1. How to prove capturing groups inside look aheads should or should not be allowed by Sublime Text syntax parser?
evandrocoan commented 7 years ago

After some testings I figured out Sublime Text is accepting capture groups inside look ahead. You can try it out with (look)(?=(ing))\2, which makes Sublime Text to correctly apply scopes for the capture group inside the look ahead:

%YAML 1.2
---
name: test
scope: source.syntax_name

contexts:
  main:
    - match: '(look)(?=(ing))\2'
      captures:
        1: first.looking.assignment
        2: second.looking.assignment
      push:
        - meta_scope: third.looking.assignment
        - match: '(?=ing)'
          scope: forth.looking.assignment
        - match: ''
          pop: true

Sublime text is just not applying any scope to the text if the text is not eaten by the regular expression. Therefore it would change the question. Which should just be to allow apply scopes to not eaten characters.

Rephrasing that there is not problem for backward compatibility because by the regex standard, capture groups inside look ahead/behind are acceptable, therefore if someone was defining them, he was meaning to use them.

Even if they do not used non-capturing groups inside look ahead/behind, it will not do any difference as Sublime Text is correctly reading/accepting the capture groups. It is just not applying the scope if the characters are not consumed. This can be seems from the above example:

sublime_shell

sublime_shell

  1. https://regex101.com/r/UlNaHg/3
FichteFoll commented 7 years ago

which makes Sublime Text to correctly apply scopes for the capture group inside the look ahead:

Interesting case.

I'd rather say this is a bug, because it only occurs under a certain situation. Firstly, the backreference is causing Oniguruma to be used rather than sregex (not a bug, just an observation). Secondly, this only happens when there is a backreference to some group inside a look-ahead, which for some reason is possible, which then probably causes the backreference to elevate the match to a proper capturing group.

wbond commented 7 years ago

This is a known implementation detail. I don't have plans on changing it – Jon and I discussed it last year and decided to keep it as-is. It can be summarized as:

Captures in lookaheads are only applied is the capture is fully consumed by the remainder of the match

Lookaheads that are the end of a pattern will not apply scopes to capture groups because they do not actually consume the characters. Instead, the characters are left unconsumed, and the state of the lexer is changed. The context the lexer is in when it next tries to make a match will start testing patterns against the unconsumed characters. This aspect is pretty foundational to how the engine works. Changing this would break probably almost every default syntax that has been modified in the past year. Not only that, but trying to do as requested could easily lead to tokenization backtracking where a token is created for a match, but then retroactively split and have scopes changed by future matches.

The captures will be applied if the capture group is consumed by the rest of the match. This makes sense since the characters will not be made available for matching to the next context, and the token can be constructed, with scopes applied, and added to the token list for the buffer.

If I recall the correctly, the behavior was implemented this way because it is also how TextMate originally implemented it. If we did anything different, TextMate grammars would break.

deathaxe commented 7 years ago

There is no need to change anything as it is no issue from my point of view. It could even be used as some kind of feature to ease some patterns in special situations and maybe avoid one additional set/push to keep meta scopes clean.

evandrocoan commented 7 years ago

So it could be a new match operator which could be used by new syntaxes as required on very few strict situations. It would allow to capture things on non-consumed characters:

%YAML 1.2
---
name: test
scope: source.syntax_name

contexts:
  main:
    - rematch: '(?=(test it))'
      captures:
        1: i.am.capturing.it.wow
      push:
        - meta_scope: pushing.it.yeah
        - match: 'test'
          scope: other.scope
          pop: true
FichteFoll commented 7 years ago

With a test syntax as follows:

%YAML 1.2
---
name: 'Look-ahead Test'
scope: source.test
contexts:
  main:
    - match: '(a)(b)(?=((c)(d)))(...)(f)'
      captures:
        1: a
        2: b
        3: cd.look-ahead
        4: c.look-ahead
        5: d.look-ahead
        6: cde
        7: f

it can be observed that the scope name for capture group 6 is not applied. I presume this is because it has already been matched in the look-ahead.

Following @wbond's comment from earlier, I conclude that this feature is already implemented partially, and the other part is not going to change, because it would break backwards compatibility. Closing as wontfix.

I am surprised that capture groups within look-aheads actually do capture, though.

wbond commented 7 years ago

I am surprised that capture groups within look-aheads actually do capture, though.

I was also, but the implementation was necessary to keep compatibility with TextMate. When discussing Jon and I determined that things would break if we removed it.

Originally I had in mind to make some optimizations to the lexer to drop all of the work related to capturing patterns if they were in a lookahead, however that isn't possible without qualifications due to this feature.