jeff-hykin / better-cpp-syntax

💾 The source of VS Code's C++ syntax highlighting
GNU General Public License v3.0
155 stars 30 forks source link

Rewrite of internal regex generation. #339

Open matter123 opened 5 years ago

matter123 commented 5 years ago

The following example generates the tag meta.specifier.linkage.$14.cpp, however, there are only 13 capture groups

cpp_grammar[:extern_linkage_specifier] = extern_linkage_specifier = newPattern(
    match: std_space.then(
        match: /extern/,
        tag_as: "storage.type.extern"
    ).then(std_space).maybe(
        # This doesn't match the spec as the spec says any string literal is
        # allowed in a linkage specification. However, the spec also says that
        # linkage specification strings should match the name that is being linked.
        # It is unlikely that a language would have a double quote or newline
        # in its name.
        match: newPattern(
            match: /\"/,
            tag_as: "punctuation.definition.string.begin"
        ).then(
            match: zeroOrMoreOf(/[^\"]/),
            reference: "linkage"
        ).then(
            match: /\"/,
            tag_as: "punctuation.definition.string.end"
        ),
        tag_as: "string.quoted.double.extern"
    ),
    tag_as: "meta.specifier.linkage.$reference(linkage)"
)

According to regex 101 the correct group should have been group 12 https://regex101.com/r/doQVjJ/1

matter123 commented 5 years ago

What are your thoughts on PatternRage accepting :tag_{start,end,while}_as to avoid having to wrap the start/end/while pattern in Pattern.new just to tag the entirety of it.

This would be implemented by:

if @arguments[:tag_start_as]
    @start_pattern = Pattern.new(match: @start_pattern, tag_as: @arguments[:tag_start_as])
end

repeat for end and while.

matter123 commented 5 years ago

From #127

Another missing feature is when the start_pattern has a capture group the end pattern can have a back reference to that capture group. See c++ raw string literals.

After doing some digging backreferences in end_pattern and while_pattern Always refer to the start_pattern groups. This means, that and end or while pattern cannot rematch a part of itself.

Using start_pattern's groups in the to_r for end_pattern and while_pattern will generate appropriate backreference numbers to the start_patterns group.

The next issue is often that the referenced group number will be higher than the end pattern has groups. One somewhat ugly solution to this is to pass a flag to to_r that causes it to generate () sequences for each member of the groups array. #to_tag would need to clean that up after converting the regex to a string.

Alternatively, we could reevaluate why in the creation of a pattern, the pattern components are repeatedly converted to and from a Regexp. Afaik the only method on Pattern that actually uses the fact that it is a Regexp is #runTests. I think a lot of the conversion is because #to_r currently has two jobs:

  1. "compile" the pattern into a single object
  2. convert this single object into a Regexp.

I think a lot could be simplified (including backreferences in end/while) if the two jobs were separated into separate sections and #to_r was only called when a Regexp is actually needed.

In that end, I propose adding a method: #evaluate - evaluates the pattern into a string suitable for inserting into a grammar or constructing a Regexp.

Additionally, the behavior of #to_r is modified such that the definition of #to_r is equivalent to

def to_r(*args) Regexp.new(evaluate(*args)) end

This split can be accomplished without change to the rest of the methods in the Pattern and related classes, as the visible behavior of to_r has not changed.

This should hopefully eliminate the vast majority of use cases for Regexp#to_r_s

matter123 commented 5 years ago

check/fix the tag_as: in oneOrMoreOf() or zeroOrMoreOf()

This is how I imagined it working:

  1. If @at_most and @at_least are nil do nothing
  2. If the pattern contains reference: complain as this cannot be rewritten
  3. If the pattern does not have tag_as: do nothing
  4. If the pattern contains includes: either complain as includes: gets used for this or rewrite so that a new pattern wraps this one and includes is moved there
  5. @arguments[:includes] = [ @match ]
  6. delete @arguments[:tag_as]

The checks for tag_as and reference would need to be recursive but not includes as only the top-level includes gets rewritten.

Optionally @match could be rewritten to not capture anything internally, but That is unlikely to actually cause an issue and removing capturing groups is error-prone.

One potential issue to consider is that the inner pattern may not have appropriate boundaries.

Additionally, it may be desired to behave this way, so there should be an option to disable it.

For example, tagging the last member of a member chain is made easier.

Pattern.new(
  match: Pattern.new(match: /\w+/, tag_as: "last.member).maybe(/\./),
  at_least 1.times,
)
jeff-hykin commented 5 years ago

You have Regexp#is_single_entity? return nil. What is the advantage of nil over false?

Nil = unknown False = known for-sure to not be a single entity True = known for-sure to be a single entity

Your recent changes might have made the nil category irrelevant though

matter123 commented 5 years ago

Yes, but what is the advantage of saying that it's is unknown if it is a single entity. The only sane thing you can do when nil is returned is to wrap it.

jeff-hykin commented 5 years ago

Yes, but what is the advantage of saying that it's is unknown if it is a single entity. The only sane thing you can do when nil is returned is to wrap it.

Debugging, and allowing for complex checking logic. If its nil, then maybe run further (computationally intensive) checks, if its false/true then just leave it.

matter123 commented 5 years ago

Fair enough, if a Pattern returns nil it could do #to_r#is_single_entity? To figure it out.

jeff-hykin commented 5 years ago

What are your thoughts on PatternRage accepting :tag_{start,end,while}_as to avoid having to wrap the start/end/while pattern in Pattern.new just to tag the entirety of it.

This would be implemented by:

if @arguments[:tag_start_as]
  @start_pattern = Pattern.new(match: @start_pattern, tag_as: @arguments[:tag_start_as])
end

repeat for end and while.

I think the Pattern.new is nice because it is explicit about what is allowed as a value. When someone is reading code, they know they can put any pattern variable in that position. If its a hash, they might try to do some weird thing like convert a pattern to a hash and then pass the hash as an argument to the pattern range.

That's my thought process at least, I'm not 100% convinced it's always the best way, since it's pretty verbose.

jeff-hykin commented 5 years ago

This means, that and end or while pattern cannot rematch a part of itself.

Wow, what an edgecase. Good job finding that out.

matter123 commented 5 years ago

Under that proposal start_pattern: still needs to be a (or convertible to) a Pattern. Another argument that can be passed to PatternRange.new() is tag_start_as:, which is equivalent to the start_pattern having that for its tag_as

Example:

PatternRange.new(
  start_pattern: /#if/,
  tag_start_as: "keyword.control.preprocessor.directive.conditional.if",
  end_pattern: /#endif/,
  tag_start_as: "keyword.control.preprocessor.directive.conditional.endif",
)
jeff-hykin commented 5 years ago

From #127

Another missing feature is when the start_pattern has a capture group the end pattern can have a back reference to that capture group. See c++ raw string literals.

After doing some digging backreferences in end_pattern and while_pattern Always refer to the start_pattern groups. This means, that and end or while pattern cannot rematch a part of itself.

Using start_pattern's groups in the to_r for end_pattern and while_pattern will generate appropriate backreference numbers to the start_patterns group.

The next issue is often that the referenced group number will be higher than the end pattern has groups. One somewhat ugly solution to this is to pass a flag to to_r that causes it to generate () sequences for each member of the groups array. #to_tag would need to clean that up after converting the regex to a string.

Alternatively, we could reevaluate why in the creation of a pattern, the pattern components are repeatedly converted to and from a Regexp. Afaik the only method on Pattern that actually uses the fact that it is a Regexp is #runTests. I think a lot of the conversion is because #to_r currently has two jobs:

  1. "compile" the pattern into a single object
  2. convert this single object into a Regexp.

I think a lot could be simplified (including backreferences in end/while) if the two jobs were separated into separate sections and #to_r was only called when a Regexp is actually needed.

In that end, I propose adding a method: #evaluate - evaluates the pattern into a string suitable for inserting into a grammar or constructing a Regexp.

Additionally, the behavior of #to_r is modified such that the definition of #to_r is equivalent to

def to_r(*args) Regexp.new(evaluate(*args)) end

This split can be accomplished without change to the rest of the methods in the Pattern and related classes, as the visible behavior of to_r has not changed.

This should hopefully eliminate the vast majority of use cases for Regexp#to_r_s

Yeah, due to the backreference causing the failed compilation, I agree we'll need to change things up.

we could reevaluate why in the creation of a pattern, the pattern components are repeatedly converted to and from a Regexp

I agree. Since runTests is basically the only time actual Regexp is needed, I think we should take regex and convert it into a string as part of taking in arguments for Pattern objects. (Basically convert all regex to regex-strings as soon as possible). Then just keep it consistent that all other functions should expect regex-strings instead of actual regex. This will probably make the code cleaner too.

The #evaluate sounds like a good idea to me, but I'll need a bit more time to remind myself of how the code works before I can give a useful opinion though.

jeff-hykin commented 5 years ago

Under that proposal start_pattern: still needs to be a (or convertible to) a Pattern. Another argument that can be passed to PatternRange.new() is tag_start_as:, which is equivalent to the start_pattern having that for its tag_as

Example:

PatternRange.new(
  start_pattern: /#if/,
  tag_start_as: "keyword.control.preprocessor.directive.conditional.if",
  end_pattern: /#endif/,
  tag_start_as: "keyword.control.preprocessor.directive.conditional.endif",
)

Man that example looks really nice, and there are tons of places it could be used. I think that is probably the better solution, especially since it would be consistent that the start/end pattern still needs to be a (or convertible to) a Pattern.

matter123 commented 5 years ago

Yeah, that sounds reasonable: So for initialize:

  1. if arguments is not a hash arguments = {match: arguments}
  2. if arguments[:match] is a String, @match = Regexp.escape arguments[:match]
  3. else if arguments[:match] is a Regexp @match = arguments[:match].to_r_s
  4. else if arguments[:match] is a Pattern @match = arguments[:match]
  5. else raise "unexpected match argument"
  6. delete arguments[:match]
  7. @arguments = arguments
jeff-hykin commented 5 years ago

Yeah, that sounds reasonable: So for initialize:

  1. if arguments is not a hash arguments = {match: arguments}
  2. if arguments[:match] is a String, @match = Regexp.escape arguments[:match]
  3. else if arguments[:match] is a Regexp @match = arguments[:match].to_r_s
  4. else if arguments[:match] is a Pattern @match = arguments[:match]
  5. else raise "unexpected match argument"
  6. delete arguments[:match]
  7. @arguments = arguments

yeah 👍 . So long as there's still a @original_arguments that saves the exact original input, all those other steps look perfect to me

jeff-hykin commented 5 years ago

check/fix the tag_as: in oneOrMoreOf() or zeroOrMoreOf()

This is how I imagined it working:

  1. If @at_most and @at_least are nil do nothing
  2. If the pattern contains reference: complain as this cannot be rewritten
  3. If the pattern does not have tag_as: do nothing
  4. If the pattern contains includes: either complain as includes: gets used for this or rewrite so that a new pattern wraps this one and includes is moved there
  5. @arguments[:includes] = [ @match ]
  6. delete @arguments[:tag_as]

The checks for tag_as and reference would need to be recursive but not includes as only the top-level includes gets rewritten.

Optionally @match could be rewritten to not capture anything internally, but That is unlikely to actually cause an issue and removing capturing groups is error-prone.

One potential issue to consider is that the inner pattern may not have appropriate boundaries.

Additionally, it may be desired to behave this way, so there should be an option to disable it.

For example, tagging the last member of a member chain is made easier.

Pattern.new(
  match: Pattern.new(match: /\w+/, tag_as: "last.member).maybe(/\./),
  at_least 1.times,
)

Yup this is perfect too. I didn't think about that external $reference() 's would fail.

matter123 commented 5 years ago

So in the effort of fixing tags with oneOrMoreOf, I propose 3 additional public methods:

When initializing a then pattern with quantifiers, if @match.groupless? evaluates to true, the proscribed steps in https://github.com/jeff-hykin/cpp-textmate-grammar/issues/339#issuecomment-524001750 are preformed and @match = @match.groupless. Otherwise, there is no change to @match.

In particular, I want to make the pattern groupless rather than detect for reference and complain, because: A) I'm pretty sure groups are slow; and B) complaining on reference, prohibits the use of local references.

So the option to disable, rather than disabling the entire transformation, just disables the match.groupless transformation.

matter123 commented 5 years ago

To minimize the amount of converting The free function is_string_single_entity? (feel free to change the name) is the same thing that was in Regexp#is_single_entity?

Other renames: #do_self_generate_regex -> #do_self_evaluate #integrate_regex -> #integrate_pattern

matter123 commented 5 years ago

So the next issue with auto fixing up oneOrMoreOf is backreferences.

Even if the backreference is internal to the pattern that is being looped, when it is made groupless it no longer compiles.

I will do further testing to see how back-references behave when quantified.

Edit: unsurprisingly it works exactly as expected, backreferences correctly match what was matched earlier.

So what I want to do now is when a pattern is quantified all of the tag_as is removed, and then the reference names are scrambled, this prohibits accidentally using the reference outside of the pattern that is quantified. The disabling would then disable the scrambling.

matter123 commented 5 years ago

The only place Regexp#to_r_s is called now is:

matter123 commented 5 years ago

Feel free to revert or change the last commit.

In the last commit, I added a rubocop.yml file and made the rewrite compliant.

In the process I uncovered and fixed several bugs:

Sorry for the somewhat large commit.

matter123 commented 5 years ago

PatternRange spends a lot of code doing duplicate actions to both @end_pattern and @while_pattern. This additionally comes with a lot of nil checks, both of these make the class much larger than it needs to be.

As a PatternRange is only ever allowed to have either @end_pattern or @while_pattern (never both, never neither). I am going to rewrite it to store a @start_pattern and @termination_pattern and store whether it's a while or end for when it matters (to_s, to_tag).

matter123 commented 5 years ago

I thought this re-write would take months, I'm really impressed with how much progress has been finished.

the only major things Pattern needs is

  • reTag()

  • runTests()

  • ~replace $match and $reference()~

  • :word_cannot_be_any_of

  • ~dont_backtrack?~

  • as_few_as_possible

  • pattern linting (matches nothing, newline, empty, common prefix)

For the Grammar object, I think we can add a .to_tag method on the Strings class, Hash class, Array class, and Symbol class so that the Grammar object can just call to_tag and have all of them handle the recursion. The grammar object can basically do that first (convert the whole grammar into JSON/Hash) and then it can handle converting the $initial_context into $self or $base.

That's pretty much all that needs to be done to the Library.

I am going to get raTag and maybe :word_cannot_be_any_of in today.

When should tests and linting be done?:

I'm thinking the latter as:

Some disadvantages:

Those could be mitigated by having the linters register what the key to disable them is, something like: Pattern.registerPreLinter matches_nothing. :zero_length_match?

Thinking some more this might be a good way to support transformations: something like: Pattern.registerPreTransformation wordCannotBeAnyOf :word_cannot_be_any_of Pattern.registerPreTransormation unHoistIncludePre Pattern.registerPostTransformation unHoistIncludePost

So wordCannotBeAnyOf is a pre-evaluate transform (that is it is part of the regex generation) and accepts one argument unHoistIncludePre is a pre-evaluate transform that accepts no args unHoistIncludePost is a post-to_tag transform that accepts no args

I do think regardless of if there is registration for transform, there needs to be a way for multiple functions to be a part of the self_evaluate.

You have:

If a subpattern like OneOf wants to override one part, it has to reimplement everything else.

@jeff-hykin Thoughts?

jeff-hykin commented 5 years ago

I am going to get raTag and maybe :word_cannot_be_any_of in today.

Great! Sorry I haven't been able to look over the other stuff, I really appreciate your work on this.

When should tests and linting be done?:

I think they should be part of some kind of final step, like when attached to the grammar grammar[:repo_name] = (so inside the def []=(key) function). That way the pattern should be complete/finalized enough to run unit tests, and the unit tests would throw errors even if they were in an external/imported location.

matter123 commented 5 years ago

Ok, so Grammer#[]= would execute param.run_test() and param.run_pre_linters() if param is a Pattern.

jeff-hykin commented 5 years ago

For tests, maybe it would be better to change the import/export method to this:

grammar = Grammar.new_exportable_grammar
grammar[:repo_name] = Pattern.new()

# performs test, lints, and other checks here
grammar.export([ :repo_name ])

Then just have all tests done when a grammar is exported (either to another grammar or to a JSON file). That way, like you said, whole-grammar tests: like referencing non-existant repo names, can be run at the same time as unit tests and linters.

matter123 commented 5 years ago

Yeah, that sounds good.

jeff-hykin commented 5 years ago

For the pre/post transformations, disabling, and evaluation I'm probably going to have to read the code more to remind myself and get a good opinion, so just use your best judgement.

I think the pre-transformations will probably need to be done when the constructor is called, and then the post transformations, like backreferences, will have to be done near the same time as tests.

There's the possibility that some patterns will contain references to non-existant things, and then that incomplete pattern will later get attached/copied/retag-ed and become valid.

At some point I'll look over the is_single_entity? and groups stuff too

matter123 commented 5 years ago

I imagined that post-transformations would be used for transformations that require the entire tag, like duplicating capture groups to support atom, or rewriting an include that contains ^, so by necessity they would run after a tag for the pattern was formed.

This is all still a pretty rough idea though.

matter123 commented 5 years ago

I'm dropping support for retagging by group number as that information is no longer free, and it goes against the idea that this library promotes of not caring about the group number.

matter123 commented 5 years ago

So I am imagining that word_cannot_be rewrites @match to be Pattern.new(word_cannot_be_patetrn).then(@match), and the original @match is stored to hide this.

That should allow for all patterns to support it, without having to support it in each sub_patterns self_evaluate.

matter123 commented 5 years ago

Pattern#to_tag now processes includes:. I am not sure if they should show up in Pattern#to_s as I feel that that could make the output rather long.

matter123 commented 5 years ago

I also changed the atomic? into is_single_entity?. I still don't think is_single_entity? is a good name, but I wanted to add it to the Regexp class and having a atomic? on a Regexp class that doesn't correlate with regex-atomicness would've been worse. Maybe singular? is a better name

While Regexp#single_entity? still exists, I do not think it is actually called anywhere if you want to change the name back to atomic?.

jeff-hykin commented 5 years ago

I'm dropping support for retagging by group number as that information is no longer free, and it goes against the idea that this library promotes of not caring about the group number.

👍 yeah we shouldn't need them, they're fluid

jeff-hykin commented 5 years ago

Pattern#to_tag now processes includes:. I am not sure if they should show up in Pattern#to_s as I feel that that could make the output rather long.

In the #to_s, just have them output either the symbol/string, or the pattern.name if its not a symbol/string

matter123 commented 5 years ago

There now exists a method to register linters and transforms, see source/rewrite/linters/start_match_empty.rb for an example. The linters and transforms are called by save_to line#133 of grammar.rb.

matter123 commented 5 years ago

So attempting to rewrite some of the source/cpp/lib to be compatible with the new import/export syntax, I feel that Grammar#[] should return some sort of PlaceholderPattern that can be used to include a pattern that doesn't exist yet.

This would add the additional method Pattern#resolve(grammar) which in all patterns recursively calls #resolve on @match and @next_pattern and on placeholder patterns use grammar to resolve the placeholder and populate @match, then calling #resolve on @next_match

Edit: added. See gem/test/test_placeholder_pattern.rb

matter123 commented 5 years ago

Most unsafe methods from Pattern have been removed and replaced by Pattern#map! (the only other remaining bang method is Pattern#insert!.

I've been thinking that it might be desirable automatically freeze any patterns.

Consider:

a = Pattern.new(/abc/)
b = Pattern.new(/bcd/).then(a)
a.match = /def/

b now matches /bcddef/. To prevent this any method that changes a pattern actually changes a clone. Freezing the patterns could prevent the incorrect use of insert! or map! (or a user-provided method).

@jeff-hykin Thoughts?

matter123 commented 5 years ago

The rewrite is finally to the point where the initial problem code compiles and functions correctly (see /test.rb).

source/languages/cpp/lib/{inline_comment,std_space} have been rewritten to use the new export syntax if you wanto to see that.

I went ahead and made any method that returns a copy of Pattern call freeze on the copy.

At this point, I think it is feature complete enough to attempt to get the main cpp grammar working. At that point, the gem should be mature enough for a pre-release publish.

jeff-hykin commented 5 years ago

The rewrite is finally to the point where the initial problem code compiles and functions correctly (see /test.cpp).

That is fantastic, I can't wait to take a look.

matter123 commented 5 years ago

As the gem textmate_grammar doesn't exists in rubygems.org you will likely need to use bundle exec to run it. (bundle exec ruby test.rb)

matter123 commented 5 years ago

I'm thinking that Pattern should be split in two, Pattern and PatternBase. PatternBase contains behavior common to all patterns while Pattern handles quantifiers. All subpatterns would then inherit from PatternBase.

This has the advantage of making it more explicit that most special pattern types don't support quantifiers.

There would be no change to the public API, all non be special patterns will just be quantified to match exactly once. PatternBase.new could optionally be made private to prevent accidental used of it.

As fixing the oneof tagas problem is looking increasing complicated it will help to let most subclasses not have to deal with it.

jeff-hykin commented 5 years ago

Could we split it into Pattern and QuantifiedPattern? Then by default quantification can be off and QuantifiedPattern can be used in the other cases

matter123 commented 5 years ago

Sure, the odd name was to preserve the ability to use Pattern with quantifiers.

jeff-hykin commented 5 years ago

Ah, you're right thats a big oversight I missed.

jeff-hykin commented 5 years ago

I think SingularPattern might be a better name for PatternBase but it's still not great. Use whatever name you like though, since it's internal it shouldn't be an issue to change later

matter123 commented 5 years ago

the last assertion in the just pushed test for or is failing, as I think or is improperly wrapping.

Edit: The issue is that OrPattern#integrate_pattern calls self.evaluate() which causes any chained patterns to be included inside of the alternation shy group.

matter123 commented 5 years ago

Potential names for the base class:

other options:

matter123 commented 5 years ago

There should be a transform so that a quantified pattern can detect when @match is quantified to avoid spamming the console with: warning: nested repeat operator '+' and '?' was replaced with '*' in regular expression

if outer is (0, 1), or (0, nil):
  if inner.need_to_capture?
    no changes
  if inner is (1, m):
    outer now is (1, 1)
    inner now is (0, m)
  if inner is (0, m):
    outer now is (1, 1)
  if inner is (l, m):
    no changes
if outer is (1, nil):
  if inner is (0, m) or (1, m):
    outer now is (1, 1)
  if inner is (l, m)
    now changes
if outer is (a, a)
  if inner.need_to_capture?
    no changes
  if inner is (b, b)
    outer is (1, 1)
    inner is (a*b, a*b)

no changes

l and m represent any possible value for @at_least and @at_most

Edit: this doesn't solve anything and suppressing the warnings in the few places where an actual Regexp is used is far simpler.

matter123 commented 5 years ago

some_number_of_angle_brackets brings up an interesting point of how tests are done. Are unit tests testing the entire pattern after this point or just match:

In the below pattern should the test fail?

Pattern.new(
  match: oneOrMoreOf("a"),
  should_fully_match: ["aa"],
).oneOrMoreOf("b")

If the answer is no, should unit testing for match: re able to reference the outer pattern? That is, should the below test pass?

Pattern.new(
    should_fully_match: ["(func1 arg1 (func2 arg2 arg3 (func3 ( func4 arg4))))"],
    match: Pattern.new("(").then(/\s*/).then(/\w+/).zeroOrMoreOf(
        Pattern.new(/\s+/).oneOf(
            [
                Pattern.new(/\w+/),
                recursivelyMatch("func_call"),
            ],
        ),
    ).then(/\s*/).then(")"),
    reference: "func_call",
)
matter123 commented 5 years ago

The rewrite is now able to generate the cpp grammar and all unit tests and 2/3rds of the integration tests pass.