Open matter123 opened 5 years ago
What are your thoughts on PatternRage accepting :tag_{start,end,while}_as
to avoid having to wrap the start/end/while pattern in Pattern.new
just to tag the entirety of it.
This would be implemented by:
if @arguments[:tag_start_as]
@start_pattern = Pattern.new(match: @start_pattern, tag_as: @arguments[:tag_start_as])
end
repeat for end and while.
From #127
Another missing feature is when the start_pattern has a capture group the end pattern can have a back reference to that capture group. See c++ raw string literals.
After doing some digging backreferences in end_pattern
and while_pattern
Always refer to the start_pattern
groups. This means, that and end or while pattern cannot rematch a part of itself.
Using start_pattern
's groups
in the to_r for end_pattern
and while_pattern
will generate appropriate backreference numbers to the start_patterns group.
The next issue is often that the referenced group number will be higher than the end pattern has groups. One somewhat ugly solution to this is to pass a flag to to_r that causes it to generate ()
sequences for each member of the groups
array. #to_tag would need to clean that up after converting the regex to a string.
Alternatively, we could reevaluate why in the creation of a pattern, the pattern components are repeatedly converted to and from a Regexp. Afaik the only method on Pattern
that actually uses the fact that it is a Regexp is #runTests
. I think a lot of the conversion is because #to_r
currently has two jobs:
I think a lot could be simplified (including backreferences in end/while) if the two jobs were separated into separate sections and #to_r was only called when a Regexp is actually needed.
In that end, I propose adding a method:
#evaluate
- evaluates the pattern into a string suitable for inserting into a grammar or constructing a Regexp.
Additionally, the behavior of #to_r is modified such that the definition of #to_r is equivalent to
def to_r(*args) Regexp.new(evaluate(*args)) end
This split can be accomplished without change to the rest of the methods in the Pattern and related classes, as the visible behavior of to_r has not changed.
This should hopefully eliminate the vast majority of use cases for Regexp#to_r_s
check/fix the tag_as: in oneOrMoreOf() or zeroOrMoreOf()
This is how I imagined it working:
@at_most
and @at_least
are nil do nothingreference:
complain as this cannot be rewrittenincludes:
either complain as includes:
gets used for this or rewrite so that a new pattern wraps this one and includes is moved there@arguments[:includes] = [ @match ]
delete @arguments[:tag_as]
The checks for tag_as and reference would need to be recursive but not includes
as only the top-level includes
gets rewritten.
Optionally @match could be rewritten to not capture anything internally, but That is unlikely to actually cause an issue and removing capturing groups is error-prone.
One potential issue to consider is that the inner pattern may not have appropriate boundaries.
Additionally, it may be desired to behave this way, so there should be an option to disable it.
For example, tagging the last member of a member chain is made easier.
Pattern.new(
match: Pattern.new(match: /\w+/, tag_as: "last.member).maybe(/\./),
at_least 1.times,
)
You have
Regexp#is_single_entity?
returnnil
. What is the advantage of nil over false?
Nil = unknown False = known for-sure to not be a single entity True = known for-sure to be a single entity
Your recent changes might have made the nil category irrelevant though
Yes, but what is the advantage of saying that it's is unknown if it is a single entity. The only sane thing you can do when nil is returned is to wrap it.
Yes, but what is the advantage of saying that it's is unknown if it is a single entity. The only sane thing you can do when nil is returned is to wrap it.
Debugging, and allowing for complex checking logic. If its nil, then maybe run further (computationally intensive) checks, if its false/true then just leave it.
Fair enough, if a Pattern returns nil
it could do #to_r#is_single_entity?
To figure it out.
What are your thoughts on PatternRage accepting
:tag_{start,end,while}_as
to avoid having to wrap the start/end/while pattern inPattern.new
just to tag the entirety of it.This would be implemented by:
if @arguments[:tag_start_as] @start_pattern = Pattern.new(match: @start_pattern, tag_as: @arguments[:tag_start_as]) end
repeat for end and while.
I think the Pattern.new is nice because it is explicit about what is allowed as a value. When someone is reading code, they know they can put any pattern variable in that position. If its a hash, they might try to do some weird thing like convert a pattern to a hash and then pass the hash as an argument to the pattern range.
That's my thought process at least, I'm not 100% convinced it's always the best way, since it's pretty verbose.
This means, that and end or while pattern cannot rematch a part of itself.
Wow, what an edgecase. Good job finding that out.
Under that proposal start_pattern:
still needs to be a (or convertible to) a Pattern. Another argument that can be passed to PatternRange.new() is tag_start_as:
, which is equivalent to the start_pattern having that for its tag_as
Example:
PatternRange.new(
start_pattern: /#if/,
tag_start_as: "keyword.control.preprocessor.directive.conditional.if",
end_pattern: /#endif/,
tag_start_as: "keyword.control.preprocessor.directive.conditional.endif",
)
From #127
Another missing feature is when the start_pattern has a capture group the end pattern can have a back reference to that capture group. See c++ raw string literals.
After doing some digging backreferences in
end_pattern
andwhile_pattern
Always refer to thestart_pattern
groups. This means, that and end or while pattern cannot rematch a part of itself.Using
start_pattern
'sgroups
in the to_r forend_pattern
andwhile_pattern
will generate appropriate backreference numbers to the start_patterns group.The next issue is often that the referenced group number will be higher than the end pattern has groups. One somewhat ugly solution to this is to pass a flag to to_r that causes it to generate
()
sequences for each member of thegroups
array. #to_tag would need to clean that up after converting the regex to a string.Alternatively, we could reevaluate why in the creation of a pattern, the pattern components are repeatedly converted to and from a Regexp. Afaik the only method on
Pattern
that actually uses the fact that it is a Regexp is#runTests
. I think a lot of the conversion is because#to_r
currently has two jobs:
- "compile" the pattern into a single object
- convert this single object into a Regexp.
I think a lot could be simplified (including backreferences in end/while) if the two jobs were separated into separate sections and #to_r was only called when a Regexp is actually needed.
In that end, I propose adding a method:
#evaluate
- evaluates the pattern into a string suitable for inserting into a grammar or constructing a Regexp.Additionally, the behavior of #to_r is modified such that the definition of #to_r is equivalent to
def to_r(*args) Regexp.new(evaluate(*args)) end
This split can be accomplished without change to the rest of the methods in the Pattern and related classes, as the visible behavior of to_r has not changed.
This should hopefully eliminate the vast majority of use cases for
Regexp#to_r_s
Yeah, due to the backreference causing the failed compilation, I agree we'll need to change things up.
we could reevaluate why in the creation of a pattern, the pattern components are repeatedly converted to and from a Regexp
I agree. Since runTests is basically the only time actual Regexp is needed, I think we should take regex and convert it into a string as part of taking in arguments for Pattern objects. (Basically convert all regex to regex-strings as soon as possible). Then just keep it consistent that all other functions should expect regex-strings instead of actual regex. This will probably make the code cleaner too.
The #evaluate
sounds like a good idea to me, but I'll need a bit more time to remind myself of how the code works before I can give a useful opinion though.
Under that proposal
start_pattern:
still needs to be a (or convertible to) a Pattern. Another argument that can be passed to PatternRange.new() istag_start_as:
, which is equivalent to the start_pattern having that for its tag_asExample:
PatternRange.new( start_pattern: /#if/, tag_start_as: "keyword.control.preprocessor.directive.conditional.if", end_pattern: /#endif/, tag_start_as: "keyword.control.preprocessor.directive.conditional.endif", )
Man that example looks really nice, and there are tons of places it could be used. I think that is probably the better solution, especially since it would be consistent that the start/end pattern still needs to be a (or convertible to) a Pattern.
Yeah, that sounds reasonable: So for initialize:
arguments
is not a hash arguments = {match: arguments}
arguments[:match]
is a String, @match = Regexp.escape arguments[:match]
arguments[:match]
is a Regexp @match = arguments[:match].to_r_s
arguments[:match]
is a Pattern @match = arguments[:match]
raise "unexpected match argument"
delete arguments[:match]
@arguments = arguments
Yeah, that sounds reasonable: So for initialize:
- if
arguments
is not a hasharguments = {match: arguments}
- if
arguments[:match]
is a String,@match = Regexp.escape arguments[:match]
- else if
arguments[:match]
is a Regexp@match = arguments[:match].to_r_s
- else if
arguments[:match]
is a Pattern@match = arguments[:match]
- else
raise "unexpected match argument"
delete arguments[:match]
@arguments = arguments
yeah 👍 . So long as there's still a @original_arguments
that saves the exact original input, all those other steps look perfect to me
check/fix the tag_as: in oneOrMoreOf() or zeroOrMoreOf()
This is how I imagined it working:
- If
@at_most
and@at_least
are nil do nothing- If the pattern contains
reference:
complain as this cannot be rewritten- If the pattern does not have tag_as: do nothing
- If the pattern contains
includes:
either complain asincludes:
gets used for this or rewrite so that a new pattern wraps this one and includes is moved there@arguments[:includes] = [ @match ]
delete @arguments[:tag_as]
The checks for tag_as and reference would need to be recursive but not
includes
as only the top-levelincludes
gets rewritten.Optionally @match could be rewritten to not capture anything internally, but That is unlikely to actually cause an issue and removing capturing groups is error-prone.
One potential issue to consider is that the inner pattern may not have appropriate boundaries.
Additionally, it may be desired to behave this way, so there should be an option to disable it.
For example, tagging the last member of a member chain is made easier.
Pattern.new( match: Pattern.new(match: /\w+/, tag_as: "last.member).maybe(/\./), at_least 1.times, )
Yup this is perfect too. I didn't think about that external $reference()
's would fail.
So in the effort of fixing tags with oneOrMoreOf, I propose 3 additional public methods:
#groupless?
- returns true if the regex string as generated by #evaluate
contains no capturing groups#groupless!
- removes any capturing group information from itself and the linked list of subsequent patterns, foo.groupless!.groupless?
Should always evaluate to true. returns self.#groupless
equivalent to __deep_clone.groupless!
When initializing a then pattern with quantifiers, if @match.groupless?
evaluates to true, the proscribed steps in https://github.com/jeff-hykin/cpp-textmate-grammar/issues/339#issuecomment-524001750 are preformed and @match = @match.groupless
. Otherwise, there is no change to @match
.
In particular, I want to make the pattern groupless rather than detect for reference and complain, because: A) I'm pretty sure groups are slow; and B) complaining on reference, prohibits the use of local references.
So the option to disable, rather than disabling the entire transformation, just disables the match.groupless transformation.
To minimize the amount of converting The free function is_string_single_entity?
(feel free to change the name) is the same thing that was in Regexp#is_single_entity?
Other renames:
#do_self_generate_regex
-> #do_self_evaluate
#integrate_regex
-> #integrate_pattern
So the next issue with auto fixing up oneOrMoreOf is backreferences.
Even if the backreference is internal to the pattern that is being looped, when it is made groupless it no longer compiles.
I will do further testing to see how back-references behave when quantified.
Edit: unsurprisingly it works exactly as expected, backreferences correctly match what was matched earlier.
So what I want to do now is when a pattern is quantified all of the tag_as is removed, and then the reference names are scrambled, this prohibits accidentally using the reference outside of the pattern that is quantified. The disabling would then disable the scrambling.
The only place Regexp#to_r_s
is called now is:
Regexp#is_single_entity?
- passes the work of to is_string_single_entity?
Regexp#evaluate
Regexp#integrate_pattern
Pattern#initalize
- when arg1[:match].is_a? Regexp
Feel free to revert or change the last commit.
In the last commit, I added a rubocop.yml file and made the rewrite compliant.
In the process I uncovered and fixed several bugs:
Sorry for the somewhat large commit.
PatternRange spends a lot of code doing duplicate actions to both @end_pattern
and @while_pattern
. This additionally comes with a lot of nil checks, both of these make the class much larger than it needs to be.
As a PatternRange is only ever allowed to have either @end_pattern
or @while_pattern
(never both, never neither). I am going to rewrite it to store a @start_pattern
and @termination_pattern
and store whether it's a while
or end
for when it matters (to_s, to_tag).
I thought this re-write would take months, I'm really impressed with how much progress has been finished.
the only major things Pattern needs is
reTag()
runTests()
~replace
$match
and$reference()
~
:word_cannot_be_any_of
~
dont_backtrack?
~
as_few_as_possible
pattern linting (matches nothing, newline, empty, common prefix)
For the Grammar object, I think we can add a .to_tag method on the Strings class, Hash class, Array class, and Symbol class so that the Grammar object can just call
to_tag
and have all of them handle the recursion. The grammar object can basically do that first (convert the whole grammar into JSON/Hash) and then it can handle converting the $initial_context into $self or $base.That's pretty much all that needs to be done to the Library.
I am going to get raTag and maybe :word_cannot_be_any_of
in today.
When should tests and linting be done?:
#to_tag
; orI'm thinking the latter as:
Some disadvantages:
Those could be mitigated by having the linters register what the key to disable them is, something like:
Pattern.registerPreLinter matches_nothing. :zero_length_match?
Thinking some more this might be a good way to support transformations:
something like:
Pattern.registerPreTransformation wordCannotBeAnyOf :word_cannot_be_any_of
Pattern.registerPreTransormation unHoistIncludePre
Pattern.registerPostTransformation unHoistIncludePost
So wordCannotBeAnyOf
is a pre-evaluate transform (that is it is part of the regex generation) and accepts one argument
unHoistIncludePre
is a pre-evaluate transform that accepts no args
unHoistIncludePost
is a post-to_tag transform that accepts no args
I do think regardless of if there is registration for transform, there needs to be a way for multiple functions to be a part of the self_evaluate.
You have:
@match
If a subpattern like OneOf wants to override one part, it has to reimplement everything else.
@jeff-hykin Thoughts?
I am going to get raTag and maybe :word_cannot_be_any_of in today.
Great! Sorry I haven't been able to look over the other stuff, I really appreciate your work on this.
When should tests and linting be done?:
I think they should be part of some kind of final step, like when attached to the grammar grammar[:repo_name] =
(so inside the def []=(key)
function). That way the pattern should be complete/finalized enough to run unit tests, and the unit tests would throw errors even if they were in an external/imported location.
Ok, so Grammer#[]=
would execute param.run_test()
and param.run_pre_linters()
if param is a Pattern.
For tests, maybe it would be better to change the import/export method to this:
grammar = Grammar.new_exportable_grammar
grammar[:repo_name] = Pattern.new()
# performs test, lints, and other checks here
grammar.export([ :repo_name ])
Then just have all tests done when a grammar is exported (either to another grammar or to a JSON file). That way, like you said, whole-grammar tests: like referencing non-existant repo names, can be run at the same time as unit tests and linters.
Yeah, that sounds good.
For the pre/post transformations, disabling, and evaluation I'm probably going to have to read the code more to remind myself and get a good opinion, so just use your best judgement.
I think the pre-transformations will probably need to be done when the constructor is called, and then the post transformations, like backreferences, will have to be done near the same time as tests.
There's the possibility that some patterns will contain references to non-existant things, and then that incomplete pattern will later get attached/copied/retag-ed and become valid.
At some point I'll look over the is_single_entity?
and groups stuff too
I imagined that post-transformations would be used for transformations that require the entire tag, like duplicating capture groups to support atom, or rewriting an include that contains ^
, so by necessity they would run after a tag for the pattern was formed.
This is all still a pretty rough idea though.
I'm dropping support for retagging by group number as that information is no longer free, and it goes against the idea that this library promotes of not caring about the group number.
So I am imagining that word_cannot_be
rewrites @match
to be Pattern.new(word_cannot_be_patetrn).then(@match)
, and the original @match
is stored to hide this.
That should allow for all patterns to support it, without having to support it in each sub_patterns self_evaluate.
Pattern#to_tag
now processes includes:
. I am not sure if they should show up in Pattern#to_s
as I feel that that could make the output rather long.
I also changed the atomic? into is_single_entity?. I still don't think is_single_entity? is a good name, but I wanted to add it to the Regexp class and having a atomic? on a Regexp class that doesn't correlate with regex-atomicness would've been worse. Maybe singular? is a better name
While Regexp#single_entity?
still exists, I do not think it is actually called anywhere if you want to change the name back to atomic?
.
I'm dropping support for retagging by group number as that information is no longer free, and it goes against the idea that this library promotes of not caring about the group number.
👍 yeah we shouldn't need them, they're fluid
Pattern#to_tag
now processesincludes:
. I am not sure if they should show up inPattern#to_s
as I feel that that could make the output rather long.
In the #to_s, just have them output either the symbol/string, or the pattern.name if its not a symbol/string
There now exists a method to register linters and transforms, see source/rewrite/linters/start_match_empty.rb
for an example. The linters and transforms are called by save_to line#133 of grammar.rb.
So attempting to rewrite some of the source/cpp/lib
to be compatible with the new import/export syntax, I feel that Grammar#[]
should return some sort of PlaceholderPattern
that can be used to include a pattern that doesn't exist yet.
This would add the additional method Pattern#resolve(grammar)
which in all patterns recursively calls #resolve
on @match
and @next_pattern
and on placeholder patterns use grammar to resolve the placeholder and populate @match
, then calling #resolve
on @next_match
Edit: added. See gem/test/test_placeholder_pattern.rb
Most unsafe methods from Pattern have been removed and replaced by Pattern#map!
(the only other remaining bang method is Pattern#insert!
.
I've been thinking that it might be desirable automatically freeze any patterns.
Consider:
a = Pattern.new(/abc/)
b = Pattern.new(/bcd/).then(a)
a.match = /def/
b
now matches /bcddef/
. To prevent this any method that changes a pattern actually changes a clone. Freezing the patterns could prevent the incorrect use of insert!
or map!
(or a user-provided method).
@jeff-hykin Thoughts?
The rewrite is finally to the point where the initial problem code compiles and functions correctly (see /test.rb
).
source/languages/cpp/lib/{inline_comment,std_space}
have been rewritten to use the new export syntax if you wanto to see that.
I went ahead and made any method that returns a copy of Pattern call freeze on the copy.
At this point, I think it is feature complete enough to attempt to get the main cpp grammar working. At that point, the gem should be mature enough for a pre-release publish.
The rewrite is finally to the point where the initial problem code compiles and functions correctly (see /test.cpp).
That is fantastic, I can't wait to take a look.
As the gem textmate_grammar
doesn't exists in rubygems.org you will likely need to use bundle exec
to run it. (bundle exec ruby test.rb
)
I'm thinking that Pattern should be split in two, Pattern and PatternBase. PatternBase contains behavior common to all patterns while Pattern handles quantifiers. All subpatterns would then inherit from PatternBase.
This has the advantage of making it more explicit that most special pattern types don't support quantifiers.
There would be no change to the public API, all non be special patterns will just be quantified to match exactly once. PatternBase.new could optionally be made private to prevent accidental used of it.
As fixing the oneof tagas problem is looking increasing complicated it will help to let most subclasses not have to deal with it.
Could we split it into Pattern
and QuantifiedPattern
? Then by default quantification can be off and QuantifiedPattern can be used in the other cases
Sure, the odd name was to preserve the ability to use Pattern
with quantifiers.
Ah, you're right thats a big oversight I missed.
I think SingularPattern
might be a better name for PatternBase
but it's still not great. Use whatever name you like though, since it's internal it shouldn't be an issue to change later
the last assertion in the just pushed test for or is failing, as I think or is improperly wrapping.
Edit: The issue is that OrPattern#integrate_pattern
calls self.evaluate()
which causes any chained patterns to be included inside of the alternation shy group.
Potential names for the base class:
other options:
SubPatternBase
could be inserted in between sub-patterns and Pattern which disables any quantifying features.There should be a transform so that a quantified pattern can detect when @match
is quantified to avoid spamming the console with:
warning: nested repeat operator '+' and '?' was replaced with '*' in regular expression
if outer is (0, 1), or (0, nil):
if inner.need_to_capture?
no changes
if inner is (1, m):
outer now is (1, 1)
inner now is (0, m)
if inner is (0, m):
outer now is (1, 1)
if inner is (l, m):
no changes
if outer is (1, nil):
if inner is (0, m) or (1, m):
outer now is (1, 1)
if inner is (l, m)
now changes
if outer is (a, a)
if inner.need_to_capture?
no changes
if inner is (b, b)
outer is (1, 1)
inner is (a*b, a*b)
no changes
l and m represent any possible value for @at_least
and @at_most
Edit: this doesn't solve anything and suppressing the warnings in the few places where an actual Regexp is used is far simpler.
some_number_of_angle_brackets
brings up an interesting point of how tests are done. Are unit tests testing the entire pattern after this point or just match:
In the below pattern should the test fail?
Pattern.new(
match: oneOrMoreOf("a"),
should_fully_match: ["aa"],
).oneOrMoreOf("b")
If the answer is no, should unit testing for match:
re able to reference the outer pattern? That is, should the below test pass?
Pattern.new(
should_fully_match: ["(func1 arg1 (func2 arg2 arg3 (func3 ( func4 arg4))))"],
match: Pattern.new("(").then(/\s*/).then(/\w+/).zeroOrMoreOf(
Pattern.new(/\s+/).oneOf(
[
Pattern.new(/\w+/),
recursivelyMatch("func_call"),
],
),
).then(/\s*/).then(")"),
reference: "func_call",
)
The rewrite is now able to generate the cpp grammar and all unit tests and 2/3rds of the integration tests pass.
The following example generates the tag
meta.specifier.linkage.$14.cpp
, however, there are only 13 capture groupsAccording to regex 101 the correct group should have been group 12 https://regex101.com/r/doQVjJ/1