Open matter123 opened 5 years ago
yeah this is going to be a pain to fix and verify
What if, instead of processing things immediately, and producing a big regex value, the value was just saved.
value = /hello/.then(/world/)
value >>> /hello(?:world)/
value = /hello/.then(/world/)
value >>> /hello/
and value.next_pattern >>> { type: "then", arguments: /world/ }
Basically it waits, then once the .to_tag
method is called it can iterate down the chain of calls, give each one a group number, and produce the final regex string. This would also make it possible to verify that references actually exist when they are called.
The arguments for the next pattern could also be stored in the hash. For example:
value = /hello/.then(
match: /world/,
at_least: 1.times
)
value >>> /hello/
and value.next_pattern >>>
{
type: "then",
arguments: {
match: /world/,
at_least: 1.times
},
}
So value.then(/foo/)
would follow the linked list of next_pattern until nil was found, then place /foo/ there, correct?
I actually hadn't considered that, but yeah that's a good solution.
On the internals side, all of the helpers (both then
and .then
forms) would only be doing the linked list insertion of their name and arguments. Then there would just be the to_tag
method that handles the actual conversion logic. Pretty much everything else would be a helper function for to_tag
.
What are your thoughts of or
optionally taking an array of patterns? The english-ness is reduced some but it fixes the issue of a.or(b).or(c).then(d)
actually becoming a|b|cd
.
Alternatively, this could be provided as .oneOf([...])
.
I think the oneOf() would be great 👍it would also fix the very ugly indentation of
/a/.or(
/b/
).or(
/c/
)
To reiterate, the pattern class would have the following members:
type
- the type of the pattern (then|or|backref|subcall)
next_pattern
- This forms a linked list of pattern objectsarguments
- a copy of the arguments passed into_tag
- converts the pattern to a textmate patternto_s
- converts the pattern to its canonical ruby source representationAdditionally, I want to propose the following members
start_pattern
- a reference to itself
to_r
- converts this pattern to a Regexpname
- arguments.reference || arguments.tag_as || to_s
Potentially PatternRange could be a subclass of Pattern.
start_pattern
and end_pattern
would return their respective patternsto_r
would raise an error preventing the use of a pattern range in a patternI was thinking of modifying the Regexp class to keep the //.then()
syntax. But it would be more professional to have a legit Pattern class. Pattern.new() isn't too bad. So I guess yeah lets go with an actual Pattern class. I guess we could always make all the //.then()
's return a Pattern object instead of a regex object.
I think the start_pattern
, to_r
, and name
are great. I think to_s
will be preferable for debugging, but it would still be good to have a name attribute. Making PatternRange as a subclass would be great too. It could make a lot of things more simple.
I'd probably also add a regex
attribute, and let arguments
be all of the other options.
so the //.then(/thing/)
would have no arguments, but would have regex
same with then(match: /thing/)
, no arguments, but would have regex
class Pattern
@regex
@type
@arguments
@next_regex
def name
end
def to_tag
end
def to_s
end
def to_r
end
def start_pattern
end
end
my concern with the regex attribute is that patterns can be constructed from other patterns
a = Pattern.new(
match: zeroOrMoreOf(/\*/).then(match: /&/, at_most: 2.times),
)
what would a.regex
be?
Edit: is regex
a renaming of match
? So a.regex would be a reference to the zeroOrMoreOf(...)
?
I took the class skeleton you provided and made a very basic proof of concept.
Captures are not yet implemented and only the plain form of .then()
is supported. You can find it in source/pattern.rb
of the Tool/misc/pattern
branch.
Example output
regex:
/(abc)(?:def)/
tag:
{:match=>"(abc)(?:def)", :captures=>nil, :name=>"abc"}
name:
abc
canonical form:
Pattern.new(
match: /abc/,
tag_as: abc,
reference: abc,
).then(
match: /def/,
)
puts "regex:"
puts test.to_r.inspect
puts "\ntag:"
puts test.to_tag
puts "\nname:"
puts test.name
puts "\ncanonical form:"
puts test
Edit: The structure of the Pattern class is starting to take shape. then (basic) and maybe chaining functions are supported. Adding a new expression type can be really simple (13 lines for maybe). It has support for rendering a Pattern as a string. The Pattern constructor raises an exception when passed in a Regexp with capture groups.
my concern with the regex attribute is that patterns can be constructed from other patterns
a = Pattern.new( match: zeroOrMoreOf(/\*/).then(match: /&/, at_most: 2.times), )
I missed this message. I think the regex would be the to_r
of zeroOrMoreOf(/\*/).then(match: /&/, at_most: 2.times)
I took the class skeleton you provided and made a very basic proof of concept. Captures are not yet implemented and only the plain form of
.then()
is supported. You can find it insource/pattern.rb
of theTool/misc/pattern
branch.
Thank you! I'll check it out
Wow you did a ton of work on Tool/misc/pattern
! That debugging output looks fantastic. Lets just develop the full solution on that branch.
What I can do is change all of the newPattern
's into Pattern.new
on a separate brach and merge it into master. Then the Tool/misc/pattern
can be tested until it performs the same as the old method. At that point it can seamlessly replace the old method.
In the next hour I should have a rewritten capture method to hopefully allow for sub expressions to work.
In particular the recursiveness of the design makes determining final group numbers impossible without carraying some capture info along the recursive call chain.
Backreference support has been added.
Removed the != nil
on a bunch of checks per https://rubystyle.guide/#no-non-nil-checks, seperated the concept of group info and capture hash.
I haven't figured out why but the following snippet makes an odd output
test = Pattern.new(
match: Pattern.new(/abc/).then(match: /aaa/, tag_as: "aaa"),
tag_as: "abc",
reference: "abc"
).maybe(/def/).then(
match: /ghi/,
tag_as: "ghi"
).lookAheadFor(/jkl/).matchResultOf("abc")
test2 = test.then(/mno/)
puts test2
Output:
Pattern.new(
match: Pattern.new(
match: /abc/,
).then(
match: /aaa/,
tag_as: aaa,
),
tag_as: abc,
reference: abc,
).maybe(
match: /def/,
).then(
match: /ghi/,
tag_as: ghi,
).lookAround(
match: /jkl/,
type: :lookAheadFor,
).matchResultOf("{:backreference_key=>"abc", :match=>/[:backreference:abc:]/}").then(
match: /mno/,
)
Edit: this was because the __deep_clone__
method passes in a hash and the initialize function was not expecting that
Thanks for using the style guide, I actually do prefer the nil checks for 2 reasons though.
1. I want false
to pass the test, e.g. I only wanted to check for existence not truthy-ness.
2. if variable
requires you to know what the language considers as "truthy", and that varies quite a bit from language to language. Ruby considers 0
to be true, which most programmers probably wouldn't expect. Its also non-obvious or hard to intuitively remember if empty lists, empty strings, or empty objects are considered "truthy" in Ruby. I prefer the != nil
since I think it tells people exactly whats happening even if they don't know Ruby. Very few people would expect 0 == nil, except maybe Brendan Eich.
I don't recommend this to you, but, if I could enforce it on myself, I'd probably actually prefer if variable == true
and if variable == false
despite its redundant nature.
However, it would probably be best to get rid of almost all of the != nil
in favor of actual .is_a?(Type)
checks to make sure the data is the structure we were looking for instead of just not nil.
So if @arguments[:tag_as]
would become if @arguments[:tag_as].is_a? String
?
Hmm that is a good question. The validity of @arguments[:tag_as] (e.g. checking that its a string) would be checked at the initializer. Having two String checks would be redundant, and if for some reason we wanted to change it, we'd then have to change the check in two places and one would be easy to forget and would cause an edge case bug.
I'd say thats actually a good place to do a != nil check, since we only want to make sure that the value exists
Also, if you want to get rid of the optimization of the outer group, I think thats fine. I doubt it provides a performance benefit, I originally did it so that Ruby could generate JSON patterns that looked more like they were hand-written. The optimization group adds complexity, and is probably the reason this issue #339 was caused in the first place. If you don't think its a big deal to keep it though thats fine with me too
It already exists in the pattern class and having the group attributes carry around what their group number is, minimizes a lot of the risk of forgetting to pass into some helper was_first_group_removed
.
I wasn't expecting all of the MaybePattern < Pattern
classes, but I think thats a very clean way of doing it by overriding the do_modify_regex
and do_get_to_s_name
, great thinking 👍
Oh and btw, the :repository_name
does not need to be carried over. Thats something that should probably not exist.
This is genius, I was just planning on using the remove_numbered_capture_groups
I think for the MaybePattern
we can do similar to what you've done with the LookAroundPattern
Quantifiers and groups can be multiple separate options, grouped, not-grouped, capturing group, non-capturing group, atomic group, non-atomic, and then all possible quantifier permutations.
I think we can handle all of that logic in one place, and the sub pattern-class, like MaybePattern
can just set those options and save its output type for debugging.
for the @regex
, lets change it to @match
which will make more readable sense. We can also add a to_r
method to the Regexp class so that we don't have to check the type e.g. we can just call @match.to_r
and get the regex form
we can change @next_regex
to @next_match
as well
Sounds good.
Maybe @next_pattern? That is always a pattern.
oh, if its always a pattern then yeah @next_pattern
would be best
I'll go ahead and make that change 👍
Regexp#to_r should probably take (and ignore) groups
Hey I changed part of the way the to_r
works in order to accomodate the .or()
pattern and keep it's logic by itself. Basically do_modify_regex()
handles the @match
, and @next_pattern.integrate_regex()
handles the @match + @next_pattern
(integrate_regex()
isn't a great name, but I can't think of a better one)
I also changed the regex_to_s() into a .to_r_s
method that exists on Regexp
I also changed the atomic?
into is_single_entity?
. I still don't think is_single_entity?
is a good name, but I wanted to add it to the Regexp class and having a atomic?
on a Regexp class that doesn't correlate with regex-atomicness would've been worse. Maybe singular?
is a better name
The changes are all merged with your changes (e.g. PatternRange), I just wanted to give you a summary
Maybe is_single_group? and append_pattern ?
is_single_group is a bit misleading since single-characters are also valid, even though they're not a group.
it is kind of the reverse of append_pattern. More like the_item.append_self_into(the_list)
okay, I fixed the integrate_regex()
since the default value was messed up. I also added the .or()
pattern class. I think I'm done for today, so now you don't have to worry about me messing anything up ¯_(ツ)_/¯
Feel free to rename things as you see fit
I thought this re-write would take months, I'm really impressed with how much progress has been finished.
the only major things Pattern needs is
reTag()
~runTests()
~$match
and $reference()
~:word_cannot_be_any_of
dont_backtrack?
~as_few_as_possible
For the Grammar object, I think we can add a .to_tag method on the Strings class, Hash class, Array class, and Symbol class so that the Grammar object can just call to_tag
and have all of them handle the recursion. The grammar object can basically do that first (convert the whole grammar into JSON/Hash) and then it can handle converting the $initial_context into $self or $base.
That's pretty much all that needs to be done to the Library.
To_tag also needs to convert includes to patterns
Yeah, includes should be an Array, so the Array will just have a recursive to_tag method
$match and $reference now resolve to the correct captures.
Should there be a change in the behavior of should_not_fully_match
?
Quoting contributing.md
should_not_fully_match asserts that the pattern does not match all the >characters in the test strings. note: should_not_fully_match does not imply should_partial_match, that is, a failure to match satisfies should_not_fully_match
yeah the wording is a little confusing. If it fully matches, then the test should fail, all other cases should pass. I'm not sure if the code enforces that or not
This is how I implemented the should_not_fully_match
unit test
if @arguments[:should_not_fully_match].is_a? Array
test_regex = /^(?:#{self_regex})$/
if @arguments[:should_not_fully_match].none? {|test| test =~ test_regex} == false
warn(:should_not_fully_match)
end
end
I'm actually not sure how .none
works, but I'll check it out next time I'm in the code.
foo.none? {|value| condition}
is equivlent to !foo.any? {|value| condition}
and foo.all? {|value| !condition}
You have Regexp#is_single_entity?
return nil
. What is the advantage of nil over false?
Regexp#is_single_entity?
now walks over the regex to determine if an arbitrary regex is, in fact, a single entity.
It works by keeping track of the current nesting "depth"
[
and (
increment the depth]
and )
decrement the depth(
and )
from changing the depthdepth
, escaped?
, or in_set?
are not their expected values, assume parsing error, display a warning and return false.We should investigate using named groups. This could potentially eliminate all forms of looping through the groups to match reference names to group numbers.
Well node-oniguruma's findNextMatch does not allow index groups by group name. As a result, calculating the group number is something that needs to happen anyway.
Though it may still be useful for backreference and subexp as those are parsed by the regex engine and therefore do support named groups.
The following example generates the tag
meta.specifier.linkage.$14.cpp
, however, there are only 13 capture groupsAccording to regex 101 the correct group should have been group 12 https://regex101.com/r/doQVjJ/1