Rewrite of internal regex generation.

matter123 commented 5 years ago

The following example generates the tag meta.specifier.linkage.$14.cpp, however, there are only 13 capture groups

cpp_grammar[:extern_linkage_specifier] = extern_linkage_specifier = newPattern(
    match: std_space.then(
        match: /extern/,
        tag_as: "storage.type.extern"
    ).then(std_space).maybe(
        # This doesn't match the spec as the spec says any string literal is
        # allowed in a linkage specification. However, the spec also says that
        # linkage specification strings should match the name that is being linked.
        # It is unlikely that a language would have a double quote or newline
        # in its name.
        match: newPattern(
            match: /\"/,
            tag_as: "punctuation.definition.string.begin"
        ).then(
            match: zeroOrMoreOf(/[^\"]/),
            reference: "linkage"
        ).then(
            match: /\"/,
            tag_as: "punctuation.definition.string.end"
        ),
        tag_as: "string.quoted.double.extern"
    ),
    tag_as: "meta.specifier.linkage.$reference(linkage)"
)

According to regex 101 the correct group should have been group 12 https://regex101.com/r/doQVjJ/1

jeff-hykin commented 5 years ago

yeah this is going to be a pain to fix and verify

jeff-hykin commented 5 years ago

What if, instead of processing things immediately, and producing a big regex value, the value was just saved.

Current behavior

value = /hello/.then(/world/) value >>> /hello(?:world)/

New Behavior

value = /hello/.then(/world/) value >>> /hello/ and value.next_pattern >>> { type: "then", arguments: /world/ }

Basically it waits, then once the .to_tag method is called it can iterate down the chain of calls, give each one a group number, and produce the final regex string. This would also make it possible to verify that references actually exist when they are called.

The arguments for the next pattern could also be stored in the hash. For example:

value = /hello/.then(
    match: /world/,
    at_least: 1.times
)

value >>> /hello/ and value.next_pattern >>>

{ 
    type: "then",
    arguments: {
        match: /world/,
        at_least: 1.times
    },
}

matter123 commented 5 years ago

So value.then(/foo/) would follow the linked list of next_pattern until nil was found, then place /foo/ there, correct?

jeff-hykin commented 5 years ago

I actually hadn't considered that, but yeah that's a good solution.

jeff-hykin commented 5 years ago

On the internals side, all of the helpers (both then and .then forms) would only be doing the linked list insertion of their name and arguments. Then there would just be the to_tag method that handles the actual conversion logic. Pretty much everything else would be a helper function for to_tag.

matter123 commented 5 years ago

What are your thoughts of or optionally taking an array of patterns? The english-ness is reduced some but it fixes the issue of a.or(b).or(c).then(d) actually becoming a|b|cd.

Alternatively, this could be provided as .oneOf([...]).

jeff-hykin commented 5 years ago

I think the oneOf() would be great 👍it would also fix the very ugly indentation of

/a/.or(
    /b/
).or(
    /c/
)

matter123 commented 5 years ago

To reiterate, the pattern class would have the following members:

type - the type of the pattern (then|or|backref|subcall)
- This is most likely going to be used by the to_* methods (to_tag, to_s, to_r), is there a reason this can't be done through inheritance?
next_pattern - This forms a linked list of pattern objects
arguments - a copy of the arguments passed in
to_tag - converts the pattern to a textmate pattern
to_s - converts the pattern to its canonical ruby source representation

Additionally, I want to propose the following members

start_pattern - a reference to itself
- This is so linters/sanity checkers can treat pattern ranges and patterns the same
to_r - converts this pattern to a Regexp
name - arguments.reference || arguments.tag_as || to_s
- used for displaying an issue with this pattern

Potentially PatternRange could be a subclass of Pattern.

start_pattern and end_pattern would return their respective patterns
to_r would raise an error preventing the use of a pattern range in a pattern

jeff-hykin commented 5 years ago

I was thinking of modifying the Regexp class to keep the //.then() syntax. But it would be more professional to have a legit Pattern class. Pattern.new() isn't too bad. So I guess yeah lets go with an actual Pattern class. I guess we could always make all the //.then()'s return a Pattern object instead of a regex object.

I think the start_pattern, to_r, and name are great. I think to_s will be preferable for debugging, but it would still be good to have a name attribute. Making PatternRange as a subclass would be great too. It could make a lot of things more simple.

jeff-hykin commented 5 years ago

I'd probably also add a regex attribute, and let arguments be all of the other options. so the //.then(/thing/) would have no arguments, but would have regex same with then(match: /thing/), no arguments, but would have regex

class Pattern
    @regex
    @type
    @arguments
    @next_regex

    def name
    end
    def to_tag
    end
    def to_s
    end
    def to_r
    end
    def start_pattern
    end
end

matter123 commented 5 years ago

my concern with the regex attribute is that patterns can be constructed from other patterns

a = Pattern.new(
  match: zeroOrMoreOf(/\*/).then(match: /&/, at_most: 2.times),
)

what would a.regex be?

Edit: is regex a renaming of match? So a.regex would be a reference to the zeroOrMoreOf(...) ?

matter123 commented 5 years ago

I took the class skeleton you provided and made a very basic proof of concept. Captures are not yet implemented and only the plain form of .then() is supported. You can find it in source/pattern.rb of the Tool/misc/pattern branch.

Example output

regex:
/(abc)(?:def)/

tag:
{:match=>"(abc)(?:def)", :captures=>nil, :name=>"abc"}

name:
abc

canonical form:
Pattern.new(
  match: /abc/,
  tag_as: abc,
  reference: abc,
).then(
  match: /def/,
)

puts "regex:"
puts test.to_r.inspect
puts "\ntag:"
puts test.to_tag
puts "\nname:"
puts test.name
puts "\ncanonical form:"
puts test

Edit: The structure of the Pattern class is starting to take shape. then (basic) and maybe chaining functions are supported. Adding a new expression type can be really simple (13 lines for maybe). It has support for rendering a Pattern as a string. The Pattern constructor raises an exception when passed in a Regexp with capture groups.

jeff-hykin commented 5 years ago

my concern with the regex attribute is that patterns can be constructed from other patterns
a = Pattern.new(
  match: zeroOrMoreOf(/\*/).then(match: /&/, at_most: 2.times),
)

I missed this message. I think the regex would be the to_r of zeroOrMoreOf(/\*/).then(match: /&/, at_most: 2.times)

jeff-hykin commented 5 years ago

I took the class skeleton you provided and made a very basic proof of concept. Captures are not yet implemented and only the plain form of .then() is supported. You can find it in source/pattern.rb of the Tool/misc/pattern branch.

Thank you! I'll check it out

jeff-hykin commented 5 years ago

Wow you did a ton of work on Tool/misc/pattern! That debugging output looks fantastic. Lets just develop the full solution on that branch.

What I can do is change all of the newPattern's into Pattern.new on a separate brach and merge it into master. Then the Tool/misc/pattern can be tested until it performs the same as the old method. At that point it can seamlessly replace the old method.

matter123 commented 5 years ago

In the next hour I should have a rewritten capture method to hopefully allow for sub expressions to work.

In particular the recursiveness of the design makes determining final group numbers impossible without carraying some capture info along the recursive call chain.

matter123 commented 5 years ago

Backreference support has been added.

matter123 commented 5 years ago

Removed the != nil on a bunch of checks per https://rubystyle.guide/#no-non-nil-checks, seperated the concept of group info and capture hash.

I haven't figured out why but the following snippet makes an odd output

test = Pattern.new(
    match: Pattern.new(/abc/).then(match: /aaa/, tag_as: "aaa"),
    tag_as: "abc",
    reference: "abc"
).maybe(/def/).then(
    match: /ghi/,
    tag_as: "ghi"
).lookAheadFor(/jkl/).matchResultOf("abc")

test2 = test.then(/mno/)
puts test2

Output:

Pattern.new(
  match: Pattern.new(
      match: /abc/,
    ).then(
      match: /aaa/,
      tag_as: aaa,
    ),
  tag_as: abc,
  reference: abc,
).maybe(
  match: /def/,
).then(
  match: /ghi/,
  tag_as: ghi,
).lookAround(
  match: /jkl/,
  type: :lookAheadFor,
).matchResultOf("{:backreference_key=>"abc", :match=>/[:backreference:abc:]/}").then(
  match: /mno/,
)

Edit: this was because the __deep_clone__ method passes in a hash and the initialize function was not expecting that

jeff-hykin commented 5 years ago

Thanks for using the style guide, I actually do prefer the nil checks for 2 reasons though.
1. I want false to pass the test, e.g. I only wanted to check for existence not truthy-ness.
2. if variable requires you to know what the language considers as "truthy", and that varies quite a bit from language to language. Ruby considers 0 to be true, which most programmers probably wouldn't expect. Its also non-obvious or hard to intuitively remember if empty lists, empty strings, or empty objects are considered "truthy" in Ruby. I prefer the != nil since I think it tells people exactly whats happening even if they don't know Ruby. Very few people would expect 0 == nil, except maybe Brendan Eich.

I don't recommend this to you, but, if I could enforce it on myself, I'd probably actually prefer if variable == true and if variable == false despite its redundant nature.

However, it would probably be best to get rid of almost all of the != nil in favor of actual .is_a?(Type) checks to make sure the data is the structure we were looking for instead of just not nil.

matter123 commented 5 years ago

So if @arguments[:tag_as] would become if @arguments[:tag_as].is_a? String ?

jeff-hykin commented 5 years ago

Hmm that is a good question. The validity of @arguments[:tag_as] (e.g. checking that its a string) would be checked at the initializer. Having two String checks would be redundant, and if for some reason we wanted to change it, we'd then have to change the check in two places and one would be easy to forget and would cause an edge case bug.

I'd say thats actually a good place to do a != nil check, since we only want to make sure that the value exists

jeff-hykin commented 5 years ago

Also, if you want to get rid of the optimization of the outer group, I think thats fine. I doubt it provides a performance benefit, I originally did it so that Ruby could generate JSON patterns that looked more like they were hand-written. The optimization group adds complexity, and is probably the reason this issue #339 was caused in the first place. If you don't think its a big deal to keep it though thats fine with me too

matter123 commented 5 years ago

It already exists in the pattern class and having the group attributes carry around what their group number is, minimizes a lot of the risk of forgetting to pass into some helper was_first_group_removed.

jeff-hykin commented 5 years ago

I wasn't expecting all of the MaybePattern < Pattern classes, but I think thats a very clean way of doing it by overriding the do_modify_regex and do_get_to_s_name, great thinking 👍

jeff-hykin commented 5 years ago

Oh and btw, the :repository_name does not need to be carried over. Thats something that should probably not exist.

jeff-hykin commented 5 years ago

This is genius, I was just planning on using the remove_numbered_capture_groups

jeff-hykin commented 5 years ago

I think for the MaybePattern we can do similar to what you've done with the LookAroundPattern

Quantifiers and groups can be multiple separate options, grouped, not-grouped, capturing group, non-capturing group, atomic group, non-atomic, and then all possible quantifier permutations.

I think we can handle all of that logic in one place, and the sub pattern-class, like MaybePattern can just set those options and save its output type for debugging.

jeff-hykin commented 5 years ago

for the @regex, lets change it to @match which will make more readable sense. We can also add a to_r method to the Regexp class so that we don't have to check the type e.g. we can just call @match.to_r and get the regex form

we can change @next_regex to @next_match as well

matter123 commented 5 years ago

Sounds good.

Maybe @next_pattern? That is always a pattern.

jeff-hykin commented 5 years ago

oh, if its always a pattern then yeah @next_pattern would be best

jeff-hykin commented 5 years ago

I'll go ahead and make that change 👍

matter123 commented 5 years ago

Regexp#to_r should probably take (and ignore) groups

jeff-hykin commented 5 years ago

Hey I changed part of the way the to_r works in order to accomodate the .or() pattern and keep it's logic by itself. Basically do_modify_regex() handles the @match, and @next_pattern.integrate_regex() handles the @match + @next_pattern

(integrate_regex() isn't a great name, but I can't think of a better one)

I also changed the regex_to_s() into a .to_r_s method that exists on Regexp

I also changed the atomic? into is_single_entity?. I still don't think is_single_entity? is a good name, but I wanted to add it to the Regexp class and having a atomic? on a Regexp class that doesn't correlate with regex-atomicness would've been worse. Maybe singular? is a better name

jeff-hykin commented 5 years ago

The changes are all merged with your changes (e.g. PatternRange), I just wanted to give you a summary

matter123 commented 5 years ago

Maybe is_single_group? and append_pattern ?

jeff-hykin commented 5 years ago

is_single_group is a bit misleading since single-characters are also valid, even though they're not a group.

it is kind of the reverse of append_pattern. More like the_item.append_self_into(the_list)

jeff-hykin commented 5 years ago

okay, I fixed the integrate_regex() since the default value was messed up. I also added the .or() pattern class. I think I'm done for today, so now you don't have to worry about me messing anything up ¯_(ツ)_/¯

Feel free to rename things as you see fit

jeff-hykin commented 5 years ago

I thought this re-write would take months, I'm really impressed with how much progress has been finished.

the only major things Pattern needs is

~reTag()~
~runTests()~
~replace $match and $reference()~
:word_cannot_be_any_of
~dont_backtrack?~
as_few_as_possible
pattern linting (matches nothing, newline, empty, common prefix)

For the Grammar object, I think we can add a .to_tag method on the Strings class, Hash class, Array class, and Symbol class so that the Grammar object can just call to_tag and have all of them handle the recursion. The grammar object can basically do that first (convert the whole grammar into JSON/Hash) and then it can handle converting the $initial_context into $self or $base.

That's pretty much all that needs to be done to the Library.

matter123 commented 5 years ago

To_tag also needs to convert includes to patterns

jeff-hykin commented 5 years ago

Yeah, includes should be an Array, so the Array will just have a recursive to_tag method

matter123 commented 5 years ago

$match and $reference now resolve to the correct captures.

matter123 commented 5 years ago

Should there be a change in the behavior of should_not_fully_match? Quoting contributing.md

should_not_fully_match asserts that the pattern does not match all the >characters in the test strings. note: should_not_fully_match does not imply should_partial_match, that is, a failure to match satisfies should_not_fully_match

jeff-hykin commented 5 years ago

yeah the wording is a little confusing. If it fully matches, then the test should fail, all other cases should pass. I'm not sure if the code enforces that or not

matter123 commented 5 years ago

This is how I implemented the should_not_fully_match unit test

if @arguments[:should_not_fully_match].is_a? Array
    test_regex = /^(?:#{self_regex})$/
    if @arguments[:should_not_fully_match].none? {|test| test =~ test_regex} == false
        warn(:should_not_fully_match)
    end
end

jeff-hykin commented 5 years ago

I'm actually not sure how .none works, but I'll check it out next time I'm in the code.

matter123 commented 5 years ago

foo.none? {|value| condition} is equivlent to !foo.any? {|value| condition} and foo.all? {|value| !condition}

matter123 commented 5 years ago

You have Regexp#is_single_entity? return nil. What is the advantage of nil over false?

matter123 commented 5 years ago

Regexp#is_single_entity? now walks over the regex to determine if an arbitrary regex is, in fact, a single entity.

It works by keeping track of the current nesting "depth"

[ and ( increment the depth
] and ) decrement the depth
sets prevent ( and ) from changing the depth
escape sequences prevent all 4 characters from changing the depth
if the depth is ever 0 at any point other than the beginning of the string, return false
if, in the end, depth, escaped?, or in_set? are not their expected values, assume parsing error, display a warning and return false.
otherwise, return true.

matter123 commented 5 years ago

We should investigate using named groups. This could potentially eliminate all forms of looping through the groups to match reference names to group numbers.

matter123 commented 5 years ago

Well node-oniguruma's findNextMatch does not allow index groups by group name. As a result, calculating the group number is something that needs to happen anyway.

Though it may still be useful for backreference and subexp as those are parsed by the regex engine and therefore do support named groups.

jeff-hykin / better-cpp-syntax

Rewrite of internal regex generation. #339

Current behavior

New Behavior