Feature/regular expression metafunction

hsutter / cppfront

A personal experimental C++ Syntax 2 -> Syntax 1 compiler

Other

5.39k stars 232 forks source link

Feature/regular expression metafunction #904

Closed MaxSagebaum closed 1 month ago

MaxSagebaum commented 8 months ago

I will update this overview such that it is easy to grasp the status of the implementation.

Example file: example.cpp2

example: @regex type = {
  regex := "ab*bd";
}
main: (args) = {
    r := example().regex.search("abbbbbdfoo");
    std::cout << "got: (r.group(0))$" << std::endl;
}

Current status and planned on doing

Modifiers

 - [x] i                Do case-insensitive pattern matching. For example, "A" will match "a" under /i.
 - [x] m                Treat the string being matched against as multiple lines. That is, change "^" and "$" from matching the start of the string's first line and the end of its last line to matching the start and end of each line within the string.
 - [x] s                Treat the string as single line. That is, change "." to match any character whatsoever, even a newline, which normally it would not match.
 - [x] x and xx         Extend your pattern's legibility by permitting whitespace and comments. Details in "/x and /xx"
 - [x] n                Prevent the grouping metacharacters () from capturing. This modifier, new in 5.22, will stop $1, $2, etc... from being filled in.
 - [ ] c                keep the current position during repeated matching

Escape sequences (Complete)

 - [x] \t          tab                   (HT, TAB)
 - [x] \n          newline               (LF, NL)
 - [x] \r          return                (CR)
 - [x] \f          form feed             (FF)
 - [x] \a          alarm (bell)          (BEL)
 - [x] \e          escape (think troff)  (ESC)
 - [x] \x{}, \x00  character whose ordinal is the given hexadecimal number
 - [x] \o{}, \000  character whose ordinal is the given octal number

Quantifiers (Complete)

 - [x] *           Match 0 or more times
 - [x] +           Match 1 or more times
 - [x] ?           Match 1 or 0 times
 - [x] {n}         Match exactly n times
 - [x] {n,}        Match at least n times
 - [x] {,n}        Match at most n times
 - [x] {n,m}       Match at least n but not more than m times
 - [x] *?        Match 0 or more times, not greedily
 - [x] +?        Match 1 or more times, not greedily
 - [x] ??        Match 0 or 1 time, not greedily
 - [x] {n}?      Match exactly n times, not greedily (redundant)
 - [x] {n,}?     Match at least n times, not greedily
 - [x] {,n}?     Match at most n times, not greedily
 - [x] {n,m}?    Match at least n but not more than m times, not greedily
 - [x] *+     Match 0 or more times and give nothing back
 - [x] ++     Match 1 or more times and give nothing back
 - [x] ?+     Match 0 or 1 time and give nothing back
 - [x] {n}+   Match exactly n times and give nothing back (redundant)
 - [x] {n,}+  Match at least n times and give nothing back
 - [x] {,n}+  Match at most n times and give nothing back
 - [x] {n,m}+ Match at least n but not more than m times and give nothing back

Character Classes and other Special Escapes (Complete)

 - [x] [...]     [1]  Match a character according to the rules of the
                    bracketed character class defined by the "...".
                    Example: [a-z] matches "a" or "b" or "c" ... or "z"
 - [x] [[:...:]] [2]  Match a character according to the rules of the POSIX
                    character class "..." within the outer bracketed
                    character class.  Example: [[:upper:]] matches any
                    uppercase character.
 - [x] \g1       [5]  Backreference to a specific or previous group,
 - [x] \g{-1}    [5]  The number may be negative indicating a relative
                  previous group and may optionally be wrapped in
                  curly brackets for safer parsing.
 - [x] \g{name}  [5]  Named backreference
 - [x] \k<name>  [5]  Named backreference
 - [x] \k'name'  [5]  Named backreference
 - [x] \k{name}  [5]  Named backreference
 - [x] \w        [3]  Match a "word" character (alphanumeric plus "_", plus
                    other connector punctuation chars plus Unicode
                    marks)
 - [x] \W        [3]  Match a non-"word" character
 - [x] \s        [3]  Match a whitespace character
 - [x] \S        [3]  Match a non-whitespace character
 - [x] \d        [3]  Match a decimal digit character
 - [x] \D        [3]  Match a non-digit character
 - [x] \v        [3]  Vertical whitespace
 - [x] \V        [3]  Not vertical whitespace
 - [x] \h        [3]  Horizontal whitespace
 - [x] \H        [3]  Not horizontal whitespace
 - [x] \1        [5]  Backreference to a specific capture group or buffer.
                    '1' may actually be any positive integer.
 - [x] \N        [7]  Any character but \n.  Not affected by /s modifier
 - [x] \K        [6]  Keep the stuff left of the \K, don't include it in $&

Assertions

 - [x] \b     Match a \w\W or \W\w boundary
 - [x] \B     Match except at a \w\W or \W\w boundary
 - [x] \A     Match only at beginning of string
 - [x] \Z     Match only at end of string, or before newline at the end
 - [x] \z     Match only at end of string
 - [ ] \G     Match only at pos() (e.g. at the end-of-match position
          of prior m//g)

Capture groups (Complete)

 - [x] (...)

Quoting metacharacters (Complete)

 - [x] For ^.[]$()*{}?+|\

Extended Patterns

 - [x] (?<NAME>pattern)            Named capture group
 - [x] (?#text)                    Comments
 - [x] (?adlupimnsx-imnsx)         Modification for surrounding context
 - [x] (?^alupimnsx)               Modification for surrounding context
 - [x] (?:pattern)                 Clustering, does not generate a group index.
 - [x] (?adluimnsx-imnsx:pattern)  Clustering, does not generate a group index and modifications for the cluster.
 - [x] (?^aluimnsx:pattern)        Clustering, does not generate a group index and modifications for the cluster.
 - [x] (?|pattern)                 Branch reset
 - [x] (?'NAME'pattern)            Named capture group
 - [ ] (?(condition)yes-pattern|no-pattern)  Conditional patterns.
 - [ ] (?(condition)yes-pattern)             Conditional patterns.
 - [ ] (?>pattern)                 Atomic patterns. (Disable backtrack.)
 - [ ] (*atomic:pattern)           Atomic patterns. (Disable backtrack.)

Lookaround Assertions

 - [x] (?=pattern)                     Positive look ahead.
 - [x] (*pla:pattern)                  Positive look ahead.
 - [x] (*positive_lookahead:pattern)   Positive look ahead.
 - [x] (?!pattern)                     Negative look ahead.
 - [x] (*nla:pattern)                  Negative look ahead.
 - [x] (*negative_lookahead:pattern)   Negative look ahead.
 - [ ] (?<=pattern)                    Positive look behind.
 - [ ] (*plb:pattern)                  Positive look behind.
 - [ ] (*positive_lookbehind:pattern)  Positive look behind.
 - [ ] (?<!pattern)                    Negative look behind.
 - [ ] (*nlb:pattern)                  Negative look behind.
 - [ ] (*negative_lookbehind:pattern)  Negative look behind.

Special Backtracking Control Verbs

 - [ ] (*PRUNE) (*PRUNE:NAME)   No backtracking over this point.
 - [ ] (*SKIP) (*SKIP:NAME)     Start next search here.
 - [ ] (*MARK:NAME) (*:NAME)    Place a named mark.
 - [ ] (*THEN) (*THEN:NAME)     Like PRUNE.
 - [ ] (*COMMIT) (*COMMIT:arg)  Stop searching.
 - [ ] (*FAIL) (*F) (*FAIL:arg) Fail the pattern/branch.
 - [ ] (*ACCEPT) (*ACCEPT:arg)  Accept the pattern/subpattern.

Not planned (Mainly because of Unicode or perl specifics)

Modifiers

 - [ ] p                Preserve the string matched such that ${^PREMATCH}, ${^MATCH}, and ${^POSTMATCH} are available for use after matching.
 - [ ] a, d, l, and u   These modifiers, all new in 5.14, affect which character-set rules (Unicode, etc.) are used, as described below in "Character set modifiers".
 - [ ] g                globally match the pattern repeatedly in the string
 - [ ] e                evaluate the right-hand side as an expression
 - [ ] ee               evaluate the right side as a string then eval the result
 - [ ] o                pretend to optimize your code, but actually introduce bugs
 - [ ] r                perform non-destructive substitution and return the new value

Escape sequences

 - [ ] \cK         control char          (example: VT)
 - [ ] \N{name}    named Unicode character or character sequence
 - [ ] \N{U+263D}  Unicode character     (example: FIRST QUARTER MOON)
 - [ ] \l          lowercase next char (think vi)
 - [ ] \u          uppercase next char (think vi)
 - [ ] \L          lowercase until \E (think vi)
 - [ ] \U          uppercase until \E (think vi)
 - [ ] \Q          quote (disable) pattern metacharacters until \E
 - [ ] \E          end either case modification or quoted section, think vi

Character Classes and other Special Escapes

 - [ ]  (?[...])  [8]  Extended bracketed character class
 - [ ] \pP       [3]  Match P, named property.  Use \p{Prop} for longer names
 - [ ] \PP       [3]  Match non-P
 - [ ] \X        [4]  Match Unicode "eXtended grapheme cluster"
 - [ ] \R        [4]  Linebreak

Assertions

 - [ ] \b{}   Match at Unicode boundary of specified type
 - [ ] \B{}   Match where corresponding \b{} doesn't match

Extended Patterns

 - [ ] (?{ code })                 Perl code execution.
 - [ ] (*{ code })                 Perl code execution.
 - [ ] (??{ code })                Perl code execution.
 - [ ] (?PARNO) (?-PARNO) (?+PARNO) (?R) (?0)       Recursive subpattern.
 - [ ] (?&NAME)                   Recursive subpattern.

Script runs

 - [ ] (*script_run:pattern)         All chars in pattern need to be of the same script.
 - [ ] (*sr:pattern)                 All chars in pattern need to be of the same script.
 - [ ] (*atomic_script_run:pattern)  Without backtracking.
 - [ ] (*asr:pattern)                Without backtracking.

MaxSagebaum commented 8 months ago

I am aiming to implement the POSIX extended specification with a few extras from perl. All in all I am sticking to the perl interpretation of regular expressions.

MaxSagebaum commented 8 months ago

The feature set is now complete as stated in https://en.wikipedia.org/wiki/Regular_expression. I grabbed the test suite from https://wiki.haskell.org/Regex_Posix https://hackage.haskell.org/package/regex-posix-unittest. I am currently working my way through the tests by fixing all the corner cases. Especially, the greed nature of * and the backtracking if grabbed to much might require a rework of the matching logic.

MaxSagebaum commented 7 months ago

I finished now the basic implementation and most of the test suite is passed.

Some notable differences with respect to posix ERE:

Alternatives are not greedy: a|ab matches a and not ab in ab.
The match is case sensitive.

I am now looking into performance tests and I will clean up the code.

MaxSagebaum commented 7 months ago

I did some basic performance testing. For this I used the benchmark-exec from https://github.com/hanickadot/compile-time-regular-expressions and extended it to include the regular expression metafunction from cppfront. The first result where:

100 mb file:
ctre: 409 ms
cppfront base (all runtime checks) :           80711 ms
cppfront no runtime checks:                    15783 ms (options: -no-c -no-n -no-s)
cppfront no runtime checks & no copy of state:  3021 ms

As you can see, the runtime checks and the copy of the regex state really hit me. But nevertheless, the result was quite underwhelming. My implementation is 8 times slower than ctre.

Since both are using templates to build up the regular expression, I was wondering and took a closer look. ctre always provides the remainder of the regular expression to each matcher. That is, each matcher can verify if his match is a valid form of the whole regular expression. I did not do this and therefore I had to keep the state of the matchers and some ways to restore it. This was very costly. Fortunately I could adapt my implementation quite easily.

The rewritten version looks now better:

100 mb file
ctre:                           409 ms
cppfront no runtime checks:     250 ms (options: -no-c -no-n -no-s
cppfront with runtime checks: 10918 ms

The runtime checks still hit quite hard. I would not advertise that my implementation is faster than ctre. It is still quite rudimentary and e.g. can not be configured, to ignore case. Therefore, I would only say, that they in the same performance region.

The performance test for on a 4GB file looks similar:

"boost::regex";"ABCD|DEFGH|EFGHI|A{4,}"; 115040 ms
"cppfront";"ABCD|DEFGH|EFGHI|A{4,}";       5154 ms
"CTRE";"ABCD|DEFGH|EFGHI|A{4,}";           8959 ms
"PCRE2";"ABCD|DEFGH|EFGHI|A{4,}";         18784 ms
"PCRE2 (jit)";"ABCD|DEFGH|EFGHI|A{4,}";    5003 ms
"RE2";"ABCD|DEFGH|EFGHI|A{4,}";            9113 ms
"srell";"ABCD|DEFGH|EFGHI|A{4,}";         17065 ms
"std::regex";"ABCD|DEFGH|EFGHI|A{4,}";   110107 ms

Compilation times:

(The pure2-regex.cpp2 file.)
428 regular expressions.
cppfront:
cpp2 -> cpp: 1.37 s
cpp -> exe: 23.82 s

ctre:
cpp -> exe: 733.12 s

I think, we have a real benefit here. With the metafunction, we can parse the regular expression with regular code that builds up the templates. The compiler only needs to compile the template and not parse it.

Now with the performance tests done, I can finally clean up the code. ;)

MaxSagebaum commented 7 months ago

I did a mistake in the cppfront recompilation with the runtime checks. Updated the value from 350 ms to 10918 ms.

JohelEGP commented 7 months ago

Thank you for looking into this.

The initial implementation declares more templates that I wished for (0). With https://github.com/hsutter/cppfront/discussions/797#discussioncomment-7759206, I literally meant lowering the match for /(a|b)/ to return s == 'a' || s == 'b';. For #514, that would mean lowering the match for ^alignas|^alignof|^asm|^as|^auto|… to the merged call to std::find_if.

I understand that is sub-optimal to generalize, as there are better algorithms for string matching. But I think it would be best to start with build-time performance in mind.

The performance test for on a 4GB file looks similar:

Can you explain this in more detail? What is ABCD|DEFGH|EFGHI|A{4,}? What's the regex that is taking CTRE 733 s to compile and @regex 24 s?

MaxSagebaum commented 7 months ago

The initial implementation declares more templates that I wished for (0). With #797 (comment), I literally meant lowering the match for /(a|b)/ to return s == 'a' || s == 'b';. For #514, that would mean lowering the match for ^alignas|^alignof|^asm|^as|^auto|… to the merged call to std::find_if.

Ok, now I understand you and I also get the implication for compile times. The problem is, that this will be very ugly for larger expressions. Also ranges will be quite hard to handle in such a way.

The performance test for on a 4GB file looks similar:

Can you explain this in more detail? What is ABCD|DEFGH|EFGHI|A{4,}? What's the regex that is taking CTRE 733 s to compile and @regex 24 s?

ABCD|DEFGH|EFGHI|A{4,} was the regex that was matched on the file. It was generated with

dd if=/dev/urandom bs=2147483648 count=1 | base64 > test.txt

The ms results are the runtime to search for the first match in each file. All in all, there are only 137 matches or so. Therefore, no early out is possible and the regular expression need to scan the whole line.

The compilation times are for the pure2-regex.cpp2 file which includes 428 regular expressions. I adapted this file so that ctre can also run with it. The rough approximation for the compile time per regular expression would be:

cppfront cpp2 -> cpp: 0.003 seconds per regular expression
cppfront cpp -> exe 0.05 seconds per regular expression
ctre 1.71 seconds per regular expression

Although the number for ctre seems quite high. She reported on a compilation time of 0.1 seconds per regular expressions. Maybe there is one regular expression that is quite intensive on the compiler.

JohelEGP commented 7 months ago

It seems like your solution is doing great in terms of performance.

MaxSagebaum commented 7 months ago

I did now the cleanup of the code.

I consider this now finished from my side. I would appreciate a review and suggestions on the final implementation/solutions. I want to squash the history into one commit.

Some questions that are still open from my side:

[ ] Is the string_util.h ok where it is or should it be moved.
[x] Is the include #include "../source/regex.h" in cpp2util.h ok?

Some things I still need to do:

[ ] Check the todos in regex.h2 and raise issues or reference existing issues.
[x] Write the documentation for regex in reflect.h2. (Missed that on my last commit.)
[x] Fix github regression tests.
[ ] Squash all commits. (I think the history is not that useful.)

JohelEGP commented 7 months ago

Is the include #include "../source/regex.h" in cpp2util.h ok?

That's not OK by convention.

MaxSagebaum commented 7 months ago

Is the include #include "../source/regex.h" in cpp2util.h ok?

That's not OK by convention.

What would be the correct way to solve this? Move source/regex.h to include and create a link to source? Create a link to include? Should the same be done for reflect.h?

JohelEGP commented 7 months ago

Is the include #include "../source/regex.h" in cpp2util.h ok?

That's not OK by convention.

What would be the correct way to solve this? Move source/regex.h to include and create a link to source? Create a link to include? Should the same be done for reflect.h?

Like I do for cpp2reflect.h in #907. Move source/regex.h to include/cpp2regex.h. If something in source/ wants to #include "cpp2regex.h", you can do like source/cpp2util.h.

MaxSagebaum commented 7 months ago

I added the cpp2regex.h to include.

MaxSagebaum commented 7 months ago

Done.

hsutter commented 7 months ago

Thanks again for this. Questions:

There are 126 "Failure:" in the output. Is that intentional?
Is being non-greedy a correctness issue?

MaxSagebaum commented 7 months ago

I just count 63 "Failure". I think you counted two files. ;-)

Just a few notes first:

Implementing the non greedy version is much simpler.
It is basically a convention. Perl regular expressions have the alternative as non greedy.

So the remaining failure cases are just due to the non greedy nature of the alternative match. I though about it in the meantime and thought that I had solution. I tried to implement it. Oh, how I was wrong. But, 3 hours later I have a working version of the greedy alternative match. This is actually a hit on performance, since it is much more involved:

100 Mb text file

"cppfront: greedy-alternative";"ABCD|DEFGH|EFGHI|A{4,}";       251 ms
"cppfront: non-greedy-alternative";"ABCD|DEFGH|EFGHI|A{4,}";  1238 ms
"CTRE";"ABCD|DEFGH|EFGHI|A{4,}";                               411 ms

The greedy nature reduces the "Failures" to 40. The remaining ones seem to come from the interaction of the ranges and alternatives. The best example is class::9. I need to think about this a little bit more, but I do not think that I can change the implementation to make this work. If you want I can elaborate what goes wrong and why it is not possible.

Summary: The failures come from the non-greedy nature and the interaction with the ranges match. This are mostly really hard corner cases. So the regex implementation should be good enough.

Since the greedy nature really hits performance, I made it a compile time constant and left the old behavior

hsutter commented 7 months ago

Thanks! Sorry for the double-counting, I just hit Ctrl-F and let Chrome tell me the total #hits. 😄

Making greediness an option sounds perfect -- then it's don't-pay-for-what-you-don't-use, and it can be compared to both greedy and non-greedy alternative implementations.

When you say interaction of ranges and alternatives, do you mean examples like a-f|e-j, or something else?

Understood about being good enough, if the failure cases are rare edge cases -- other regex implementations have bugs too (including all the three major standard library implementations, I think). As long as it's in the neighborhood of being as complete as others it can be compared, and I'm very interested in doing comparisons.

Here's a litmus test: I was thinking of using this as an example in a talk in April. Would I be credible or criticized if I used this as an example on stage, and compared it to other regex implementations to demonstrate the benefits of a source code generation approach? Is it good enough to be an apples-to-apples comparison, or would there be legitimate "but it takes shortcuts / is only a partial implementaiton" objections?

Also, if we merge this now, what opportunities are there in the next ~2 months to further improve one or more of { run time, compile time, completeness/accuracy-if-needed } ?

Thanks again for this, it's very interesting and already showing strong progress. Much appreciated!

MaxSagebaum commented 7 months ago

Thanks! Sorry for the double-counting, I just hit Ctrl-F and let Chrome tell me the total #hits. 😄

No problem. ;-)

Understood about being good enough, if the failure cases are rare edge cases -- other regex implementations have bugs too (including all the three major standard library implementations, I think). As long as it's in the neighborhood of being as complete as others it can be compared, and I'm very interested in doing comparisons.

I would also like to do comparisons on a larger basis and I think the implementation is already ready for this.

Here's a litmus test: I was thinking of using this as an example in a talk in April. Would I be credible or criticized if I used this as an example on stage, and compared it to other regex implementations to demonstrate the benefits of a source code generation approach? Is it good enough to be an apples-to-apples comparison, or would there be legitimate "but it takes shortcuts / is only a partial implementaiton" objections?

The shortcuts I currently take are just minor ones and could be fixed in a reasonable short time. Using this in a demo should therefore not be a problem. Objections from a general listener should not arise, only regex experts might have objects to the details but should be aware that this is just an example. In order to minimize objections, I would propose two parts of the demo:

Show a simple enough regular expression and compare the performance.
Show a very involved regular expression no body understands and show that it works and compare the performance. (Would have to look for such one myself.)

On point that is very important would be to cite the work of Hana Dusíková with Compile time regular expressions (https://github.com/hanickadot/compile-time-regular-expressions) since this implementation is based/inspired by her work. She also mentioned in one of her last talks that she wants to explore a better way for the compile-time generation.

Also, if we merge this now, what opportunities are there in the next ~2 months to further improve one or more of { run time, compile time, completeness/accuracy-if-needed } ?

Short answer: Yes.

run time: maybe but probably not,
compile time: maybe (I would love to get some hints on how to measure the performance of the compiler)
completeness: yes

Long answer: With regard to the explosion of possible matches I describe below, I would go away from the POSIX ERE compatible implementation to a perl conforming implementation. (Here the alternative | returns the first possible match.) Most of the basic missing perl features could be implemented quite easily. The more advanced ones like look ahead and look behind could take a bit, but are not strictly required.

I would do the chagen, which I would do like this:

Write a converter for the perl regular expression test suite: https://perl5.git.perl.org/perl5.git/tree/HEAD:/t/re and replace the current test suite.
Add the basic perl syntax e.g. \n, *?, *+.
Work on more advanced perl features. (This may take some time and would be a project for the year.)

I would say that like 95% of the perl regex syntax could be done in the next two month. Everything else would require more time. (Just a very rough estimate.)

When you say interaction of ranges and alternatives, do you mean examples like a-f|e-j, or something else?

I was referring to this expression: (aba|a*b)* the test says it should match ababa with the groups: 0 = 0-5 is ababa, and 1 = 2-5 is aba. We have the alternative aba|a*b and two ranges *. Posix requires the alternative to be greedy and the whole match to be greedy. That is, the alternative must choose the one that yields the longest match in general. In the example above first we takeaba. A second aba does not match so we take a*b which only matches b. A third successful match is not possible, so the result is abab which has length of 4. Then we also need to consider a*b for the first iteration/alternative and match ab. The second iteration can now match aba which matches the whole string and has a length of 5. This means the first alternative does not take the longest submatch aba since the shorter submatch ab produces an overall longer match.

Even with (aba|ab)* the number of possibilities is 2^n where n defines how often * can find a match. And all possiblites need to be tried.

Our implementation has the problem, that we do not provide the expression tail to the inner expression of the *. Therefore, it can not decide what is the longest overall match. Simplified it looks like:

r:= M::match(cur, ctx, no_tail<CharT>()); // Match the inner expression without any tail. E.g. 'aba|a*b'
o:= ...; // Check if recursively if 'M' matches again and the tail
if o.matched {
  return o 
}
else if r.matched{ // No other match, try to match the tail.
  return Tail::match(cur, ctx);
}
return false;

I though on how this could be changed, but did not came to a fully satisfying answer. Either we would have an infinite recursion in the compiler. Or we would need an extensive state management in for the ranges. This could be implemented by moving everything from static functions to non static ones, but I am not really sure if it really would solve the problem and how much performance it would cost. I might try this if I have a lot of time left.

MaxSagebaum commented 6 months ago

I compiled a list of the perl regex features and started implementing them last weekend. Most of the main features are finished and I am using the perl regex test suite to validate the implementation. My next goal is to implement the modifiers and afterwards the extended patterns.

From now on I will gather the information in the first post so that a quick summary of the status can be grasped. But I will also ping updates in comments.

MaxSagebaum commented 5 months ago

I did an update on Friday. The initial post is updated. The new features are:

Modifiers

 - [x] m                Treat the string being matched against as multiple lines. That is, change "^" and "$" from matching the start of the string's first line and the end of its last line to matching the start and end of each line within the string.
 - [x] s                Treat the string as single line. That is, change "." to match any character whatsoever, even a newline, which normally it would not match.

Extended Patterns

 - [x] (?adlupimnsx-imnsx)         Modification for surrounding context
 - [x] (?^alupimnsx)               Modification for surrounding context
 - [x] (?:pattern)                 Clustering, does not generate a group index.
 - [x] (?adluimnsx-imnsx:pattern)  Clustering, does not generate a group index and modifications for the cluster.
 - [x] (?^aluimnsx:pattern)        Clustering, does not generate a group index and modifications for the cluster.

MaxSagebaum commented 5 months ago

A further update. The initial post is updated. The new features are:

Modifiers

 - [x] x and xx         Extend your pattern's legibility by permitting whitespace and comments. Details in "/x and /xx"
 - [x] n                Prevent the grouping metacharacters () from capturing. This modifier, new in 5.22, will stop $1, $2, etc... from being filled in.

Extended Patterns

 - [x] (?#text)                    Comments
 - [x] (?|pattern)                 Branch reset
 - [x] (?'NAME'pattern)            Named capture group

MaxSagebaum commented 5 months ago

A currently final update. I added now the lookahead functions.

Escape sequences

 - [x] \x{}, \x00  character whose ordinal is the given hexadecimal number
 - [x] \o{}, \000  character whose ordinal is the given octal number

Lookaround Assertions

 - [x] (?=pattern)                     Positive look ahead.
 - [x] (*pla:pattern)                  Positive look ahead.
 - [x] (*positive_lookahead:pattern)   Positive look ahead.
 - [x] (?!pattern)                     Negative look ahead.
 - [x] (*nla:pattern)                  Negative look ahead.
 - [x] (*negative_lookahead:pattern)   Negative look ahead.

I moved these to will not do, since they are modifiers on the replacement string.

Escape sequences

 - [ ] \l          lowercase next char (think vi)
 - [ ] \u          uppercase next char (think vi)
 - [ ] \L          lowercase until \E (think vi)
 - [ ] \U          uppercase until \E (think vi)
 - [ ] \Q          quote (disable) pattern metacharacters until \E
 - [ ] \E          end either case modification or quoted section, think vi

MaxSagebaum commented 5 months ago

I consider the implementation now finished. The major features are implemented and the remaining ones should not change the basic design that much. This is also because I have not that much time in the next 2 months.

hsutter commented 5 months ago

Thanks! I'll start taking a pass over the failing tests...

MaxSagebaum commented 5 months ago

Thanks. I planed to do that after the review/I am back from a vacation this week.

hsutter commented 5 months ago

[corrected: somehow I initially got a non-current PR]

Thanks! Initial comment: This might be the first major PR that compiles totally clean on the first try for me including at high Cpp1 warning levels. (Almost: MSVC had two unused name warnings, and Clang 12 couldn't handle std::format). Nice -- that's hard to do unless regularly building with all major compilers locally, which understandably not everyone has available.

hsutter commented 5 months ago

I've taken a first pass through and have some commits to push -- please hold any new commits until I can fix my branch(*) and push those, thanks!

(*) boring reasons: because it seems GitHub Desktop got confused with conflicts (which seems impossible since there haven't been any other pushes today) and the solution that seems to work is to reset the branch and reapply the changes but that will take me a little more time

hsutter commented 5 months ago

OK, I've pushed all the commits I have for now from the first review.

✅ Cppfront builds clean
⚠️ The test case file pure2-regex.cpp2 generates cppfront-compile-time metafunction errors for some of the cases.

I've created a (temporary) pure2-regex-partial.cpp2 which is the same as pure2-regex.cpp2 with all the cases commented out that generated metafunction errors. This compiles clean in cppfront, but I still get strange errors from the Cpp1 compiler, which I've narrowed down to this short repro:

// This is a complete file, that doesn't compile using MSVC.current or GCC 10 ?
#define CPP2_IMPORT_STD          Yes
#include "cpp2util.h"

If the macro is commented out, things compile fine. So there's something about the "import std" path that's going wrong, apparently on both MSVC and GCC 10.

I haven't been able to diagnose the problem further than that though.

That's pretty much all I have for the first review -- please let me know what you think. Thanks again!

hsutter commented 5 months ago

Oh, one more thing: In the generated pure2-regex-partial.cpp, I noticed that the longest line is 397,231 characters long. I wonder if that line length would make any tools hiccup? (Thinking out loud: It should be doable to put some line breaks in there as needed, I can look into that.)

hsutter commented 5 months ago

Odd. Now I see GCC 10 and Clang 12 work fine for me with that minimal repro, and I realized that at least part of it seems to be bugs in MSVC modules, that can't compile this:

import std.compat;
#include <string>   // via string_util.h

or this

#include <string>
import std.compat;

These are known problems. The solution is to not try to do both import and #include.

A group of questions, not particularly related:

Does string_util.h need to be a separate file, or can its contents go into cpp2util.h or even common.h?
Do programs that have been compiled with @regex need any features in string_util.h? I did some searching and it seems those are used only in regex.h2 itself, is that right?
Also, I noticed that I probably shouldn't have generated source/regex.h... I'm now thinking you're generating include/cpp2regex.h from /source/regex.h2, is that right?

Hold on changes please while I take another pass over this now... thanks!

hsutter commented 5 months ago

OK, I started to make some tweaks but I think I don't quite grok the structure well enough yet.

All yours to push new commits, I'll wait on the questions above.

One suggestion for a next commit to unbreak MSVC: Maybe remove the #includes from string_util.h, if it'll only be included via cpp2util.h, and then just add #include <sstream> to cpputil.2 "always include" list.

MaxSagebaum commented 5 months ago

I noticed a degradation in compile time and had a look what caused this. The statefull tail was the reason but I could change the implementation to avoid it. Nevertheless, because of some extra functionality the compile time increased.

Currently for the posix test suit it is: 34.26 seconds (was 60.25 seconds) For the pure2-regex.cpp2 it is: 46.65 seconds (was 58.14 seconds)

MaxSagebaum commented 5 months ago

Does string_util.h need to be a separate file, or can its contents go into cpp2util.h or even common.h?

No it does not we could include it in cpp2util.h. Since cpp2regex.h requires this functionality we need to have it in the public interface. So it can not be moved to common.h. I created an extra file, so that the functionality can be better guessed.

Do programs that have been compiled with @regex need any features in string_util.h? I did some searching and it seems those are used only in regex.h2 itself, is that right?

Yes. See answer above.

Also, I noticed that I probably shouldn't have generated source/regex.h... I'm now thinking you're generating include/cpp2regex.h from /source/regex.h2, is that right?

Yes. See answer above.

OK, I started to make some tweaks but I think I don't quite grok the structure well enough yet.

If you want, we can have a chat and I can give you a rough overview.

One suggestion for a next commit to unbreak MSVC: Maybe remove the #includes from string_util.h, if it'll only be included via cpp2util.h, and then just add #include to cpputil.2 "always include" list.

Sounds good.

hsutter commented 5 months ago

Thanks!

I don't quite grok the structure well enough yet.

OK, after sleeping on it I think I just made a simple mistake... I think I get it now.

Let me take another pass over the code today to move a few things around in ways that I hope don't disturb things.

I'll let you know when I'm done, to avoid merge issues.

Then if it turns out that these changes I make are not helpful / destabilize you, please just revert out all my commits on this branch!

hsutter commented 3 months ago

After recovering from April madness: I decided not to move anything around for now, please go ahead and take over the PR and feel free to revert the commits I made previously.

For next steps: I know you were thinking of expanding this to do more code generation to further reduce uses of templates. What would be your preference... to polish this more so we can merge it (in which case I would want to resume reviewing how the files are structured in names/directories as I'd prefer them to be organized a little differently), or to continue onward in this PR branch? Please let me know what you prefer, and thanks again for this!

MaxSagebaum commented 3 months ago

Ok, I finished the paper and have now time again.

I just started with the first steps in changing the code generation away from the template stack. I will see how this goes over the next week. If it does not take that long I would say that we wait until this is finished.

Do you have some general remarks about the MR, so that I can adhere to this in my changes?

MaxSagebaum commented 3 months ago

I used the last week to refactor everything to the new layout. All tests give the same results, only a few warnings I will fix later.

I will do a more in depth comparison of the two new approaches soon, but currently there is a different problem.

It seems that cppfront has problems with the new code. The generation and runtimes are now:	gen	compile
old regex	0.47 s	40 s
new regex	207 s	70 s

I am not that worried about the compile time. I want to do a few optimizations and want to remove nearly all template parameters, this should then improve the compile time.

I am more worried about the generation time. In the timings for the table cppfront was compiled with -O3. I added some timing measurements inside the regex_gen metafunction. Here, the results where select: 0.000000 remove: 0.001000 gen: 0.000000 add: 0.481000 with a total time of 24 seconds. So after the metafunctions run, cppfront takes now quite some time. I did an update to the newest version, which does not help.

A perf on a debug version for a couple of regular expressions gave me the following result:

# Overhead  Command   Shared Object         Symbol                                                                                                                                                                                                                                                                           >
# ........  ........  ....................  .................................................................................................................................................................................................................................................................................>
#
    15.72%  cppfront  cppfront              [.] cpp2::symbol::get_token() const
    12.47%  cppfront  cppfront              [.] std::variant<cpp2::declaration_sym, cpp2::identifier_sym, cpp2::selection_sym, cpp2::compound_sym>::index() const
     8.84%  cppfront  cppfront              [.] cpp2::token::get_global_token_order() const
     6.98%  cppfront  cppfront              [.] cpp2::symbol::get_global_token_order() const
     5.65%  cppfront  cppfront              [.] cpp2::sema::get_declaration_of(cpp2::token const&, bool, bool) const
     4.12%  cppfront  cppfront              [.] bool __gnu_cxx::operator==<cpp2::symbol const*, std::vector<cpp2::symbol, std::allocator<cpp2::symbol> > >(__gnu_cxx::__normal_iterator<cpp2::symbol const*, std::vector<cpp2::symbol, std::allocator<cpp2::symbol> > > const&, __gnu_cxx::__normal_iterator<cpp2::symbol con>
     3.16%  cppfront  cppfront              [.] __gnu_cxx::__normal_iterator<cpp2::symbol const*, std::vector<cpp2::symbol, std::allocator<cpp2::symbol> > >::__normal_iterator(cpp2::symbol const* const&)
     2.90%  cppfront  cppfront              [.] __gnu_cxx::__normal_iterator<cpp2::symbol const*, std::vector<cpp2::symbol, std::allocator<cpp2::symbol> > >::operator->() const
     2.60%  cppfront  cppfront              [.] std::vector<cpp2::symbol, std::allocator<cpp2::symbol> >::cend() const
     2.59%  cppfront  cppfront              [.] std::variant_alternative<1ul, std::variant<cpp2::declaration_sym, cpp2::identifier_sym, cpp2::selection_sym, cpp2::compound_sym> >::type const& std::get<1ul, cpp2::declaration_sym, cpp2::identifier_sym, cpp2::selection_sym, cpp2::compound_sym>(std::variant<cpp2::declar>
     2.41%  cppfront  cppfront              [.] __gnu_cxx::__normal_iterator<cpp2::symbol const*, std::vector<cpp2::symbol, std::allocator<cpp2::symbol> > >::operator++()
     2.30%  cppfront  cppfront              [.] cpp2::identifier_sym::get_token() const
     2.22%  cppfront  cppfront              [.] __gnu_cxx::__normal_iterator<cpp2::symbol const*, std::vector<cpp2::symbol, std::allocator<cpp2::symbol> > >::base() const
     2.17%  cppfront  cppfront              [.] std::variant_alternative<3ul, std::variant<cpp2::declaration_sym, cpp2::identifier_sym, cpp2::selection_sym, cpp2::compound_sym> >::type const& std::get<3ul, cpp2::declaration_sym, cpp2::identifier_sym, cpp2::selection_sym, cpp2::compound_sym>(std::variant<cpp2::declar>
     1.59%  cppfront  cppfront              [.] std::__detail::__variant::_Variadic_union<cpp2::declaration_sym, cpp2::identifier_sym, cpp2::selection_sym, cpp2::compound_sym> const& std::forward<std::__detail::__variant::_Variadic_union<cpp2::declaration_sym, cpp2::identifier_sym, cpp2::selection_sym, cpp2::compoun>
     1.58%  cppfront  cppfront              [.] std::variant_alternative<0ul, std::variant<cpp2::declaration_sym, cpp2::identifier_sym, cpp2::selection_sym, cpp2::compound_sym> >::type const& std::get<0ul, cpp2::declaration_sym, cpp2::identifier_sym, cpp2::selection_sym, cpp2::compound_sym>(std::variant<cpp2::declar>
     1.49%  cppfront  cppfront              [.] cpp2::selection_sym::get_token() const
     1.47%  cppfront  cppfront              [.] decltype(auto) std::__detail::__variant::__get<1ul, std::variant<cpp2::declaration_sym, cpp2::identifier_sym, cpp2::selection_sym, cpp2::compound_sym> const&>(std::variant<cpp2::declaration_sym, cpp2::identifier_sym, cpp2::selection_sym, cpp2::compound_sym> const&)
     1.20%  cppfront  cppfront              [.] decltype(auto) std::__detail::__variant::__get_n<1ul, std::__detail::__variant::_Variadic_union<cpp2::declaration_sym, cpp2::identifier_sym, cpp2::selection_sym, cpp2::compound_sym> const&>(std::__detail::__variant::_Variadic_union<cpp2::declaration_sym, cpp2::identifi>
     1.16%  cppfront  cppfront              [.] cpp2::declaration_sym::get_token() const
     1.10%  cppfront  cppfront              [.] std::variant<cpp2::declaration_sym, cpp2::identifier_sym, cpp2::selection_sym, cpp2::compound_sym> const& std::forward<std::variant<cpp2::declaration_sym, cpp2::identifier_sym, cpp2::selection_sym, cpp2::compound_sym> const&>(std::remove_reference<std::variant<cpp2::de>
     1.00%  cppfront  cppfront              [.] decltype(auto) std::__detail::__variant::__get_n<3ul, std::__detail::__variant::_Variadic_union<cpp2::declaration_sym, cpp2::identifier_sym, cpp2::selection_sym, cpp2::compound_sym> const&>(std::__detail::__variant::_Variadic_union<cpp2::declaration_sym, cpp2::identifi>
     0.96%  cppfront  cppfront              [.] decltype(auto) std::__detail::__variant::__get<3ul, std::variant<cpp2::declaration_sym, cpp2::identifier_sym, cpp2::selection_sym, cpp2::compound_sym> const&>(std::variant<cpp2::declaration_sym, cpp2::identifier_sym, cpp2::selection_sym, cpp2::compound_sym> const&)
     0.88%  cppfront  cppfront              [.] cpp2::sema::get_declaration_of(cpp2::token const&, bool, bool) const::{lambda(__gnu_cxx::__normal_iterator<cpp2::symbol const*, std::vector<cpp2::symbol, std::allocator<cpp2::symbol> > >&, int, __gnu_cxx::__normal_iterator<cpp2::symbol const*, std::vector<cpp2::symbol,>
     0.80%  cppfront  cppfront              [.] decltype(auto) std::__detail::__variant::__get_n<0ul, std::__detail::__variant::_Variadic_union<cpp2::compound_sym> const&>(std::__detail::__variant::_Variadic_union<cpp2::compound_sym> const&)
     0.72%  cppfront  cppfront              [.] decltype(auto) std::__detail::__variant::__get<0ul, std::variant<cpp2::declaration_sym, cpp2::identifier_sym, cpp2::selection_sym, cpp2::compound_sym> const&>(std::variant<cpp2::declaration_sym, cpp2::identifier_sym, cpp2::selection_sym, cpp2::compound_sym> const&)
     0.66%  cppfront  cppfront              [.] __gnu_cxx::__normal_iterator<cpp2::symbol const*, std::vector<cpp2::symbol, std::allocator<cpp2::symbol> > >::difference_type __gnu_cxx::operator-<cpp2::symbol const*, std::vector<cpp2::symbol, std::allocator<cpp2::symbol> > >(__gnu_cxx::__normal_iterator<cpp2::symbol >
     0.61%  cppfront  cppfront              [.] decltype(auto) std::__detail::__variant::__get_n<0ul, std::__detail::__variant::_Variadic_union<cpp2::declaration_sym, cpp2::identifier_sym, cpp2::selection_sym, cpp2::compound_sym> const&>(std::__detail::__variant::_Variadic_union<cpp2::declaration_sym, cpp2::identifi>
     0.52%  cppfront  cppfront              [.] std::__detail::__variant::_Variadic_union<cpp2::compound_sym> const& std::forward<std::__detail::__variant::_Variadic_union<cpp2::compound_sym> const&>(std::remove_reference<std::__detail::__variant::_Variadic_union<cpp2::compound_sym> const&>::type&)

Full test suite on an optimized binary:

# Samples: 783K of event 'cycles:Pu'
# Event count (approx.): 541739679548
#
# Overhead  Command   Shared Object              Symbol                                                                                                                                                                                                                                                                      >
# ........  ........  .........................  ............................................................................................................................................................................................................................................................................>
#
    65.60%  cppfront  cppfront                   [.] cpp2::sema::get_declaration_of(cpp2::token const&, bool, bool) const
    32.10%  cppfront  cppfront                   [.] cpp2::symbol::get_token() const
     0.85%  cppfront  cppfront                   [.] cpp2::is_definite_last_use(cpp2::token const*)
     0.24%  cppfront  cppfront                   [.] cpp2::token::operator==(std::basic_string_view<char, std::char_traits<char> >) const
     0.19%  cppfront  libc.so.6                  [.] __memcmp_avx2_movbe
     0.09%  cppfront  cppfront                   [.] cpp2::declaration_node::gather_type_scope_declarations(cpp2::declaration_node::which) const
     0.07%  cppfront  cppfront                   [.] cpp2::cppfront::source_order_name_lookup(std::basic_string_view<char, std::char_traits<char> >)
     0.06%  cppfront  cppfront                   [.] std::basic_string_view<char, std::char_traits<char> >::starts_with(std::basic_string_view<char, std::char_traits<char> >) const [clone .isra.0]
     0.05%  cppfront  [unknown]                  [k] 0xffffffffad20194c
     0.05%  cppfront  cppfront                   [.] std::basic_string_view<char, std::char_traits<char> >::compare(std::basic_string_view<char, std::char_traits<char> >) const
     0.04%  cppfront  libc.so.6                  [.] malloc
     0.04%  cppfront  cppfront                   [.] cpp2::declaration_node::get_decl_if_type_scope_object_name_before_a_base_type(std::basic_string_view<char, std::char_traits<char> >) const
     0.03%  cppfront  cppfront                   [.] cpp2::token::operator==(cpp2::token const&) const

There seems to be problem in sema::get_declaration_of. Any pointers are welcome. I will try to look into the problem.

The generated code for ABCD|DEFGH|EFGHI|A{4,} looks now like:

regex: type = {
  group_count: type == std::integral_constant<int, 1>;  //TODO: Use static constexpr when alpha limitation of nested types declarations is lifted.
  initial_flags: type == std::integral_constant<int, 0>;  //TODO: Use static constexpr when alpha limitation of nested types declarations is lifted.
    func_1: @struct type = {
      operator(): (this, cur, inout ctx, modifiers, other) -> _ = {
        r := ctx.pass(cur);
        if !cpp2::regex::char_token_matcher<char, 'A', 'a', 'A'>(r.pos, ctx, modifiers) { r = ctx.fail(); return r; }
        if !cpp2::regex::char_token_matcher<char, 'B', 'b', 'B'>(r.pos, ctx, modifiers) { r = ctx.fail(); return r; }
        if !cpp2::regex::char_token_matcher<char, 'C', 'c', 'C'>(r.pos, ctx, modifiers) { r = ctx.fail(); return r; }
        if !cpp2::regex::char_token_matcher<char, 'D', 'd', 'D'>(r.pos, ctx, modifiers) { r = ctx.fail(); return r; }

        r = other(r.pos, ctx, modifiers);
        return r;
      }
    }
    func_2: @struct type = {
      operator(): (this, cur, inout ctx, modifiers, other) -> _ = {
        r := ctx.pass(cur);
        if !cpp2::regex::char_token_matcher<char, 'D', 'd', 'D'>(r.pos, ctx, modifiers) { r = ctx.fail(); return r; }
        if !cpp2::regex::char_token_matcher<char, 'E', 'e', 'E'>(r.pos, ctx, modifiers) { r = ctx.fail(); return r; }
        if !cpp2::regex::char_token_matcher<char, 'F', 'f', 'F'>(r.pos, ctx, modifiers) { r = ctx.fail(); return r; }
        if !cpp2::regex::char_token_matcher<char, 'G', 'g', 'G'>(r.pos, ctx, modifiers) { r = ctx.fail(); return r; }
        if !cpp2::regex::char_token_matcher<char, 'H', 'h', 'H'>(r.pos, ctx, modifiers) { r = ctx.fail(); return r; }

        r = other(r.pos, ctx, modifiers);
        return r;
      }
    }
    func_3: @struct type = {
      operator(): (this, cur, inout ctx, modifiers, other) -> _ = {
        r := ctx.pass(cur);
        if !cpp2::regex::char_token_matcher<char, 'E', 'e', 'E'>(r.pos, ctx, modifiers) { r = ctx.fail(); return r; }
        if !cpp2::regex::char_token_matcher<char, 'F', 'f', 'F'>(r.pos, ctx, modifiers) { r = ctx.fail(); return r; }
        if !cpp2::regex::char_token_matcher<char, 'G', 'g', 'G'>(r.pos, ctx, modifiers) { r = ctx.fail(); return r; }
        if !cpp2::regex::char_token_matcher<char, 'H', 'h', 'H'>(r.pos, ctx, modifiers) { r = ctx.fail(); return r; }
        if !cpp2::regex::char_token_matcher<char, 'I', 'i', 'I'>(r.pos, ctx, modifiers) { r = ctx.fail(); return r; }

        r = other(r.pos, ctx, modifiers);
        return r;
      }
    }
    func_5: @struct type = {
      operator(): (this, cur, inout ctx, modifiers, other) -> _ = {
        r := ctx.pass(cur);
        if !cpp2::regex::char_token_matcher<char, 'A', 'a', 'A'>(r.pos, ctx, modifiers) { r = ctx.fail(); return r; }

        r = other(r.pos, ctx, modifiers);
        return r;
      }
    }
    func_4: @struct type = {
      operator(): (this, cur, inout ctx, modifiers, other) -> _ = {
        r := ctx.pass(cur);

        r = cpp2::regex::range_token_matcher<char, 4, -1, 2>::match(r.pos, ctx, modifiers, func_5(), cpp2::regex::no_reset(), other, func_6());
        return r;
      }
    }
    func_6: @struct type = {
      operator(): (this, cur, inout ctx, modifiers, other) -> _ = {
        r := ctx.pass(cur);

        r = other(r.pos, ctx, modifiers);
        return r;
      }
    }
    func_0: @struct type = {
      operator(): (this, cur, inout ctx, modifiers, other) -> _ = {
        r := ctx.pass(cur);

        r = cpp2::regex::alternative_token_matcher<char>::match(r.pos, ctx, modifiers, other, func_7() , func_1(), cpp2::regex::no_reset(), func_2(), cpp2::regex::no_reset(), func_3(), cpp2::regex::no_reset(), func_4(), cpp2::regex::no_reset());
        return r;
      }
    }
    func_7: @struct type = {
      operator(): (this, cur, inout ctx, modifiers, other) -> _ = {
        r := ctx.pass(cur);

        r = other(r.pos, ctx, modifiers);
        return r;
      }
    }
  entry: (cur, inout ctx, modifiers) -> _ = {
    ctx.set_group_start(0, cur);
    r := func_0()(cur, ctx, modifiers, cpp2::regex::true_end_func());
    if r.matched { ctx.set_group_end(0, r.pos); }
    return r;
  }
  to_string: () -> std::string = { return "ABCD|DEFGH|EFGHI|A{4,}|"; }
get_named_group_index: (name) -> int = {
  _ = name;
  return -1;
}
}

hsutter commented 3 months ago

Do you have some general remarks about the MR, so that I can adhere to this in my changes?

Catching up on this: Nothing concrete at this point. I was started doing a small reorganization to move the files to /include but I never bottomed out on that -- and to do it properly should probably wait for me to finish supporting .h2 files that are not included as .h2 (which has been on my todo list).

So, no general remarks...

hsutter commented 3 months ago

I am more worried about the generation time.

Me too. I wonder if it's

a) a regression? e.g., does the new code on the earlier-2024 cppfront show a performance difference?

b) that the new code is exercising some slow paths more? e.g., does it generate a lot more source tokens than before? if so it might be hitting a bottleneck with my using a deque for those extra tokens, which is problematic and I've been meaning to fix that

In the timings for the table cppfront was compiled with -O3. I added some timing measurements inside the regex_gen metafunction. Here, the results where select: 0.000000 remove: 0.001000 gen: 0.000000 add: 0.481000 with a total time of 24 seconds. So after the metafunctions run, cppfront takes now quite some time. I did an update to the newest version, which does not help.

By "add" I think you mean actually adding and parsing new generated code? That would definitely exercise logic like the token deque, if that's the problem (I need to measure). Do you know if it's generating a lot more Cpp2 source code (tokens) than the previous version that was faster? That would be a strong indicator for me to focus there first.

There seems to be problem in sema::get_declaration_of. Any pointers are welcome. I will try to look into the problem.

For this one, I wonder if there might be a regression from a recent merged PR? There was an overhaul of the symbol table navigation logic not too long ago... I've noticed that the pure2-last-use.cpp2 test case runs noticeably longer than the others, and it's stress testing that part of the code. That test case was part of #887.

An idea, but only if this is easy: If your new regex code is reasonably easy to try on a pre-#887-merge snapshot of cppfront, one quick test would be to check the timings of your new regex code before vs after #887 was merged. If there's a big perf difference, when we know some commit/merge (not necessarily #887, could be something else later) is part of the problem. But again, please don't do a lot of work for that, only if it's easy to test quickly.

The above is a good start on performance data -- once we figure out what's the hot path for your regex now, we can target that. I'll make that a priority -- not just because I care about regex (I do, this is very interesting) but because it's the best reflection+generation stress test yet (until now the reflection+generation has been fairly tame useful-but-local things).

MaxSagebaum commented 3 months ago

By "add" I think you mean actually adding and parsing new generated code? That would definitely exercise logic like the token deque, if that's the problem (I need to measure).

No, I added some timing tests in the metafunction logic. The values represent the different steps in the regex metafunction. In total it takes 0.482 seconds. The major portion being add: 0.481000, which is the call to t.add_member(..). So, the generation of the code takes nearly no time. The whole 207 seconds are after the metafunction is finished.

Do you know if it's generating a lot more Cpp2 source code (tokens) than the previous version that was faster? That would be a strong indicator for me to focus there first.

The old regex version generated just one template definition, e.g.:

regex: type = regular_expression<...>; // The ... was a very long template definitions with nested templates.

The new regex version generates the code I have shown above. So there is a lot more tokens cppfront needs to parse.

a) a regression? e.g., does the new code on the earlier-2024 cppfront show a performance difference?

b) that the new code is exercising some slow paths more? e.g., does it generate a lot more source tokens than before? if so it might be hitting a bottleneck with my using a deque for those extra tokens, which is problematic and I've been meaning to fix that

I think it is mixture of both. As you suggested, I did a test around the #887 merge request. There seems to be a rebase on cppfront since it was merged. In github the object is 68f93d1, on the current repo the object is 510eae8. It seems that #887 introduced the regression. But also beforehand 13.86, seconds is quite long.

	gen	compile
before 887 (8412611)	13.86 s	67.61 s
after 887 (510eae8)	195.33 s	67.99 s

MaxSagebaum commented 3 months ago

Just a small update. I removed the modifiers from the arguments. The new stats are now:

	gen	compile
old regex	0.47 s	40 s
new regex	207 s	70 s
new regex removed modifier argument	121 s	75 s

So it seems that removing the argument helped the cppfront analysis. The next step is to get rid of the struct wrapper for the functions. This should then reduce the compile time by a lot.

hsutter commented 3 months ago

[Corrected for major/minor problem]

This is great, thanks! I think you've identified the two "low-hanging fruit" things to target: The generated token handling, and looking into #887.

1. Generation perf (minor problem)

By "add" I think you mean actually adding and parsing new generated code? That would definitely exercise logic like the token deque, if that's the problem (I need to measure).

[...] The major portion being add: 0.481000, which is the call to t.add_member(..).

Right, that add_member call does only one expensive thing: It tokenizes and parses the string. I'll focus on measuring and improving that. My first suspect is that deque, but I'll measure...

2. 887 perf (major problem)

It seems that #887 introduced the regression.

OK, I'll take a look.

@JohelEGP if you're available, that was your PR so you're most familiar with the code... would you be able to take a look at the before/after #887 and see if there's a way to optimize sema::get_declaration_of specifically and the pure-last-use.cpp2 test case in general?

hsutter commented 3 months ago

I removed the modifiers from the arguments.

The next step is to get rid of the struct wrapper for the functions.

👍 Paring down some things like that is great for debugging (or working around) this perf issue, thanks!

I do want performance not to be a hindrance for you to be able to write whatever modifiers and wrappers are best for your code, of course. So once we fix this you should be able to put any of these back that you want.

MaxSagebaum commented 3 months ago

I removed the modifiers from the arguments.

The next step is to get rid of the struct wrapper for the functions.
👍 Paring down some things like that is great for debugging (or working around) this perf issue, thanks!

There seems, to be still an misunderstanding. ;-)

From the original 207 seconds. The parsing of the code takes about zero seconds, the application of the metafunction regex can be broken down into two steps. The parsing and generation of the regex takes also nearly zero time. Adding the generated code with t.add_member(...) takes most of the time of the metafunction which is 0.5 seconds. The remaining 206 seconds are after the metafunction has been applied. So it is just the sema analysis and the writing of the code. It seems that the sema analysis is taking up all the time.

I do want performance not to be a hindrance for you to be able to write whatever modifiers and wrappers are best for your code, of course. So once we fix this you should be able to put any of these back that you want.

The modifications I am doing are not because of the bottleneck in cppfront. They are for reducing the compile time by removing the strain on the template engine. My goal is to to remove all templates from the code generated inside the matcher. Only the matcher will have a few template arguments.

The currently larger compile time is due to the fact that the template engine can no longer identify common instantiations. The current approach generates a different set of functions/matchers for each regular expression. In the previous setup the template engine could cache all previous instantiations and therefore safe some time.

My hope is that generating a C code equivalent will improve compile time.

hsutter commented 3 months ago

The parsing and generation of the regex takes also nearly zero time. Adding the generated code with t.add_member(...) takes most of the time of the metafunction which is 0.5 seconds. The remaining 206 seconds are after the metafunction has been applied. So it is just the sema analysis and the writing of the code. It seems that the sema analysis is taking up all the time.

Ah, got it. So the major issue seems to be #887. I've updated the answer above to correct that. I'll focus my efforts there. Thanks for re-clarifying!

MaxSagebaum commented 3 months ago

Short version: It seems that the code generation approach for the regex metafunction does not provide the benefits we hoped for. Even a version were nearly no templates are used does not have a significant improvement over the template version. Most of the compilation times is used in the optimization step.

In addition the template approach allows the compiler to optimize duplicated template instantiations away. In code generation version does not allow this.

Two questions:

Any suggestions on how to optimize the code generation so that the compiler can optimize it faster?
Should we stay with the template version?

Long version with details:

I did further reduce the number of templates required in the new implementation and I had a look into the compilation performance of gcc.

It seems that reducing the number of template parameters and making the code more like a C implementation does not help the compilation performance. I used the parameter `-ftime-report` to get some insights into the compiler. There seems to be a bug there so that it does not work for the old template version. 39 different regexes: version	gen	compile	opt and gen	callgraph function expansion	template instantiation
old template_gen	0.02	16.74
current	0.32	23.64	16.01	11.68	2.29
current all args templates	0.31	25.4	17.39	12.7	2.36
current no args templates	0.24	27.95	19.57	14.66	2.43

The current version seems to be the optimal one. Although, the differences are not that much.

I took a look into the overhead from the test suite. For this I produced a temprary version where no logic is generated. (Matching always fails here.) It seems that the test suite generates quite an overhead. Currently, the result from the regex depends on the generated matcher, so the template argument for the test status evaluation is always different. In order to speed up the regex tests in general and to have a better comparison I will rewrite this in the future so that the match result no longer depends on the matcher. In the table below this is indicated with the `new_search` postfix, which is a hack to test this. 39 different regexes: version	gen	compile
no_logic	0.16	11.98
current	0.25	17.72
no_logic new_search	0.16	8.6
current new_search	0.24	10.55

One advantage of the template approach is automatic deduplication of the template instantiations. In the current version we are always generating a new object/type. This prevents the compiler to automatically identify common template instantiations. Each instantiation of e.g. the test evaluation function will be different. I created a test with 100 times the simple regex abc. The old template version is much faster here. I manually did a deduplication by creating the matcher only once and using it in the different regex instantiations. But even with the manual deduplication the template approach has faster compile times.

version	gen	compile	opt and gen	callgraph function expansion	template instantiation
no_logic	0.11	26.6	18.28	10.81	2.84
current	0.47	47.34	34.99	25.84	4
current manual deduplicate	0.02	12	7.15	5.1	1.37
template_gen	0.04	5.03	2.17	1.65	0.65

Here the current generation for the regex ABCD|DEFGH|EFGHI|A{4,}:

regex_0: cpp2::regex::regular_expression<char, regex_0_matcher> = ();
regex_0_matcher : type = {
 wrap: <Iter, CharT> type = {
  context: type == cpp2::regex::match_context<CharT, Iter, 1>;    func_1: @struct type = {
      operator(): (this, cur: Iter, inout ctx: context, other) -> cpp2::regex::match_return<Iter> = {
        r := ctx.pass(cur);
        if !cpp2::regex::char_token_matcher<char, 'A'>(r.pos, ctx) { r = ctx.fail(); return r; }
        if !cpp2::regex::char_token_matcher<char, 'B'>(r.pos, ctx) { r = ctx.fail(); return r; }
        if !cpp2::regex::char_token_matcher<char, 'C'>(r.pos, ctx) { r = ctx.fail(); return r; }
        if !cpp2::regex::char_token_matcher<char, 'D'>(r.pos, ctx) { r = ctx.fail(); return r; }

        r = other(r.pos, ctx);
        return r;
      }
    }
    func_2: @struct type = {
      operator(): (this, cur: Iter, inout ctx: context, other) -> cpp2::regex::match_return<Iter> = {
        r := ctx.pass(cur);
        if !cpp2::regex::char_token_matcher<char, 'D'>(r.pos, ctx) { r = ctx.fail(); return r; }
        if !cpp2::regex::char_token_matcher<char, 'E'>(r.pos, ctx) { r = ctx.fail(); return r; }
        if !cpp2::regex::char_token_matcher<char, 'F'>(r.pos, ctx) { r = ctx.fail(); return r; }
        if !cpp2::regex::char_token_matcher<char, 'G'>(r.pos, ctx) { r = ctx.fail(); return r; }
        if !cpp2::regex::char_token_matcher<char, 'H'>(r.pos, ctx) { r = ctx.fail(); return r; }

        r = other(r.pos, ctx);
        return r;
      }
    }
    func_3: @struct type = {
      operator(): (this, cur: Iter, inout ctx: context, other) -> cpp2::regex::match_return<Iter> = {
        r := ctx.pass(cur);
        if !cpp2::regex::char_token_matcher<char, 'E'>(r.pos, ctx) { r = ctx.fail(); return r; }
        if !cpp2::regex::char_token_matcher<char, 'F'>(r.pos, ctx) { r = ctx.fail(); return r; }
        if !cpp2::regex::char_token_matcher<char, 'G'>(r.pos, ctx) { r = ctx.fail(); return r; }
        if !cpp2::regex::char_token_matcher<char, 'H'>(r.pos, ctx) { r = ctx.fail(); return r; }
        if !cpp2::regex::char_token_matcher<char, 'I'>(r.pos, ctx) { r = ctx.fail(); return r; }

        r = other(r.pos, ctx);
        return r;
      }
    }
    func_5: @struct type = {
      operator(): (this, cur: Iter, inout ctx: context, other) -> cpp2::regex::match_return<Iter> = {
        r := ctx.pass(cur);
        if !cpp2::regex::char_token_matcher<char, 'A'>(r.pos, ctx) { r = ctx.fail(); return r; }

        r = other(r.pos, ctx);
        return r;
      }
    }
    func_4: @struct type = {
      operator(): (this, cur: Iter, inout ctx: context, other) -> cpp2::regex::match_return<Iter> = {
        r := ctx.pass(cur);

        r = cpp2::regex::range_token_matcher<char, 4, -1, 2>::match(r.pos, ctx, func_5(), cpp2::regex::no_reset(), other, func_6());
        return r;
      }
    }
    func_6: @struct type = {
      operator(): (this, cur: Iter, inout ctx: context, other) -> cpp2::regex::match_return<Iter> = {
        r := ctx.pass(cur);

        r = other(r.pos, ctx);
        return r;
      }
    }
    func_0: @struct type = {
      operator(): (this, cur: Iter, inout ctx: context, other) -> cpp2::regex::match_return<Iter> = {
        r := ctx.pass(cur);

        r = cpp2::regex::alternative_token_matcher<char>::match(r.pos, ctx, other, func_7() , func_1(), cpp2::regex::no_reset(), func_2(), cpp2::regex::no_reset(), func_3(), cpp2::regex::no_reset(), func_4(), cpp2::regex::no_reset());
        return r;
      }
    }
    func_7: @struct type = {
      operator(): (this, cur: Iter, inout ctx: context, other) -> cpp2::regex::match_return<Iter> = {
        r := ctx.pass(cur);

        r = other(r.pos, ctx);
        return r;
      }
    }
  entry: (cur: Iter, inout ctx: context) -> cpp2::regex::match_return<Iter> = {
    ctx.set_group_start(0, cur);
    r := func_0()(cur, ctx, cpp2::regex::true_end_func());
    if r.matched { ctx.set_group_end(0, r.pos); }
    return r;
  }
get_named_group_index: (name) -> int = {
  _ = name;
  return -1;
}
}
  to_string: () -> std::string = { return "ABCD|DEFGH|EFGHI|A{4,}"; }
}

hsutter commented 3 months ago

Thanks for this data!

the template approach allows the compiler to optimize duplicated template instantiations away

Do you have a sense for whether this effect mostly applies to stress test cases with lots of regexes (e.g., the Haskell POSIX suite with >400 regexes), and/or very complex regexes? Do you still see this effect in examples that use only a handful of not-super-complex regexes?

For example, when cppfront used regexes, it had only four uses, each of which was a "^art|^bot|^car" kind of list. I think this was the longest:

        "^alignas|^alignof|^asm|^as|^auto|"
        "^bool|^break|"
        "^case|^catch|^char16_t|^char32_t|^char8_t|^char|^co_await|^co_return|"
        "^co_yield|^concept|^const_cast|^consteval|^constexpr|^constinit|^const|^continue|"
        "^decltype|^default|^double|^do|^dynamic_cast|"
        "^else|^enum|^explicit|^export|^extern|"
        "^float|^for|^friend|"
        "^goto|"
        "^if|^import|^inline|^int|^is|"
        "^long|"
        "^module|^mutable|"
        "^namespace|^noexcept|"
        "^operator|"
        "^private|^protected|^public|"
        "^register|^reinterpret_cast|^requires|^return|"
        "^short|^signed|^sizeof|^static_assert|^static_cast|^static|^switch|"
        "^template|^this|^thread_local|^throws|^throw|^try|^typedef|^typeid|^typename|"
        "^unsigned|^using|"
        "^virtual|^void|^volatile|"
        "^wchar_t|^while"

Two questions:

Any suggestions on how to optimize the code generation so that the compiler can optimize it faster?

I don't know if this would help, but one thought that comes to mind is the technique of overloads that delegate the heavy work to a single parameterized implementation (sometimes type-erased void*). I don't know if that might apply here?

Should we stay with the template version?

It still seems like it should be possible to do better. Maybe the above ideas help?

Aside: I think I'm getting to a decent first performance improvement on the cppfront side (for get_declaration_of, with more optimization to come), though I know that's not where the main perf concern is.

hsutter commented 3 months ago

Aside: I think I'm getting to a decent first performance improvement on the cppfront side (for get_declaration_of, with more optimization to come), though I know that's not where the main perf concern is.

I've now pushed the commit with that performance improvement: 6a2a809c8f172cf5facd8f7d81d4bd60194794cd

When you have a chance, I'm curious how much it helps your cppfront run time issue. Thanks again for the feedback and help to improve this.

MaxSagebaum commented 3 months ago

I did a test with the 6a2a809 update:

	gen	compile
Current (8bdd23d)	130.09	254.51
with performance update (6a2a809)	90.09	253.7

So a performance increase by a factor of 1.44. Very nice.

If you want to have a "quick" evaluation by yourself, you can merge with 8bdd23d and run time cppfront regression-tests/pure2-regex.cpp2

Do you have a sense for whether this effect mostly applies to stress test cases with lots of regexes (e.g., the Haskell POSIX suite with >400 regexes), and/or very complex regexes? Do you still see this effect in examples that use only a handful of not-super-complex regexes?

I also though about this. I will try to create a synthetic test for this. The regex you have posted would not benefit from such common optimizations.

I don't know if this would help, but one thought that comes to mind is the technique of overloads that delegate the heavy work to a single parameterized implementation (sometimes type-erased void*). I don't know if that might apply here?

I do not know if I understand you correctly here. But what you suggest would boil down to a library approach.

In @regex call the parsing function of the library or generate something the library can understand and which is already optimized.
When the regex is called "load" the library and provided the precompiled regex.

Is this what you had in Mind? This would go away from the concept of the compile time regular expressions.

hsutter commented 3 months ago

I will try to create a synthetic test for this. The regex you have posted would not benefit from such common optimizations.

In case you want to try the actual three regexes cppfront replaced with find_if over a `vector', here are the other two (from #514 diffs):

"^i8|^i16|^i32|^i64|^longdouble|^longlong|^u8|^u16|^u32|^u64|^ulong|^ulonglong|^ushort"

"^char16_t|^char32_t|^char8_t|^char|^double|^float|^int|^long|^short|^signed|^unsigned"

I do not know if I understand you correctly here. But what you suggest would boil down to a library approach.

I meant still at compile time, though. What I was wondering was whether the same things that are being folded by template instantiations could also be folded by generating code that delegates common implementations to shared compile-time (constexpr) functions. I don't have a concrete example though, just brainstorming.