pemistahl / grex

A command-line tool and Rust library with Python bindings for generating regular expressions from user-provided test cases
https://pemistahl.github.io/grex-js/
Apache License 2.0

Performance enhancements to enable large inputs #147

Open vitorsr opened 1 year ago

vitorsr commented 1 year ago

Hi,

I want to run this on a billion-sized, low-entropy, case-insensitive UTF-8 input whose entries vary in length up to approximately 256 bytes each.

The data comply with a specification format, but empirically they are very low entropy: they use only a very narrow subset of the specification and a very narrow subset of UTF-8 characters.

Think of URI schemes like mailto [1]. The standard makes ample provisions for the format, but empirically only a narrow subset is used.

However, DFA construction is very slow even for very small samples. In addition, even though I am not very proficient in Rust, I cannot see how to introduce multiprocessing (MP) or MPI constructs. DFA minimization also returns impractical expressions, and I cannot see how to relax, penalize, or restrict the size of the expression, coalesce (arbitrary) character ranges or groups, include prior knowledge of the superset expression defined by the specification format, or take advantage of data preorderings.

Finding an expression that matches legal and legitimate data based on the empirical set is useful for detecting diverging data.

As an example, if some subexpression is known by the specification to be constrained to \w characters, and all lowercase alphabet characters except one were present in the data, it would be fine to merge the ranges into [a-z]. The point here is that even though this subexpression may contain any \w character per the specification, only [a-z] is used in practice; and if some new datum had numbers, that would be rather suspicious given that a billion samples did not, even though the specification allows it.
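
To make this concrete, here is a minimal sketch of such a range merge (illustration only, not grex code; the candidate ranges and the coverage threshold are assumptions of mine):

use std::collections::BTreeSet;

// Widen a set of observed characters to a full range when coverage is high
// enough, e.g. 25 of the 26 lowercase letters are widened to [a-z].
fn widen(observed: &BTreeSet<char>, threshold: f64) -> String {
    for (lo, hi) in [('a', 'z'), ('A', 'Z'), ('0', '9')] {
        let range_len = (hi as u32 - lo as u32 + 1) as f64;
        let hits = observed.iter().filter(|&&c| (lo..=hi).contains(&c)).count();
        // Widen only if every observed character lies inside the range and
        // the range is almost fully covered by the observations.
        if hits == observed.len() && hits as f64 / range_len >= threshold {
            return format!("[{lo}-{hi}]");
        }
    }
    // Otherwise, list the observed characters verbatim.
    format!("[{}]", observed.iter().collect::<String>())
}

With a threshold of 0.9, the 25 observed letters above collapse to [a-z] instead of an unwieldy 25-character class.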

The constraints imposed by the DFA solution as implemented are too strict in this regard and I cannot see where to go from here.

To summarize, I would greatly appreciate it if you could point me towards scaling this library to work on large inputs.

I hope this tiny discussion interests you to some degree.

Vítor

[1] https://datatracker.ietf.org/doc/html/rfc6068#section-2

pemistahl commented 1 year ago

Hi Vítor, thanks for reaching out to me. This tool is simply not meant to scale to billion-sized inputs; I would not even know how to accomplish that. For DFA generation I use a separate library, so I'm limited by it. The tool's primary purpose is to help create regular expressions with a focus on correctness (which is hard enough), not performance. I will work on a new feature that allows creating various character classes and character ranges so that more generalized expressions are possible.

Feel free to open a PR if you see a way to improve performance or other aspects. Even though I'm relatively proficient in Rust, I'm far from being an expert.

vitorsr commented 1 year ago

Thanks Peter!

I appreciate you taking the time to reply and not dismissing this issue completely.

Apologies if this sounded absurd at first glance, but it should generally be feasible if there is a way to distribute the computation.

Using rayon [1], which is already a transitive dependency of this project because of criterion [2, 3], would be a great and easy start for introducing multiprocessing at the single-node level.

After that, the hard part would be distributing across more than one node. In scientific computing we generally use OpenMPI for that; I don't know the Rust equivalent, but there is mpi [4], which I guess would do.
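
The shape of such a run would be roughly the following (a sketch only, using the mpi crate's initialization API; the sharding and the gathering of per-shard expressions are left out):

use mpi::traits::*;

// Each rank would process its own shard of the input independently; the
// per-shard expressions would then be gathered and merged on rank 0.
fn main() {
    let universe = mpi::initialize().unwrap();
    let world = universe.world();
    let (rank, size) = (world.rank(), world.size());
    println!("rank {rank} of {size} would process shard {rank}");
}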

Regarding memory, it just must not grow beyond a single node's total memory, in both the sequential and the parallel code paths.

The hardest part, however, is determining at the project level what needs to change architecturally to accommodate parallel constructs.

As an example, if it were possible to create expressions for independent subsets of the input data and then merge those expressions, this would completely solve the problem even if DFA construction is slow. MPI is not even needed in this case, since each node can independently generate an expression from its own subset and the results can be merged afterwards. Optimally, it would also be nice to introduce some relaxation to lower the computational cost, such as merging ranges. A sketch of the idea follows.
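
Here is roughly what I mean, combining grex's RegExpBuilder with rayon (the chunk size and the anchor handling are illustrative assumptions, not a worked-out design):

use grex::RegExpBuilder;
use rayon::prelude::*;

// Build one expression per chunk in parallel, then merge the results into a
// single alternation. Chunks are independent, so the same scheme extends to
// multiple nodes via MPI.
fn merged_expression(test_cases: &[String]) -> String {
    let alternatives: Vec<String> = test_cases
        .par_chunks(10_000) // arbitrary chunk size
        .map(|chunk| {
            let expr = RegExpBuilder::from(chunk).build();
            // grex anchors each expression with ^ and $; strip the anchors
            // so the pieces can be recombined under a single pair.
            expr.trim_start_matches('^').trim_end_matches('$').to_string()
        })
        .collect();
    format!("^(?:{})$", alternatives.join("|"))
}

The merged expression is of course coarser than what a global minimization would produce, which is exactly where relaxations like range merging would help.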

I am aware this is too much and outside the scope of your project, so please feel free to read this issue simply as a reminder to consider larger-sized problems.

> I will work on a new feature that allows creating various character classes and character ranges so that more generalized expressions are possible.

That would be incredible, thank you!

> Feel free to open a PR if you see a way to improve performance or other aspects. Even though I'm relatively proficient in Rust, I'm far from being an expert.

If I do I'll be sure to send it your way.

[1] https://docs.rs/rayon/latest/rayon/index.html

[2] https://github.com/pemistahl/grex/blob/v1.4.1/Cargo.lock#L209-L233

[3] https://github.com/pemistahl/grex/blob/v1.4.1/Cargo.lock#L909-L919

[4] https://docs.rs/mpi/latest/mpi/index.html


pemistahl commented 1 year ago

Thank you for your suggestions. Looks interesting. I will leave this issue open and come back to it as soon as I continue working on performance improvements.

vitorsr commented 1 year ago

I changed the title so it looks less like clickbait to outside viewers.

Here are a few test cases to consider; I'll share more as they come along.

Case 1.1

$ python3 -c 'from itertools import combinations_with_replacement;'\
'_ = [print("".join(i)) for i in combinations_with_replacement("ab", 256)]' | grex -r -
^(?:a{255}b|a{254}b{2}|a{253}b{3}|a{252}b{4}|a{251}b{5}|a{250}b{6}|a{249}b{7}|a{248}b{8}|a{247}b{9}|a{246}b{10}|a{24...

Expected output: ^[ab]{256}$.
grex does not yield the expected output. Adding more characters to the repeated alphabet or increasing the size of the combinations chokes the program. Character set (or interval) detection is insufficient.

Case 1.2

$ python3 -c 'from itertools import combinations_with_replacement;'\
'_ = [print("".join(i)) for i in combinations_with_replacement("ab", 256)]' | grex -
^(?:a(?:a(?:a(?:a(?:a(?:a(?:a(?:a(?:a(?:a(?:a(?:a(?:a(?:a(?:a(?:a(?:a(?:a(?:a(?:a(?:a(?:a(?:a(?:a(?:a(?:a(?:a(?:a(?:...

Expected output: ^[ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab][ab....
grex does not yield the expected output. Without the repetition or substitution flags, the expression builder favors nesting to a fault.

Case 1.3

$ python3 -c 'from itertools import combinations_with_replacement;'\
'_ = [print("".join(i)) for i in combinations_with_replacement("ab", 256)]' | grex -w -r -
^\w{256}$

Expected output: ^\w{256}$.
grex yields the correct output. However, this should not take so long: after substitution, all entries are identical. Deduplication guards are needed to avoid choking the program; see the sketch below.
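
The kind of guard I have in mind, sketched on the caller side (inside grex it would have to run after the -w substitution to catch this exact case):

use std::collections::BTreeSet;

// Collapse duplicate test cases before building the DFA, so that a million
// identical lines cost no more than a single one.
fn dedup(cases: Vec<String>) -> Vec<String> {
    cases.into_iter().collect::<BTreeSet<_>>().into_iter().collect()
}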

Case 2

$ grex -r '<{{ aaa }}>' '<{{ aab }}>' '<{{ aac }}>' '<{{ abb }}>' '<{{ abc }}>' '<{{ acc }}>' '<{{ bbb }}>' '<{{ bbc }}>' '<{{ bcc }}>' '<{{ ccc }}>' '</{{ aaa }}>' '</{{ aab }}>' '</{{ aac }}>' '</{{ abb }}>' '</{{ abc }}>' '</{{ acc }}>' '</{{ bbb }}>' '</{{ bbc }}>' '</{{ bcc }}>' '</{{ ccc }}>'
^<(?:/\{{2} (?:(?:ab|b{2})c|a{2}[bc]|a(?:b{2}|c{2})|bc{2}|a{3}|b{3}|c{3}) \}{2}>|\{{2} (?:(?:ab|b{2})c|a{2}[bc]|a(?:b{2}|c{2})|bc{2}|a{3}|b{3}|c{3}) \}{2}>)$

Expected output: ^</?\{{2} [abc]{3} \}{2}>$.
grex does not yield the expected output. This case illustrates insufficient common prefix and suffix detection, as well as the issues pointed out in Cases 1.1 and 1.2.

These test cases also use only a single thread. Lots of operations, including (independent) iterations and sorting, could be exchanged for parallel counterparts for a free win; one example follows.
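
For instance, swapping a sequential sort for rayon's parallel counterpart is a one-line change (assuming rayon is promoted from a transitive to a direct dependency):

use rayon::prelude::*;

// Drop-in parallel counterpart of cases.sort_unstable(); rayon distributes
// the work across the available threads.
fn sort_cases(cases: &mut [String]) {
    cases.par_sort_unstable();
}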

scheidelerl commented 1 year ago

Hey, so far I like the tool very much, but if I may, I would like to give some more input on this topic.

I have a very simple data set: entries are initial digits followed by one or two lowercase non-digit characters, optionally followed by a Roman numeral from I to VI in parentheses.

I would offer the following regex. Surely this can also be done more simply, but here it is:

^[0-9]+[a-z]{0,2}(\s\((IV|V?I{0,3})\))?$

regex101 analysis.

Test file: test.txt

Here's what I got with grex -f test.txt:

^(?:(?:(?:13|69)6 \(II|440 \(II|725 \(II|(?:2(?:56j|85|98)|4(?:2(?:2b|9)|1[07]|4[38])|7(?:10e|02|22)|1(?:63|75|81)|3(?:43|12)|5(?:15|83)|6(?:0[0358]|1[34])|89[08]|502) \(I|(?:13|69)6 \(V|(?:(?:783|(?:(?:23|31)|44))|883) \(I)I?\)|(?:296|409|708|821) \(I(?:II?\)|\))|9(?:(?:57i|22) \(II?\)|9(?:8 \(I(?:II?\)|\))|9a|9|[0-7])?|5(?:7[ad-fh]|[0-689])|7(?:7[a-h]|[0-689])|57?|77?|1[0-9ab]|2[013-9]|6[013-9a]|[0348][0-9]|1|2|6|[0348])?|(?:255 \(I[IV]|(?:13|69)6 \(IV|440 \(IV|725 \(IV)\)|2(?:5(?:5 \(I\)|6(?:a[a-s]|[bce-ik-oq-y])|7[a-c]|[0-489])|1[0-57-9]|4[0-9a-c]|6[0-4689]|7[1-689]|8[0-46-8]|3[0-9]|9[0-579])|(?:13|69)6 \(I\)|440 \(I\)|7(?:2(?:5 \(I\)|[1346-9])|1(?:0(?:[ab][a-z]|[df-rt-z])|[1-57-9])|10c[a-i]|7(?:4(?:a[a-i]|[b-kw-z])|[0-35-8])|3(?:9[ac-g]|[0-8])|8(?:3[ab]|[124-69])|0[013-79])|1(?:0(?:1(?:9[a-j]|[0-8])|2[01346-9]|6[4-9a-f]|7[013-9])|1(?:0[0-7]|[1-9])|3(?:6b|[0-579])|6(?:2[bc]|[014-689])|2[0-25-9]|4[013-689]|5[235-9]|7[0-46-9a-c]|8[02-9])|(?:1(?:060|54|67)|277|35[58]|7(?:16|20|8[78]))[ab]|(?:1(?:0(?:22|63|72)|50)|265|3(?:4[12]|60)|4(?:24|35)|7(?:79|80))a|1(?:0(?:19?|2|6|7)?|10?|62?|2|3|4|5|7|8)?|2(?:5(?:6a?)?|1|4|6|7|8)?|7(?:1(?:0[ab]?)?|10c|7(?:4a?)?|39?|0|2|8)?|1(?:060|54|67)|1(?:0(?:22|63|72)|50)|105[013-9]|1(?:38|47)[a-e]|3(?:2(?:4[a-eg-l]|[0-35-9])|45[a-e]|8(?:4[abd-f]|6[a-e]|[0-357-9])|0[013-9]|3[0-79]|4[046-9a-k]|5[0-49]|6[1-9]|7[0-46-9]|1[013-79])|4(?:2(?:2[ac]|[0135-8])|0[0-8]|1[1-689]|3[0-46-9ab]|4[124-79])|108[02-9]|(?:1(?:42|51)|267)[a-d]|(?:270|3(?:38|56|75))[a-c]|79(?:0[ab]|[1-9])|8(?:3(?:6[a-fh-mq-z]|[0-57-9])|4(?:5[a-d]|8[a-c]|[0-4679ab])|5(?:0[ab]|[1-9])|7(?:5[a-l]|[0-46-9])|8(?:3[a-c]|[0-24-9])|0[02-8]|1[02-9]|2[02-9]|6[0-9a]|9[1-79a]|[a-h])|(?:1(?:0[0349]|9)|2[02]|39|4[5-9]|5[2-7]|6[2-68]|7[45])[0-9]|105|1(?:38|47)|3(?:24?|45|8(?:4|6)?|0|3|4|5|6|7)?|4(?:22?|0|1|3)?|108|1(?:42|51)|267|270|3(?:38|56|75)|277|35[58]|7(?:16|20|8[78])|790?|8(?:36?|4(?:5|8)?|50?|75?|0|1|2|6|9)?|1(?:0[0349]|9)|265|3(?:4[12]|60)|4(?:24|35)|7(?:79|80)|5(?:1[01346-9]|8[0-24-9]|9[0-9a-d])|6(?:0[124679]|1[0-25-9]|7[0-9c]|9[0-57-9])|76[0-9a]|50[013-9]|5(?:1|8)?|6(?:0|1|7|9)?|76|50|2[02]|39|4[5-9]|5[2-7]|6[2-68]|7[45])$

Using regex101's analysis, I tried to find out whether grex follows some kind of rule set for grouping, in order to provide helpful information for this enhancement, since I don't really have the skills to dig into the code right now. But I love regex. The pattern grex is trying to group is too wild for me to put into words.

The regex101 analysis indicates that grex has formed one large non-capturing group followed by 59 alternatives.

I also played around with -r, --min-repetitions, --min-substring-length, -d, and -w.

If this comment does not provide important or purposeful information, let me know and I will delete it. I just wanted to share my experiments.

pemistahl commented 1 year ago

Hi @scheidelerl, thank you for your valuable input, very much appreciated. :)

I know about the shortcomings of the current algorithm; that's why issues #48 and #122 already exist. I want to work on these, I'm just lacking free time at the moment. But the project is still active and I will continue working on it. I just can't tell you when.