pemistahl / grex

A command-line tool and Rust library with Python bindings for generating regular expressions from user-provided test cases
https://pemistahl.github.io/grex-js/
Apache License 2.0

Confused about the actual use case #242

Open emrakyz opened 2 months ago

emrakyz commented 2 months ago

First of all, thanks a lot for this tool.

The idea looks really good and promising, but I couldn't work out what the actual use case is. The documentation lacks varied, interesting examples.

I tried to reproduce some patterns I had already written by hand, just to test grex.

For example, I use the pattern below to extract DOIs from various inputs and/or PDFs: sed -n -E 's/.*((DOI|doi)((\.(org))?\/?|:? *))([^: ]+[^ .]).*/doi:\6/p; q'

The actual pattern is this: .*((DOI|doi)((\.(org))?\/?|:? *))([^: ]+[^ .]).*

The part I want to capture is the 6th capture group, which is the DOI itself: ([^: ]+[^ .])

As an example, it captures this part at the end: 10.36227/techrxiv.22659061.v1
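To make the behaviour of that pattern concrete, here is a minimal Python sketch that applies the same regex and reads out group 6. The sample inputs are hypothetical, built around the DOI quoted above, and the helper name `extract_doi` is made up for illustration. Python's `re` engine is a backtracking matcher rather than the POSIX ERE engine behind `sed -E`, so treat this as an approximation of the sed command, not an exact equivalent.

```python
import re

# The pattern from the sed command above, unchanged.
DOI_PATTERN = re.compile(r".*((DOI|doi)((\.(org))?\/?|:? *))([^: ]+[^ .]).*")

# Hypothetical inputs (assumption): three ways the same DOI might be written.
samples = [
    "doi:10.36227/techrxiv.22659061.v1",
    "DOI: 10.36227/techrxiv.22659061.v1.",
    "https://doi.org/10.36227/techrxiv.22659061.v1",
]


def extract_doi(text):
    """Return group 6 prefixed with 'doi:', or None when nothing matches."""
    match = DOI_PATTERN.search(text)
    return f"doi:{match.group(6)}" if match else None


results = [extract_doi(s) for s in samples]
for sample, result in zip(samples, results):
    print(f"{sample!r} -> {result}")
```

All three inputs should normalise to the same `doi:` string, including the one with a trailing period, because the final `[^ .]` refuses to end the capture on a dot.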

I put many valid cases, one per line, into a test.txt file (DOIs in the different forms covered by the regex above).

I ran grex -r -g -c --no-start-anchor -f "test.txt". I knew it could not give me a pattern similar to my original one, but the output differed far more than I expected. I got an extremely long regex that also matched unwanted parts as literals (false-positive constants), which would break the command in my actual use case. This is understandable, but it seems impossible to avoid without a practically unlimited number of examples that differ from each other in everything except the actual constants.

I also tried simpler cases, using made-up examples in the test file, to reproduce other patterns I had written before. One example is: ^ *([0-9]+).*\s{2,}(.+)$
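As a quick illustration of what this second pattern is meant to capture, the sketch below runs it in Python on one made-up line. The input line is an assumption for demonstration only; it is not taken from the actual test file.

```python
import re

# The hand-written pattern quoted above, unchanged.
LINE_PATTERN = re.compile(r"^ *([0-9]+).*\s{2,}(.+)$")

# Hypothetical input (assumption): a leading number, some text,
# a run of spaces, then a trailing field.
line = "  42 some title    final value"

match = LINE_PATTERN.search(line)
number, tail = match.group(1), match.group(2)
print(number, "|", tail)
```

Group 1 holds the leading number and group 2 holds whatever follows the last run of two or more whitespace characters. That intent (a number, a gap, a tail) is exactly the part that is hard to convey through examples alone.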

But grex always outputs a pattern that is very long and nothing like my own. The output is not "wrong" in a technical sense, but it is not usable for anything practical. It is not even a suitable starting point for manual editing. This was the case in my tests regardless of sample size.

Even in very basic cases the output is not usable, because we cannot be expressive enough. Without a way to express intent, the tool can only produce patterns that become usable with an almost unlimited set of examples covering every possibility.
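The overfitting concern can be sketched with a deliberately naive generator. This is not grex's algorithm and says nothing about its internals; it is a hypothetical stand-in (`naive_regex_from_examples` and the sample strings are made up) that only shows why a pattern derived purely from examples tends to match those examples and little else.

```python
import re


def naive_regex_from_examples(examples):
    """Join the escaped examples into one anchored alternation (deliberately naive)."""
    alternatives = "|".join(re.escape(e) for e in sorted(set(examples)))
    return f"^(?:{alternatives})$"


# Hypothetical training examples (assumption), not real DOIs.
examples = ["doi:10.1000/abc1", "DOI: 10.1000/xyz2"]
pattern = re.compile(naive_regex_from_examples(examples))

# Every training example matches by construction.
matches_all_examples = all(pattern.search(e) is not None for e in examples)

# An unseen input of the same shape does not match.
matches_unseen = pattern.search("doi:10.1000/new3") is not None

print(pattern.pattern)
print(matches_all_examples, matches_unseen)
```

Nothing in the examples marks "doi:" as the constant and the identifier as the variable part, so the generator has no basis for generalising in the right direction.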

The problem, as far as I understand, is that we cannot be expressive enough about constants, variables, wanted parts, unwanted parts, and the main pattern that should be captured. Without these, I could not find a proper use case, but I really want to use this tool in a real scenario. How can we be more expressive, so that we can at least automate creating a base pattern to work on?

I also tried to build the final pattern segment by segment, but that failed in the same way.

Writing regex is fairly easy for small, simple tasks. What I initially had in mind was that this tool would help us create regexes for very complex patterns more easily, efficiently, and correctly. Right now I feel like we have a tool that is very powerful and robust but useless. Is it just "experimental", or a kind of base that future tools will build on?

Could you please tell me what I am doing wrong? What is the best practice for solving a real problem with grex? What types of problems is grex best suited to?

I probably misunderstood the tool or made a mistake regarding the intended use case.