moodmosaic / Fare

Port of Java dk.brics.automaton and xeger, mostly used for generating strings that match a specific regular expression.
http://www.brics.dk/automaton/
MIT License

a{0,50} generate short string #54

Open gianmarialari opened 4 years ago

gianmarialari commented 4 years ago

If I write a regular expression like a{0,100}, I expect Xeger.Generate() to return any one of the possible matching sequences ("", "a", "aa", "aaa", "aaaa" ...... "aaaaaaaaaaaaa[....]aaaaaaaaaaaaaa").

But Xeger.Generate() almost never generates a sequence longer than 15 characters. On Stack Overflow ("Issue generating multiple occurrence with Fare/Xeger") they told me:

It looks like Xeger is randomly selecting possible transitions at each step and then appending the string matching that transition to the result. For your regex, once the matching string has 1 a, there are two possible allowed transitions: "Add another a" or "End of string". [...]

So, if I understood correctly, this means that the probability of getting a long string is tremendously low.
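To put a number on that, here is a minimal Java sketch of the model described in the quoted answer. It assumes that every step chooses between "stop" and "add another a" with equal probability; that is an assumption taken from the answer, not a reading of Xeger's source, and the class and method names are made up for illustration.

```java
import java.util.Random;

// Hypothetical model of the behaviour described in the quoted answer.
// This is NOT Xeger's actual code; all names here are illustrative.
class LengthBiasSketch {

    // Simulates the length of one generated string for a{0,max}:
    // after each step we either stop or append another 'a',
    // each with probability 1/2 (assumption).
    static int simulateLength(Random rng, int max) {
        int length = 0;
        while (length < max && rng.nextBoolean()) {
            length++;
        }
        return length;
    }

    // Under that model, reaching at least k characters (k <= max)
    // requires k "continue" choices in a row.
    static double probabilityAtLeast(int k) {
        return Math.pow(0.5, k);
    }

    public static void main(String[] args) {
        Random rng = new Random();
        int longest = 0;
        for (int i = 0; i < 30; i++) {
            longest = Math.max(longest, simulateLength(rng, 100));
        }
        System.out.println("Longest of 30 simulated strings: " + longest);
        System.out.println("P(length >= 15) = " + probabilityAtLeast(15));
    }
}
```

Under this model P(length >= 15) is 0.5^15, roughly 3 in 100,000, which would be consistent with the short strings reported above.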

Is there any simple way to make a{0,100} really generate sequences between 0 and 100 characters long? (I mean, with similar frequency:)). Thank you, g.

moodmosaic commented 4 years ago

Thank you for reporting this, @gianmarialari.

This looks like a xeger issue to me, and it'd be better if we could ping the maintainer(s) of the current upstream/Java version. I believe they can be found at https://github.com/bluezio/xeger.

I'd be really interested in hearing any thoughts on this. Perhaps this is something that has already been improved, so we can just update the C#/.NET fork with what has changed over there.

gianmarialari commented 4 years ago

Thank you for your answer!

I also ran some tests with the Java library. This is the code I used:

import nl.flotsam.xeger.Xeger;

class ExampleProgram {
    public static void main(String[] args) {
        String regex = "a{0,100}";
        Xeger generator = new Xeger(regex);
        for (int i = 0; i < 30; ++i) {
            String result = generator.generate();
            System.out.println(result);
        }
    }
}

But the longest string generated is always ~10 characters.

I will try to contact them at https://github.com/bluezio/xeger.

Thank you, Gianmaria

moodmosaic commented 4 years ago

Great, just saw it, https://github.com/bluezio/xeger/issues/3. Let's see what we get back.

gianmarialari commented 4 years ago

Ciao Moodmosaic,

I am posting here a small program that transforms a string like "a{n1 ,m1} bill{ n2, m2} carl{3}" into "a{r1} bill{r2} carl{3}", where each r is a random number between n and m (inclusive).

using System;
using System.Text.RegularExpressions;

namespace RegexQuantifier
{
    class Program
    {
        static string ConvertQuantifier(string input)
        //Replaces every occurrence of "{n,m}" with "{r}", where r=rnd(n,m).
        {
            string result = input;
            foreach (Match match in Regex.Matches(input, pattern: $@"\{{\s*\d+\s*,\s*\d+\s*\}}"))
            {
                string quantifier = match.Groups[0].Value;
                int min = int.Parse(Regex.Match(input: quantifier, pattern: $@"\d+").Value);
                int max = int.Parse(Regex.Match(input: quantifier, pattern: $@"\d+").NextMatch().Value);
                int r = new Random().Next(min, max + 1);
                result = Regex.Replace(input: result, pattern: quantifier, replacement: "{" + r.ToString() + "}");
            }
            return result;
        }
        static void Main(string[] args)
        {
            string input = "a{10 ,20} bill{ 0,   20} carl{3}";
            Console.WriteLine("Source string: " + input);
            Console.WriteLine("Output string: " + ConvertQuantifier(input));
        }
    }
}

My program probably contains a few errors, it is surely not efficient, and it could surely be better written, but I hope others can enjoy it.

Thank you Moodmosaic. G.

moodmosaic commented 4 years ago

Thank you, @gianmarialari :+1:

gianmarialari commented 4 years ago

@moodmosaic, here is a new version of the previous program.

The ConvertQuantifiers function is written in a more modular way and is hopefully a bit clearer. More importantly, it fixes a bug. Unfortunately I'm not a regex expert, so I can't say whether it works with every regex string, but if I understood the regex quantifier syntax correctly, it should :)

I hope others will find it useful.

using System;
using System.Text.RegularExpressions;

namespace RegexQuantifier
{
    class Program
    {
        // Shared generator: on .NET Framework, separate "new Random()" instances created
        // in quick succession can be seeded identically and return the same values.
        static readonly Random Rng = new Random();

        // Replaces every {n,m} quantifier in a regex pattern with {r}, where r=rnd(n,m).
        static string ConvertQuantifiers(string input)
        {
            // Transforms one matched {n,m} into {r}, with r drawn from n..m (inclusive).
            string TransformMinMaxToR(Match minMax)
            {
                int min = int.Parse(minMax.Groups[1].Value);
                int max = int.Parse(minMax.Groups[2].Value);
                int r = Rng.Next(min, max + 1); // the upper bound of Next is exclusive
                return "{" + r + "}";
            }

            // The evaluator runs once per match, so the matched text needs no escaping
            // and identical quantifiers each get their own r.
            return Regex.Replace(input, @"\{\s*(\d+)\s*,\s*(\d+)\s*\}", TransformMinMaxToR);
        }

        static void Main(string[] args)
        {
            Console.WriteLine("Given a regex pattern, it replaces each quantifier {n,m} with {r}, where r=rnd(n,m)");
            Console.WriteLine("Example:");
            string input = "a{10 ,20} bill{ 0,   20} carl{3} (a[bc]{3,40})?xyz|ghi{0,10}.*hello";
            Console.WriteLine("Input : " + input);
            Console.WriteLine("Output: " + ConvertQuantifiers(input));
        }
    }
}
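
Since the same behaviour shows up in the Java library, the same preprocessing idea can be sketched in Java as well. This is only an illustration: QuantifierSketch and convertQuantifiers are hypothetical names, not part of Xeger or Fare, and the sketch assumes that min <= max in every quantifier.

```java
import java.util.Random;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Xeger or Fare.
class QuantifierSketch {

    // Matches {n,m}, tolerating spaces around the numbers as in the examples above.
    private static final Pattern MIN_MAX =
            Pattern.compile("\\{\\s*(\\d+)\\s*,\\s*(\\d+)\\s*\\}");

    // Replaces each {n,m} with {r}, where r is drawn uniformly from n..m (inclusive).
    static String convertQuantifiers(String regex, Random rng) {
        Matcher matcher = MIN_MAX.matcher(regex);
        StringBuffer result = new StringBuffer();
        while (matcher.find()) {
            int min = Integer.parseInt(matcher.group(1));
            int max = Integer.parseInt(matcher.group(2));
            int r = min + rng.nextInt(max - min + 1); // assumes min <= max
            matcher.appendReplacement(result, Matcher.quoteReplacement("{" + r + "}"));
        }
        matcher.appendTail(result);
        return result.toString();
    }

    public static void main(String[] args) {
        String input = "a{10 ,20} bill{ 0,   20} carl{3}";
        System.out.println("Input : " + input);
        System.out.println("Output: " + convertQuantifiers(input, new Random()));
    }
}
```

The rewritten pattern can then be handed to new Xeger(pattern), as in the Java test earlier in this thread. Note that the sketch rewrites every {n,m} it sees, including ones inside character classes or after a backslash, so it may not be safe for every regex.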
moodmosaic commented 4 years ago

That's great! Perhaps we can add some examples in the library!

gianmarialari commented 4 years ago

If you think I can help, please let me know; I will be glad to. Ciao, g.