neo-project / neo

NEO Smart Economy
MIT License
3.47k stars 1.03k forks source link

Add Regular Expressions #2922

Open cschuchardt88 opened 1 year ago

cschuchardt88 commented 1 year ago

Summary or problem description We need a way to validate strings in a smart contract. Regular expression is the best way to do this and would be the cheapest.

Do you have any solution you want to propose? Add RegEx method to StdLib library for cheap and easy string validation. also have predefined and compiled ones like URL, Numbers, Letters and numbers., to name a few.

Neo Version

Where in the software does this update applies to?

Jim8y commented 1 year ago

i'd love to have it, but it is hard to measure the computational effort required for executing regex operations. It could be very complex and thus be used to DOS the network. Some predefined patterns maybe.

cschuchardt88 commented 1 year ago

Dotnet only compiles Regular expression once. That's why i say have some predefined ones. Also they can be cached as well.

Read more about it here https://learn.microsoft.com/en-us/dotnet/standard/base-types/compilation-and-reuse-in-regular-expressions#compiled-regular-expressions

Jim8y commented 1 year ago

I mean, we need to meaure how long it takes to match, predefined url number things do not sounds like a core-job. What i am thinking is somthing like startwith endwith contains, onlycontains, notcontains

cschuchardt88 commented 1 year ago

substring indexof and format would be nice too. But i see what your saying. Guess it would depend on the length of the string for regex? But couldn't we just burn gas for regex? For example regex would burn certain amount of gas every millisecond for X milliseconds for given length of string and would timeout after x milliseconds if doesn't finish. Also we could make it not run in dynamic scripts. If it would exceed the amount of time for regex to complete you would get an RegexTimeoutException.

Jim8y commented 1 year ago

millisecond very depends on the machines, we can not to it that way. The reason i am thinking break the regex down to couple of commands is that each of the command is easier to be measured in gas consumption.

shargon commented 1 year ago

i'd love to have it, but it is hard to measure the computational effort required for executing regex operations. It could be very complex and thus be used to DOS the network. Some predefined patterns maybe.

Totally agree, difficult to pay for it

cschuchardt88 commented 1 year ago

@shargon can we atleast have what @Liaojinghui said for strings? startswith endswith contains compare format indexof and substring

cschuchardt88 commented 2 months ago

@shargon @Jim8y

Got some good news about this. When you compile regex in dotnet it generates liasm code (which is compiled dotnet). So it can be measured easily and not DDOS able. For example if you do a search for abc regex expression, it will use/generate String.IndexOf for the expression. Also you can add timeouts. Find out more about it here https://youtu.be/ptKjWPC7pqw?t=1609

DDOS ATTACK

using System.Diagnostics;
using System.Text.RegularExpressions;

Stopwatch sw = new();

for (int i = 1; i <= 40; i++)
{
    var r = new Regex($@"^(\w\d|\d\w){{{i}}}$", RegexOptions.None, TimeSpan.FromSeconds(2));

    string input = string.Concat(Enumerable.Repeat("11", i)) + "1";

    sw.Restart();
    r.IsMatch(input);
    sw.Stop();

    Console.WriteLine("{0}: {1}", i, sw.Elapsed);
}

OUTPUT

1: 00:00:00.0084212
2: 00:00:00.0000213
3: 00:00:00.0000101
4: 00:00:00.0000060
5: 00:00:00.0000338
6: 00:00:00.0000146
7: 00:00:00.0000217
8: 00:00:00.0000458
9: 00:00:00.0000749
10: 00:00:00.0002227
11: 00:00:00.0006073
12: 00:00:00.0006753
13: 00:00:00.0012552
14: 00:00:00.0023305
15: 00:00:00.0049467
16: 00:00:00.0094942
17: 00:00:00.0185833
18: 00:00:00.0599559
19: 00:00:00.0778930
20: 00:00:00.2526117
21: 00:00:00.3932629
22: 00:00:00.6679340
23: 00:00:01.2998858
Unhandled exception. System.Text.RegularExpressions.RegexMatchTimeoutException: The Regex engine has timed out while trying to match a pattern to an input string. This can occur for many reasons, including very large inputs or excessive backtracking caused by nested quantifiers, back-references and other factors.
   at System.Text.RegularExpressions.RegexRunner.<CheckTimeout>g__ThrowRegexTimeout|25_0()
   at System.Text.RegularExpressions.RegexInterpreter.Backtrack()
   at System.Text.RegularExpressions.RegexInterpreter.TryMatchAtCurrentPosition(ReadOnlySpan`1 inputSpan)
   at System.Text.RegularExpressions.RegexInterpreter.Scan(ReadOnlySpan`1 text)
   at System.Text.RegularExpressions.Regex.ScanInternal(RegexRunnerMode mode, Boolean reuseMatchObject, String input, Int32 beginning, RegexRunner runner, ReadOnlySpan`1 span, Boolean returnNullIfReuseMatchObject)
   at System.Text.RegularExpressions.Regex.RunSingleMatch(RegexRunnerMode mode, Int32 prevlen, String input, Int32 beginning, Int32 length, Int32 startat)
   at System.Text.RegularExpressions.Regex.IsMatch(String input)
   at Program.<Main>$(String[] args) in C:\Users\Chris\source\repos\ConsoleApp1\ConsoleApp1\Program.cs:line 13

@roman-khimov does golang do something similar? If it can't maybe we shouldn't add it?

Jim8y commented 2 months ago

I have already supported string methods like: startwith endwith, contains, onlycontains, notcontains

but regular expression is still can not be supported in general.

roman-khimov commented 2 months ago

These time limits look nice in general, but 2s on one machine is 10s on another and we need some strictly reproducible results. It's also hard to establish any fair GAS cost for this call, the difference in execution time can de huge, but we can't rely on time for the cost, we can't rely on length either.