rchain / rchip-proposals

Where RChain improvement proposals can be submitted
Apache License 2.0
8 stars 5 forks source link

Regular expression string validation in Rholang #19

Open fabcotech opened 4 years ago

fabcotech commented 4 years ago

Introduction

We must have the possibility to validate strings in rholang.

This is a needed feature for Dappy main network (no urgent though).

rchain-names is a name system coded in rholang, Dappy uses this name system to manage name purchases/selling/renewals, just like for the domain name system.

Right now the rholang code names.rho checks only the String type, therefore anyone is able to register names like "aaa", "aaaa123", "aBC", "a?!..", "eeꀀ€", and maybe even non-latin characters "的的的".

So we must be able to validate strings in a regexp-like way.

Note: this proposal is dedicated to string validations and not string manipulation like .toLowerCase().

Danger

Regexp validation can be very much resource intensive, so if we provide a transparent rholang -> scala bridge, it will be a very dangerous situatin, moreover we do not have an easy way to price a regexp operation.

See https://snyk.io/blog/redos-and-catastrophic-backtracking/

Motivating Examples

Strings in rholang already have functions like .slice(0, 3). We cannot do a one-to-one bridge like the following "abc".test(/[a-z]/g)

So my proposition is a limited set of regexp patterns, and therefore validation capabilities much more limited than regexp, but for which a pricing mechanism is easier to setup.

"abc".test(["a-z"]) // true
"abcD".test(["a-z"]) // false
"abcD".test(["a-z", "A-Z"]) // true
"abcD8".test(["a-z", "A-Z"]) // false
"abcD8".test(["a-z", "A-Z", "0-9"]) // true
// from now the patterns that are not one of "a-z", "A-Z", "0-9" must be a litteral match
"abcD8?".test(["a-z", "A-Z", "0-9"]) // false
"abcD8?".test(["a-z", "A-Z", "0-9", "?"]) // true
// etc...

There is no way to validate that a string starts with x or ends with x. Or that a string respects a certain pattern followed by another pattern, just dead-simple string validation.

Design

Just like every changes in rholang, this implies a "hard-fork" and that all nodes upgrade. It could be triggered at a specific block height, independently of release number.

Drawbacks

We must be very careful https://snyk.io/blog/redos-and-catastrophic-backtracking/

Alternatives

"aa".containsLowercases() // true
"aa".containsNonAscii() // false
"aâ".containsNonAscii() // true
"aa".containsUppercases() // false
"aa".containsNumbers() // false
"aa12".containsNumbers() // true
"aa12".containsSpecialCharacters() // false
"aa12?".containsSpecialCharacters() // true
jimscarver commented 3 years ago

I suggest file pattern (glob) matching to avoid redos attacks

9rb commented 3 years ago

According to @nzpr we need to consider how this interacts with cost accounting. i.e. can we cost account for this easily ?

fabcotech commented 3 years ago

I think it is not that difficult to account.

If we do it the first way "abcD".test(["a-z", "A-Z"]) then the cost could be [cost proportionnal to length] * [number of validations]. In this case their are 2 regexp validations to perform.

If we do it the second way "abcD".containsUppercases() then their is a cost proportionnal to string length each time the function is called.

How about that ?