I want to ask whether this proposal will include any strategies to combat/prevent super-linear backtracking caused by sequences?
Right now, Unicode property escapes (UPE) are atomic (as are all characters, character sets, and character classes). With this proposal, UPEs evaluating to sequence sets will require backtracking.
This is a big problem because it is not obvious whether a UPE evaluates to a character set (no backtracking) or a sequence set (backtracking). This makes it incredibly easy for developers to (accidentally) create unsafe regexes. This can then be exploited in an ReDoS attack.
Example 1:
Suppose a Unicode property Foo equal to the sequence set xyz|xy|z and a JS RegExp /\p{Foo}+$/u. If this regex gets compiled into something equivalent to /(?:xyz|xy|z)+$/u, then the regex will be susceptible to exponential backtracking for all input strings of the form "xyz".repeat(n) + "a".
Human detection
It is practically impossible for a human to detect whether a UPE will cause super-linear backtracking. Even if a human knows that a particular UPE evaluates to a set of sequences, without knowing the exact set of sequences, it is impossible to determine whether the UPE will cause super-linear backtracking.
Malicious entities can easily exploit this fact, especially in the large JS open-source ecosystem. A malicious contributor can hide exponential backtracking using sequences. Since this is partially impossible for a human to detect, the unsafe regex will likely go unnoticed.
Atomic UPEs
Please note that UPEs evaluating to sequence sets cannot be made atomic. Backtracking is required for correctness, AFAICS.
Example 2:
The regex /\p{Foo}z/u has to be equivalent to /(?:xyz|xy|z)z/u in order to accept the string "xyz". The UPE cannot be atomic.
I want to ask whether this proposal will include any strategies to combat/prevent super-linear backtracking caused by sequences?
Right now, Unicode property escapes (UPE) are atomic (as are all characters, character sets, and character classes). With this proposal, UPEs evaluating to sequence sets will require backtracking.
This is a big problem because it is not obvious whether a UPE evaluates to a character set (no backtracking) or a sequence set (backtracking). This makes it incredibly easy for developers to (accidentally) create unsafe regexes. This can then be exploited in an ReDoS attack.
Example 1: Suppose a Unicode property
Foo
equal to the sequence setxyz|xy|z
and a JS RegExp/\p{Foo}+$/u
. If this regex gets compiled into something equivalent to/(?:xyz|xy|z)+$/u
, then the regex will be susceptible to exponential backtracking for all input strings of the form"xyz".repeat(n) + "a"
.Human detection
It is practically impossible for a human to detect whether a UPE will cause super-linear backtracking. Even if a human knows that a particular UPE evaluates to a set of sequences, without knowing the exact set of sequences, it is impossible to determine whether the UPE will cause super-linear backtracking.
Malicious entities can easily exploit this fact, especially in the large JS open-source ecosystem. A malicious contributor can hide exponential backtracking using sequences. Since this is partially impossible for a human to detect, the unsafe regex will likely go unnoticed.
Atomic UPEs
Please note that UPEs evaluating to sequence sets cannot be made atomic. Backtracking is required for correctness, AFAICS.
Example 2: The regex
/\p{Foo}z/u
has to be equivalent to/(?:xyz|xy|z)z/u
in order to accept the string"xyz"
. The UPE cannot be atomic.