pomsky-lang / pomsky

A new, portable, regular expression language
https://pomsky-lang.org
Apache License 2.0

Support extended POSIX regexes #11

Open Aloso opened 2 years ago

wy16W2pIilK1xgqN commented 2 years ago

There are many devices that only support ERE. We need it.

Aloso commented 2 years ago

@wy16W2pIilK1xgqN could you explain? What devices are they?

wy16W2pIilK1xgqN commented 2 years ago

A lot: routers and firewalls. For example, all MikroTik devices.

Aloso commented 2 years ago

The problem is that ERE doesn't support non-capturing groups, like

("hello"? | "world"+) "!!"

which compiles to

(?:(?:hello)?|(?:world)+)!!

For ERE, this would have to compile to

((hello)?|(world)+)!!

But this is not equivalent, because it changes the capturing group indexes. So we either need an option to never emit non-capturing groups when compiling to ERE, or we need to make the above code illegal, requiring capturing groups like this:

:(:("hello")? | :("world")+) "!!"

Although the outer capturing group could be avoided by "inlining" the exclamation mark:

(:("hello")? | :("world")+) "!!"
(hello)?!!|(world)+!!

But that could lead to an exponential increase in the size of the generated expression, so it is probably not a good idea.
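
The index shift described above can be checked with a small script. This is only a sketch: Python's `re` module is not an ERE engine, but it numbers capturing groups the same way (by opening parenthesis, left to right), so it is used here as a stand-in. The helper names `group_counts` and `same_language` are made up for illustration.

```python
import re

# The three regexes from the comment above.
NON_CAPTURING = re.compile(r"(?:(?:hello)?|(?:world)+)!!")  # default output
ALL_CAPTURING = re.compile(r"((hello)?|(world)+)!!")        # ERE-compatible
INLINED = re.compile(r"(hello)?!!|(world)+!!")              # "!!" inlined


def group_counts():
    """Number of capturing groups in each variant."""
    return NON_CAPTURING.groups, ALL_CAPTURING.groups, INLINED.groups


def same_language(samples):
    """True if all three variants accept or reject every sample alike."""
    return all(
        bool(NON_CAPTURING.fullmatch(s))
        == bool(ALL_CAPTURING.fullmatch(s))
        == bool(INLINED.fullmatch(s))
        for s in samples
    )


if __name__ == "__main__":
    print(group_counts())
    print(same_language(["hello!!", "worldworld!!", "!!", "hello"]))
```

All three variants accept the same strings, but they expose 0, 3 and 2 capturing groups respectively, which is exactly why they are not interchangeable for anyone who reads groups by index.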

Aloso commented 2 years ago

The other problem is that ERE does not allow escaping characters within a character class, so characters need to be rearranged:

['^' 'a'-'z' '\' '-' ']']

will have to be compiled to

[]^a-z\-]

Rules:
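
As a rough sketch of the rearrangement, assuming the usual POSIX bracket-expression conventions (`]` is literal only when it comes first, `-` is literal when it comes first or last, `^` is literal anywhere but first, and a backslash is an ordinary character): the function name `ere_bracket` and its signature are hypothetical, not Pomsky's actual implementation, and range endpoints are assumed to be ordinary characters.

```python
def ere_bracket(literals, ranges=()):
    """Build a POSIX bracket expression without using escapes.

    literals: iterable of single characters
    ranges:   iterable of (low, high) pairs with ordinary endpoints
    """
    rest = set(literals)
    has_rbracket = "]" in rest
    has_caret = "^" in rest
    has_hyphen = "-" in rest
    rest -= {"]", "^", "-"}

    parts = []
    if has_rbracket:
        parts.append("]")  # ']' must be the first item
    parts.extend(f"{lo}-{hi}" for lo, hi in ranges)
    parts.extend(sorted(rest))  # ordinary characters, '\' included

    if has_caret:
        if parts:
            parts.insert(1, "^")  # anywhere except the first position
        elif has_hyphen:
            parts = ["-", "^"]  # a leading '-' is literal, '^' follows it
            has_hyphen = False
        else:
            raise ValueError("a class containing only '^' has no bracket form")
    if has_hyphen:
        parts.append("-")  # '-' is literal in the last position
    if not parts:
        raise ValueError("empty character class")
    return "[" + "".join(parts) + "]"


if __name__ == "__main__":
    print(ere_bracket(["^", "\\", "-", "]"], [("a", "z")]))
```

For the class shown above, this produces the same `[]^a-z\-]` string.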

Aloso commented 2 years ago

Another problem: Codepoint/C doesn't work (it compiles to [\s\S], which is not supported in ERE), so what are the alternatives?

Aloso commented 1 year ago

The dot is now supported as of Pomsky 0.8. Rewriting the code for compiling character classes is in progress, with the goal of eventually supporting ERE. The only open question right now is how to handle non-capturing groups. Any input for this would be appreciated!

Possibilities are:

  1. disallow non-capturing groups when targeting ERE, requiring users to write :() instead

  2. add an option to silently convert non-capturing groups to capturing groups when targeting ERE; this could be made configurable, e.g. with -Xcapture=always

Both have disadvantages: option 1 makes Pomsky expressions less portable, while option 2 makes their behavior less predictable.
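
To make the two options concrete, here is a minimal sketch that post-processes an already generated regex string. It is not how Pomsky works internally (a real implementation would act on the syntax tree). The function `convert_groups` and the policy names `"error"` and `"always"` are hypothetical, loosely modeled on the `-Xcapture=always` idea above. It also assumes PCRE-style input where `]` inside a character class is always escaped.

```python
def convert_groups(regex, policy="error"):
    """Handle non-capturing groups for an ERE target.

    policy="error":  option 1, reject any non-capturing group
    policy="always": option 2, silently turn them into capturing groups
    """
    if policy not in ("error", "always"):
        raise ValueError(f"unknown policy: {policy!r}")
    out = []
    i = 0
    in_class = False
    while i < len(regex):
        ch = regex[i]
        if ch == "\\" and i + 1 < len(regex):
            out.append(regex[i:i + 2])  # copy escaped pairs untouched
            i += 2
            continue
        if in_class:
            if ch == "]":
                in_class = False
        elif ch == "[":
            in_class = True
        elif regex.startswith("(?:", i):
            if policy == "error":
                raise ValueError(f"non-capturing group at offset {i}")
            out.append("(")  # drop the '?:' marker
            i += 3
            continue
        out.append(ch)
        i += 1
    return "".join(out)


if __name__ == "__main__":
    print(convert_groups(r"(?:(?:hello)?|(?:world)+)!!", policy="always"))
```

With `policy="always"` the earlier example becomes `((hello)?|(world)+)!!`, and with `policy="error"` the same input is rejected, which mirrors the portability versus predictability trade-off.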

Aloso commented 3 weeks ago

Proposed solution:

A captures mode is added, which is enabled by default. To use non-capturing groups when targeting ERE, this mode must be disabled:

disable captures;
("hello"? | "world"+) "!!"

With this mode disabled, capturing groups (:() and :name()) are not allowed, but the compiler is allowed to produce capturing regex groups (assuming that they won't be used, since their indices do not correspond to anything).

Alternatively, capturing groups can be used in ERE, but compilation will fail if this results in a non-capturing group:

:(:("hello")? | :("world")+) "!!"

A possibility to make this more ergonomic is to allow numbering them explicitly, if you want to match a particular group:

:(:2("hello")? | :3("world")+) "!!"

Here, only the 2nd and 3rd capturing groups are numbered explicitly.
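
For reference, this is how the indices line up if the expression above compiles to `((hello)?|(world)+)!!`, as in the earlier example. This is a sketch using Python's `re` module as a stand-in for an ERE engine (the numbering by opening parenthesis is the same). The `:2(` / `:3(` syntax is only proposed here, and the helper `captures` is made up for illustration.

```python
import re

# Assumed ERE output for the explicitly numbered expression above.
ERE_OUTPUT = re.compile(r"((hello)?|(world)+)!!")


def captures(text):
    """Map each group index to its captured text, or None if no match."""
    m = ERE_OUTPUT.fullmatch(text)
    if m is None:
        return None
    return {i: m.group(i) for i in range(1, ERE_OUTPUT.groups + 1)}


if __name__ == "__main__":
    for sample in ("hello!!", "worldworld!!", "!!"):
        print(sample, captures(sample))
```

Group 2 only ever holds `hello` and group 3 only ever holds `world`, so the explicit numbers would act as an assertion that the compiler can check against the positions it actually emits.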