rust-lang / regex

An implementation of regular expressions for Rust. This implementation uses finite automata and guarantees linear time matching on all inputs.
https://docs.rs/regex
Apache License 2.0
3.47k stars 435 forks source link

Named capture group syntax: `(?<name>exp)` #955

Closed 01mf02 closed 1 year ago

01mf02 commented 1 year ago

Would it be possible to support in addition to the existing syntax (?P<name>exp) for named capture groups also the syntax (?<name>exp)?

My use case for this is that I am currently writing a jq clone called jaq. Recently, I have added support for regular expressions to jaq using the regex crate, which works very well. However, because jq supports only the (?<name>exp) syntax (because of the oniguruma library) and jaq only the (?P<name>exp) syntax (because of the regex crate), it is currently impossible to write regexes with named capture groups that are valid in both jaq and jq.

Apart from this, the (?<name>exp) syntax seems reasonably popular, so apart from my special use case, it might make sense to add support for this syntax. :)

BurntSushi commented 1 year ago

To get this out of the way: this would be a backwards compatible change because currently all forms of (?<name>exp) are invalid syntax due to < being interpreted as a flag. Since < is of course not a flag, it always fails.

I think the original reason why I didn't do this was to 1) match RE2 syntax and 2) I didn't want two ways of doing the same thing. But I do think my stance has softened somewhat over time on this. Another related example is that I plan to relax the escaping rules in order to make the differences between at least the surface level syntax smaller between other regex engines. (For example, right now \/ is forbidden.)

I'll also say though that what you're facing here is a surface level problem. There are assuredly many other differences between this regex engine and Oniguruma. How will you deal with those? If compatibility is your ultimate goal, then you probably just need to use Oniguruma itself. Or do you see this more as a "let's get some compatibility, but not all of it" sort of situation? The problem there is that there may be many incompatibilities that are totally silent. (I don't have an Oniguruma environment that I can easily test with at the moment.)

I'll note that RE2 specifically only implements (?P<name>exp) syntax and there is this comment in the parser:

  // Check for named captures, first introduced in Python's regexp library.
  // As usual, there are three slightly different syntaxes:
  //
  //   (?P<name>expr)   the original, introduced by Python
  //   (?<name>expr)    the .NET alteration, adopted by Perl 5.10
  //   (?'name'expr)    another .NET alteration, adopted by Perl 5.10
  //
  // Perl 5.10 gave in and implemented the Python version too,
  // but they claim that the last two are the preferred forms.
  // PCRE and languages based on it (specifically, PHP and Ruby)
  // support all three as well.  EcmaScript 4 uses only the Python form.
  //
  // In both the open source world (via Code Search) and the
  // Google source tree, (?P<expr>name) is the dominant form,
  // so that's the one we implement.  One is enough.

I am quite sympathetic to this line of reasoning personally. And chasing this sort of "let's just keep adding alternative forms of everything until we capture all the different ways other regex engines do things" will lead us into undesirable territory.

I also wonder whether you could easily work around this by looking for a (?< and replacing it with a (?P<. You would need to deal with escapes, but I think that might be it? I don't think you'd need to write a full parser. I might be wrong though, I haven't given this a lot of thought.

I'm undecided on this personally. @junyer do you have any thoughts here?

junyer commented 1 year ago

It sounds like you and I (and @rsc) are aligned here at least philosophically. And now speaking pragmatically, adding support for (?<name>exp) – or anything else – to RE2 shouldn't happen without initiating a three-phase commit protocol with the Go regexp package, RE2/J et cetera. I won't presume to speak for the Rust regex crate, of course, but various Google-related projects won't ever support this unless someone herds those cats successfully... and that someone is very unlikely to be me.

rsc commented 1 year ago

I still basically agree with what I wrote in the RE2 comment long ago. I could change my mind given evidence of (1) significant usage of .NET forms or (2) significant environments that only support the .NET forms. It sounds like jq might be one such environment. Reading the other link, maybe Java or Boost has (?...) without (?P...)? It's unclear to me.

On the surface syntax issue and \/, RE2 and Go follow the general convention originally set by egrep of backslash-letter being special (so you must know what it means or reject it) and backslash-punctuation always being literal punctuation. So \/ and \_ fall out of that rule without being handled explicitly. The code in RE2 looks like:

  if (c < Runeself && !isalpha(c) && !isdigit(c)) {
    // Escaped non-word characters are always themselves.
    // PCRE is not quite so rigorous: it accepts things like
    // \q, but we don't.  We once rejected \_, but too many
    // programs and people insist on using it, so allow \_.
    *rp = c;
    return true;
  }
01mf02 commented 1 year ago

Thanks for your very detailed answers.

I'll also say though that what you're facing here is a surface level problem. There are assuredly many other differences between this regex engine and Oniguruma. How will you deal with those? If compatibility is your ultimate goal, then you probably just need to use Oniguruma itself. Or do you see this more as a "let's get some compatibility, but not all of it" sort of situation? The problem there is that there may be many incompatibilities that are totally silent. (I don't have an Oniguruma environment that I can easily test with at the moment.)

Yes, I see the situation as "let's get some compatibility, but not all of it". (Using Oniguruma from Rust is not really an option for me, all the more because I already have an implementation of regexes using the regex crate.) At least most regexes that I have seen in jq snippets in the wild are fairly simple, so I believe that regex should interpret them the way a jq user expects it to. By far the largest problem, however, are named capture groups, because there are some jq functions that crucially depend on them, in particular capture. Without the named capture group syntax, it is not possible to use capture the same way in jq and jaq.

I am quite sympathetic to this line of reasoning personally. And chasing this sort of "let's just keep adding alternative forms of everything until we capture all the different ways other regex engines do things" will lead us into undesirable territory.

I agree in principle; however, when searching for "regex named capture group", among the first four matches, all mention the syntax (?<, whereas only one site (the first one) additionally (not exclusively) mentions the existence of (?P<. This at least suggests that there might not be such a strong consensus towards the syntax (?P< as the one and only syntax to rule them all.

I also wonder whether you could easily work around this by looking for a (?< and replacing it with a (?P<. You would need to deal with escapes, but I think that might be it? I don't think you'd need to write a full parser. I might be wrong though, I haven't given this a lot of thought.

There might be a lot of tricky cases to handle. Consider:

Given that I am not a regex expert, I would not trust myself to get this right.

And now speaking pragmatically, adding support for (?<name>exp) – or anything else – to RE2 shouldn't happen without initiating a three-phase commit protocol with the Go regexp package, RE2/J et cetera.

Why is there such a need for synchronisation? Is there some kind of agreement between the Rust regex crate and RE2 to implement precisely the same syntax?

Would it perhaps be possible to have some opt-in option, for example in ParserBuilder, to enable parsing (?< syntax?

01mf02 commented 1 year ago

I could change my mind given evidence of (1) significant usage of .NET forms or (2) significant environments that only support the .NET forms. It sounds like jq might be one such environment. Reading the other link, maybe Java or Boost has (?...) without (?P...)? It's unclear to me.

Regarding Java, I read at least three sites, all of which exclusively mentioned the (?< syntax.

For Boost, the documentation says that the Perl syntax is the default behaviour, and details that this supports (?< and (?'. Again, no mention of (?P<.

BurntSushi commented 1 year ago

Why is there such a need for synchronisation? Is there some kind of agreement between the Rust regex crate and RE2 to implement precisely the same syntax?

To clarify here, @junyer and @rsc are RE2 maintainers, and RE2, RE2/J and Go's regexp package are all maintained by folks at Google. So those packages I think generally try to stay very strictly aligned.

There is no synchronization promise with those three and the regex crate though. The regex crate does actually have some substantial differences (like the escaping strategy, although I expect that to change in the direction of RE2's) and also support for character class set operations and nested classes and probably a few other minor things. Still though, I value their input and "consistent with RE2" is, overall, something I value. But not over everything else.

01mf02 commented 1 year ago

I see. Thanks for clarifying your synchronisation policy.

Just on the side: I believe that implementing the (?< syntax implies changing only one line in the code, namely replacing if self.bump_if("?P<") by if self.bump_if("?P<") || self.bump_if("?<"). I would gladly volunteer to submit a PR with this change where I would also write a few tests for the new behaviour. But of course only if you agree that this feature is worth having.

If I can do anything else to convince you about the utility of supporting (?<, please let me know. Aside, I also checked that JavaScript uniquely supports (?<. Furthermore, among two of the most popular regex websites, https://regexr.com/ supports only the (?< syntax and https://regex101.com/ supports both (?< and (?P<. From my research, I have gained the impression that the (?P< syntax is actually more the exception than the norm.

01mf02 commented 1 year ago

I found an interesting bit of history from the Python project that explains among others how the syntax (? came up. It goes further on to explain:

Python supports several of Perl's extensions and adds an extension syntax to Perl's extension syntax. If the first character after the question mark is a P, you know that it's an extension that's specific to Python.

So the P in (?P< stands for a Python-specific extension. In that sense, it reminds me of browser-specific extensions. Like, for example, -moz-animation, which was later standardised and turned into just animation. I suppose that in the same way that people dropped the -moz-prefix, people dropped the P from (?P< as named capture groups proved to be useful beyond Python. Now, keeping to allow the P in the syntax may be justified by compatibility reasons (just like -moz-animation is still accepted in some browsers). At the same time, it would be great to also have a way to express named capture groups without the capital P, which perpetuates that they are a Python-specific extension (which they have ceased to be a long time ago).

BurntSushi commented 1 year ago

@01mf02 The history and original reason for the (?P syntax is indeed interesting, but I think it has almost exactly zero weight on my decision here. Here are the things that matter to me, in no particular order:

  1. Consistency with other regex engines, especially RE2, given the common ancestry.
  2. Keeping the syntax "simple," for some definition of "simple." Having two different syntaxes for accomplishing the same thing is a negative IMO. Basically, what this results in in my experience is that someone learns one syntax, then sees the other syntax and wonders, "wait was I doing it the wrong way? should I switch? what's the difference between them?" We can of course mitigate such things by answering such questions in the docs, but it is remarkably difficult to make such a thing discoverable. It's certainly not something you want to plaster across the introduction, so it tends to get buried in the syntax details. Which is fine... But people are going to get confused. As with other things in this list, I do not value this above everything else. It's just something I consider.
  3. Making the syntax flexible enough to fit into other environments. This is a net positive because it means there's more knowledge transfer from past experience and things tend to "just work" more often than not. I think this is basically what describes your use case here.
  4. There is an overall downside of trying to "make the syntax match other regex engines," because basically other than regex engines that closely and strictly follow an existing specification, no two regex engines behave the same. And so trying to "just make things work" is a long path that doesn't really have an end. I don't think there is a positive of negative here, but it's something to consider.

I think (1) and (2) are where I am at the moment. Unfortunately, there's no real objective criteria to evaluate here.

I am overall leaning towards doing this.

Just on the side: I believe that implementing the (?< syntax implies changing only one line in the code, namely replacing if self.bump_if("?P<") by if self.bump_if("?P<") || self.bump_if("?<"). I would gladly volunteer to submit a PR with this change where I would also write a few tests for the new behaviour. But of course only if you agree that this feature is worth having.

I agree that the patch here is likely quite simple, but it is probably not this simple. Whether (?P<name>expr) or (?<name>expr) is used or not needs to show up in the AST somewhere. So there may be some type definition changes here, and potentially even a breaking change for the regex-syntax crate. (Which is okay. I don't like to do it too often, but I am planning to do one soon.)

rsc commented 1 year ago

Have we identified any regexp implementations other than onigurama that don't implement (?P<name>...)?

Also, is the suggestion to allow both (?<name>...) and (?'name'...) or just the first?

BurntSushi commented 1 year ago

Not sure about (?'name'...) but I found these with some quick searching:

I think that's all I could find at the moment. I think the closest thing to a consensus among non-RE2 engines is "support both (?P<name>...) and (?<name>...)." That seemed like the most common thing, but it's not ubiquitous. A lot of engines support one or the other too. The "support both" is perhaps inflated a bit by the ubiquity of PCRE, which is used as the default regex engine in at least a few places (PHP and Julia come to mind).

BurntSushi commented 1 year ago

Also, is the suggestion to allow both (?<name>...) and (?'name'...) or just the first?

I think the suggestion on the table is just first, as that's what is used by Oniguruma in the context of jq scripts.

The (?'name'...) syntax is one that I very rarely see. I don't think there are any regex engines (that I can recall in my search) that only support (?'name'...).

c-git commented 1 year ago

I hope this comment doesn't distract too much but I really appreciate how @BurntSushi addresses issues raised, explaining his reasoning and so on. I learn so much from just following along and it usually causes me to think about considerations I might have otherwise missed. I just want to say thank you, I really appreciate the time you put into your responses.

01mf02 commented 1 year ago

I second @c-git in that I also value your very detailed responses, @BurntSushi.

And of course I'm happy to read that you are leaning towards implementing my suggestion. I second your observation that (?'name'...) is something that you very rarely see. I've probably seen this syntax more often in documentation than used in actual code. So I am for not implementing this, also to keep the syntax "simple".

Unbelievably, I can't find any authoritative reference for Boost's regex library about what kind of named capture support it has, but some examples in the wild suggest it at least supports (?<name>...) syntax.

The documentation of the boost regex module mentions named capture groups only in the Perl syntax flavour, which says that ?< and ?' are supported. No mention of ?P< here.

I agree that the patch here is likely quite simple, but it is probably not this simple. Whether (?P<name>expr) or (?<name>expr) is used or not needs to show up in the AST somewhere. So there may be some type definition changes here, and potentially even a breaking change for the regex-syntax crate. (Which is okay. I don't like to do it too often, but I am planning to do one soon.)

Ah, I see. I suppose it is for round-tripping? If you wish, I could tackle this. I understand that the ast::Group::CaptureName variant would either need to be extended by some bit that indicates the presence of P (would you consider a boolean?), or a new variant (something like CapturePName) could be introduced. What do you think about this?

BurntSushi commented 1 year ago

Ah, I see. I suppose it is for round-tripping? If you wish, I could tackle this. I understand that the ast::Group::CaptureName variant would either need to be extended by some bit that indicates the presence of P (would you consider a boolean?), or a new variant (something like CapturePName) could be introduced. What do you think about this?

Yes, round-tripping. The point of the AST is that it exhaustively describes the syntax as it is. Lowering it into something simpler and easier to analyze happens in a second pass. (You'll need to make what is likely a trivial change to the AST->HIR translator, also inside of regex-syntax, to accommodate your changes to the AST.)

I think a new variant for GroupKind seems okay? So rename the existing CaptureName to CapturePName and introduce a new CaptureName variant.

rsc commented 1 year ago

Talked to @junyer a bit, and I think this change make sense to do in RE2 and Go as well. I filed https://github.com/golang/go/issues/58458, and assuming it goes through we'll update RE2 and Go in about a month.

BurntSushi commented 1 year ago

@rsc SGTM! If y'all add support for it then I definitely will as well. We might not line up timing wise, but I think that's okay!

01mf02 commented 1 year ago

Great! I'm very happy that we seem to have reached a consensus on this issue. :) I have opened a PR with my proposed changes. Have a good weekend!

BurntSushi commented 1 year ago

This ended up being a very effective feature request. It caused RE2, Go's regexp package and this crate to all start supporting (?<name>expr) syntax in addition to (?P<name>expr). Nicely done @01mf02!

01mf02 commented 1 year ago

Thanks, @BurntSushi!