tc39 / proposal-regexp-named-groups

Named capture groups for JavaScript RegExps
https://tc39.github.io/proposal-regexp-named-groups/

multiple named capturing groups #44

Open Konrud opened 6 years ago

Konrud commented 6 years ago

Perl, Ruby and .NET all allow multiple named capturing groups to share the same name in a regular expression. As of 07.2018, current implementations of named capturing groups in browsers (I've checked Chrome 67 and FF 61) don't allow this. So this regular expression for strict date validation is invalid:

var dateRegExp = /^(?:(?<month>0?2)\/(?<day>[12][0-9]|0?[1-9])|(?<month>0?[469]|11)\/(?<day>30|[12][0-9]|0?[1-9])|(?<month>0?[13578]|1[02])\/(?<day>3[01]|[12][0-9]|0?[1-9]))\/(?<year>(?:[0-9]{2})?[0-9]{2})$/

Would you consider adding support for multiple named capturing groups with the same name? I think it would help a lot. If we are implementing named groups as they exist in other languages, why not implement them thoroughly, with all the features available?
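For reference, a minimal sketch of how the same date check can be written in engines that reject duplicate names: each alternative gets its own suffixed group names and the results are merged after matching. The suffixed names (`month1`, `day1`, ...) and the helper `parseStrictDate` are hypothetical, introduced only for this illustration.

```javascript
// Workaround sketch: unique group names per branch, merged after the match.
// All names ending in 1/2/3 and the helper name are illustrative assumptions.
const dateRegExp = new RegExp(
  '^(?:' +
    '(?<month1>0?2)\\/(?<day1>[12][0-9]|0?[1-9])' +
    '|(?<month2>0?[469]|11)\\/(?<day2>30|[12][0-9]|0?[1-9])' +
    '|(?<month3>0?[13578]|1[02])\\/(?<day3>3[01]|[12][0-9]|0?[1-9])' +
  ')\\/(?<year>(?:[0-9]{2})?[0-9]{2})$'
);

function parseStrictDate(input) {
  const m = dateRegExp.exec(input);
  if (!m) return null;
  const g = m.groups;
  // Only one branch participates in a match; the other groups are undefined.
  return {
    month: g.month1 ?? g.month2 ?? g.month3,
    day: g.day1 ?? g.day2 ?? g.day3,
    year: g.year,
  };
}
```

With this sketch, `parseStrictDate('02/29/2020')` yields the month, day and year of the first branch, while `parseStrictDate('02/30/2020')` returns `null`. The cost is the manual merging step that duplicate names would make unnecessary.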

littledan commented 6 years ago

Hi, thanks for taking the time to write up this suggestion. This proposal is already at Stage 4 and shipped in browsers. Because of that, it's no longer open to further revisions. Any changes from here should be part of a new proposal. Maybe we can pursue this as a needs-consensus pull request. For more information, see https://github.com/tc39/ecma262/blob/master/CONTRIBUTING.md

I'm wondering, have you run into the need for this case in practice?

Konrud commented 6 years ago

Actually, I have run into the need. One example is when I needed to create a RegExp for date recognition, as in the example in my previous message. I think there should be more examples of this. May I ask why you didn't implement this feature when you considered it before? I mean, I'm sure you knew about it.

littledan commented 6 years ago

@Konrud Thanks for the report. I'll think about this some more and chat about it with colleagues. It's possible that it was an error on my part to include this early error, and that nobody caught the design flaw.

tophf commented 5 years ago

This is definitely an oversight and it's really really frustrating.

littledan commented 5 years ago

@tophf Sorry about this! How does it come up for you?

tophf commented 5 years ago

Not sure we should defend an acknowledged use case implemented in PCRE - if it's too hard to implement, why not just document the difference and mark it as WAI? Anyway, similarly to the example above, I have a list of | alternates, each describing a unique input syntax flavor.

Since the alternatives are so different I can't just aggregate them in encompassing named groups like this:

```js
const rules = [
  {a: /rx1a/, b: /rx1b/, c: /rx1c/},
  {a: /rx2a/, b: /rx2b/, c: /rx2c/},
  // ..............
];
new RegExp(
  `(?<a>${rules.map(_ => _.a.source).join('|')})` +
  `(?<b>${rules.map(_ => _.b.source).join('|')})` +
  `(?<c>${rules.map(_ => _.c.source).join('|')})`,
  'g')
```

and then process the matched groups - because this would produce bad pairings like a[1]b[3] leading to a wider/narrower/incorrect match. With the current limitation I have to write a parser-like evaluator or embed lengthy protections into each rule against capturing other rules' stuff.

littledan commented 5 years ago

Sorry, I don't understand how relaxing the restriction on reusing group names would solve that problem. Could you give an example in code of making use of this feature, and what you expect the semantics to be?

tophf commented 5 years ago

Not sure you should rely on my explanations. My point was we are just a few devs who bothered to bring this up here, and even though it's kinda good to feel important like I can influence a decision, but the discussed use case already has lots of examples over its long history so an ideal thing to do initially was to investigate and reuse the existing behavior, but as for now at least investigate it instead of relying on me, a random dude, moreover English is not my native language and I'm not good at explaining things.

littledan commented 5 years ago

I appreciate the time you're putting into this issue, and I would like to make sure your feedback is well-represented in our decision-making process. If you can give a few more details, it'd be helpful. I only see one example here, so a second would be really useful in motivating a change. (There was another issue about not including properties for groups that aren't hit, but I think that amounts to a different proposal from that of the OP here.)

tophf commented 5 years ago

My example is just one case out of thousands of existing ones, but okay, here's what it would look like with duplicates allowed (a, b being the group names):

const rules = [
  /(?:foo)?(?<a>\w+)\s*,\s*(?<b>\d+)(?:bar)?/,
  /\W*(?<a>herp|derp)\s*:\s*(?<b>one|two|three)/,
  // ..............
];

// Backslashes are doubled because this part of the pattern is a string literal.
const rx = new RegExp('(' + rules.map(r => r.source).join('|') + ')\\s*\\|\\s*', 'g');
for (let m; (m = rx.exec(text));) {
  const {a, b} = m.groups;
  // do something with a and b
}

If the rules are produced from user input, with the current JS implementation I would have to scan the entire text once per rule, which could be a lot of times. If the rules are handcrafted, I could combine the first group of each rule into a "decider" expression which would be used to scan the entire text once, and on each exec I would choose the corresponding "tail" expression (with the sticky flag) which would produce its named groups and advance the decider's lastIndex upon success. The second approach is what I meant by "parser-like" in my previous comment.
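A rough sketch of the "decider plus sticky tail" approach described above, under some assumptions: the rule shapes, the `head`/`tail` property names and the `scan` helper are all hypothetical, and each `head` is assumed to be non-empty and free of capture groups.

```javascript
// Hypothetical rules: `head` identifies the rule, `tail` is a sticky regex
// that extracts the named groups starting exactly where the head ended.
const rules = [
  { head: 'foo', tail: /(?<a>\w+)\s*,\s*(?<b>\d+)/y },
  { head: 'cmd:', tail: /\s*(?<a>herp|derp)\s*:\s*(?<b>one|two|three)/y },
];

// One capturing group per head, so the position of the group that matched
// tells us which rule to continue with.
const decider = new RegExp(rules.map(r => `(${r.head})`).join('|'), 'g');

function scan(text) {
  const results = [];
  for (let m; (m = decider.exec(text)); ) {
    // m[1..n] correspond to rules[0..n-1]; exactly one of them is defined.
    const ruleIndex = m.slice(1).findIndex(g => g !== undefined);
    const tail = rules[ruleIndex].tail;
    tail.lastIndex = decider.lastIndex; // continue right after the head
    const t = tail.exec(text);
    if (t) {
      results.push({ rule: ruleIndex, a: t.groups.a, b: t.groups.b });
      decider.lastIndex = tail.lastIndex; // skip the text the tail consumed
    }
  }
  return results;
}
```

The text is scanned once by the decider, and every rule still exposes its own `a` and `b` groups without any name clash. The bookkeeping with `lastIndex` is what duplicate group names would remove.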

littledan commented 5 years ago

OK, thanks for explaining, I can see how this comes up in that case. If you can bear with me just slightly longer, I'm curious: can you say a little more about the context in which this sort of issue has come up in a code base you're aware of?

tophf commented 5 years ago

I don't think there are any JS repos worth mentioning that stumble on this, since everyone knows how limited the JS regexp engine is compared to PCRE, so people either use a custom extended regexp library or switch to another language altogether. In the future, though, implementing this feature would allow all kinds of customizable scraping of text forms, documents, etc. Personally I think any regexp engine should strive to be as close to PCRE as possible within the constraints of effort/performance/size bandwidth.

hg42 commented 3 years ago

Hi, I am an old perl user and as such I am often puzzled by discussions like this one.

From my POV multiple occurrences are mandatory.

One big use case is in parser like situations, especially where you combine alternate syntax rules in one regexp. The date parsing above is a simple example of this. In my programming life, I used this a lot. e.g. yyyy-mm-dd vs. mm/dd/yyyy vs. dd.mm.yyyy

In this parsing use case, you could actually match each parsing rule separately and sequentially. However, you want to combine them into one big regexp (often computed from an array of values), because it matches much faster this way, especially if you have a lot of rules and multiple decision points. The combined regexp is an optimal solution, because it uses the regexp tree to walk through all the rules in parallel (which is the real power of regexps).

In most cases you have subexpressions where you can distinguish the paths, e.g. for git-like commands you usually have a command name and sometimes even subcommand names. You often use a switch on these values and then simply use the other names in that rule to process its parameters.

Another use case is parsing alternate sequences of the same data, like the date example. You often find this in natural languages or human input. In this case you directly use the values.

hg42 commented 3 years ago

https://www.regular-expressions.info/named.html has a section about this topic: "Multiple Groups with The Same Name" that is a nice overview. So, all these implementations are different. This is sad...

Though, I mostly used these two variants:

  • alternative branches, where only one of the groups with the same name matches: with (?<a>expr1)|(?<a>expr2), the group a contains the value from the branch that matched
  • collecting group values into arrays (mostly usable with special perl language constructs, so not relevant here)
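To check how a given JavaScript engine treats the first variant, a small feature test can be used. This is only a sketch: the helper name `supportsDuplicateNamedGroups` is hypothetical, and it relies on the fact that engines without support throw a SyntaxError when the pattern is compiled.

```javascript
// Returns true if the engine accepts the same group name in two alternatives.
// Engines that reject duplicate names throw a SyntaxError at construction.
function supportsDuplicateNamedGroups() {
  try {
    new RegExp('(?<a>x)|(?<a>y)');
    return true;
  } catch (e) {
    return false;
  }
}

if (supportsDuplicateNamedGroups()) {
  // Only the branch that matched contributes a value for `a`.
  const m = new RegExp('(?<a>x)|(?<a>y)').exec('y');
  console.log(m.groups.a);
} else {
  console.log('duplicate group names are rejected by this engine');
}
```

The pattern is built with `new RegExp` rather than a literal so that an unsupported engine fails at run time inside the `try` block instead of refusing to parse the whole script.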

The last situation I remember was parsing output lines of compilers and other tools. The expression was built from a collection of expressions that describe output formats, e.g.

{
  compiler_a: "(?<file>.+):(?<line>[0-9]+): *(?<message>.*)",
  compiler_b: "in (?<file>.+), at (?<line>[0-9]+): *(?<message>.*)",
  compiler_c: "(?<message>.*), at (?<line>[0-9]+) in file (?<file>.+)",
  ...
}

so I could match all with one expression (simply joined with "|" only once and then cached) and use the three groups. Because I didn't have the feature in JavaScript, I was forced to use a workaround: either using a loop (slow) or numbering the names and finding the one that wasn't empty (complicated).
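The "numbering the names" workaround might look roughly like the sketch below. The renaming scheme (`file_0`, `file_1`, ...), the anchoring with `^` and `$`, and the helper `parseLine` are assumptions made for illustration; the three formats are the ones listed above.

```javascript
// Output formats from the comment above (the elided entries are left out).
const formats = {
  compiler_a: '(?<file>.+):(?<line>[0-9]+): *(?<message>.*)',
  compiler_b: 'in (?<file>.+), at (?<line>[0-9]+): *(?<message>.*)',
  compiler_c: '(?<message>.*), at (?<line>[0-9]+) in file (?<file>.+)',
};

// Rename (?<name> to (?<name_N> so every group name is unique per branch.
// Caveat: this simple rewrite does not skip escaped sequences like \(?<x>.
const branches = Object.values(formats).map((src, i) =>
  src.replace(/\(\?<([A-Za-z_$][\w$]*)>/g, (_, name) => `(?<${name}_${i}>`)
);

// Built once and reused for every line.
const combined = new RegExp(`^(?:${branches.join('|')})$`);

function parseLine(line) {
  const m = combined.exec(line);
  if (!m) return null;
  const out = {};
  for (const [key, value] of Object.entries(m.groups)) {
    if (value === undefined) continue; // group from a branch that did not match
    out[key.replace(/_\d+$/, '')] = value; // strip the numeric suffix again
  }
  return out;
}
```

For example, `parseLine('in util.c, at 7: missing semicolon')` produces an object with `file`, `line` and `message` keys regardless of which format matched. The renaming and the suffix stripping are exactly the extra steps that shared group names would make unnecessary.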

bakkot commented 3 years ago

I'm surprised that other languages give the first group which participated, rather than the last. I would expect later ones to clobber earlier ones, as happens when you hit the same group multiple times (/(.)+/.exec("ab")[1] gives b, not a).

hg42 commented 3 years ago

perl -e '"ab" =~ m/(.)+/; print $1' prints b and my tests on https://regex101.com/ also result in b (not sure if they really use the original code) so the tutorial may be wrong or outdated

But your example isn't "Multiple Groups with The Same Name", it's a numbered group. that said, perl -e '"abc" =~ m/(?P<x>.)(?P<x>.)/; print $+{x}' results in a, as described. and regex101 rejects it for all languages

bakkot commented 3 years ago

I know my example is numbered groups. I was drawing an analogy: if you execute a numbered group multiple times, the match object ends up with the value from the last time it is executed. It's surprising, then, that if you execute the groups with the same name on different occasions, the match object ends up with the value from the first match.
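The analogy can be checked directly in JavaScript. This is a small illustration of the existing behavior for a single group inside a quantifier, not of duplicate names:

```javascript
// A numbered group executed twice: the value from the last iteration is kept.
const numbered = /(.)+/.exec('ab');
console.log(numbered[1]); // 'b'

// The same holds for a single named group inside a quantifier.
const named = /(?<x>.)+/.exec('ab');
console.log(named.groups.x); // 'b'
```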

hg42 commented 3 years ago

now I understand, I had a similar thought... perl is interesting: https://www.regexplanet.com/advanced/perl/index.html

(?P<x>.)(?P<x>.)
->
$-{x}[0]=a
$-{x}[1]=b

.(?P<x>.)|(?P<x>.).
->
$-{x}[0]=b
$-{x}[1] is undef

(?|.(?P<x>.)|(?P<x>.).)
->
$+{x}=b

bakkot commented 2 years ago

This proposal is finished and the repo is being archived, so discussion can't continue here. I've created a new repo to discuss this proposal, and I invite further discussion and contributions there: https://github.com/bakkot/proposal-duplicate-named-capturing-groups