Open Konrud opened 6 years ago
Hi, thanks for taking the time to write up this suggestion. This proposal is already at Stage 4 and shipped in browsers. Because of that, it's no longer open to further revisions. Any changes from here should be part of a new proposal. Maybe we can pursue this as a needs-consensus pull request. For more information, see https://github.com/tc39/ecma262/blob/master/CONTRIBUTING.md
I'm wondering, have you run into the need for this case in practice?
Actually, I have run into the need. One example is when I needed to create a RegExp for date recognition, as in the example I wrote in my previous message. I think there should be more examples of using this. May I ask why you didn't implement this feature when you considered it before? I mean, I'm sure you knew about it.
@Konrud Thanks for the report. I'll think about this some more and chat about it with colleagues. It's possible that it was an error on my part to include this early error, and that nobody caught the design flaw.
This is definitely an oversight and it's really really frustrating.
@tophf Sorry about this! How does it come up for you?
Not sure we should defend an acknowledged use case implemented in PCRE - if it's too hard to implement, why not just document the difference and mark it as WAI? Anyway, similarly to the example above, I have a list of | alternates, each describing a unique input syntax flavor.
Sorry, I don't understand how relaxing the restriction on reusing group names would solve that problem. Could you give an example in code of making use of this feature, and what you expect the semantics to be?
Not sure you should rely on my explanations. My point was that we are just a few devs who bothered to bring this up here. It's kinda good to feel important, like I can influence a decision, but the discussed use case already has lots of examples over its long history, so the ideal thing to do initially was to investigate and reuse the existing behavior. For now, at least investigate it instead of relying on me, a random dude; moreover, English is not my native language and I'm not good at explaining things.
I appreciate the time you're putting into this issue, and I would like to make sure your feedback is well-represented in our decision-making process. If you can give a few more details, it'd be helpful. I only see one example here, so a second would be really useful in motivating a change. (There was another issue about not including properties for groups that aren't hit, but I think that amounts to a different proposal from that of the OP here.)
My example is just one case out of the thousands of existing ones, but okay, here's how it would look with duplicates allowed (a, b being the group names):
const rules = [
/(?:foo)?(?<a>\w+)\s*,\s*(?<b>\d+)(?:bar)?/,
/\W*(?<a>herp|derp)\s*:\s*(?<b>one|two|three)/,
// ..............
];
const rx = new RegExp('(' + rules.map(r => r.source).join('|') + ')\\s*\\|\\s*', 'g');
for (let m; (m = rx.exec(text));) {
const {a, b} = m.groups;
// do something with a and b
}
If the rules are produced from user input, then with the current JS implementation I would have to scan the entire text once per rule, which could be a lot of times. If the rules are handcrafted, I could combine the first group into a "decider" expression which would be used to scan the entire text once, and on each exec I would choose a corresponding "tail" expression (with the sticky flag) which would produce its named groups and advance the decider's lastIndex upon success. The second approach is what I meant by "parser-like" in my previous comment.
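A rough sketch of the "decider" plus sticky "tail" workaround described above, for engines that reject duplicate group names. The rule table, the `head`/`tail` split, and the `scan` helper are hypothetical names made up for illustration; the sketch also assumes the heads contain no capturing groups of their own.

```javascript
// Hypothetical rules: "head" identifies the syntax flavor, "tail" is a sticky
// regexp that extracts the shared group names a and b right after the head.
const rules = [
  { head: '\\bset\\s+', tail: /(?<a>\w+)\s*=\s*(?<b>\d+)/y },
  { head: '\\blet\\s+', tail: /(?<a>\w+)\s*:\s*(?<b>one|two|three)/y },
];

// One numbered group per rule, so the decider tells us which head matched.
const decider = new RegExp(rules.map(r => '(' + r.head + ')').join('|'), 'g');

function scan(text) {
  const results = [];
  decider.lastIndex = 0;
  for (let m; (m = decider.exec(text));) {
    // Exactly one numbered group participated; its position picks the rule.
    const i = m.findIndex((v, k) => k > 0 && v !== undefined) - 1;
    const tail = rules[i].tail;
    // The sticky flag makes the tail match only at this exact position.
    tail.lastIndex = decider.lastIndex;
    const t = tail.exec(text);
    if (t) {
      results.push({ rule: i, a: t.groups.a, b: t.groups.b });
      // Resume the single scan after the tail that was just consumed.
      decider.lastIndex = tail.lastIndex;
    }
  }
  return results;
}
```

The text is scanned once by the decider, and each tail only runs where its head was found. With duplicate names allowed, the whole table could instead collapse into one alternation.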
OK, thanks for explaining, I can see how this comes up in that case. If you can bear with me just slightly longer, I'm curious, can you say a little more about the context in which this sort of issue has come up in a code base you're aware of in the past?
I don't think there are any JS repos worth mentioning that stumble on this, since everyone knows how limited the JS regexp engine is compared to PCRE, so people either use a custom extended regexp library or switch to another language altogether. In the future, though, implementing this feature would allow all kinds of customizable scraping of text forms, documents, etc. Personally I think any regexp engine should strive to be as close to PCRE as possible within the constraints of effort/performance/size bandwidth.
Hi, I am an old perl user and as such I am often puzzled by discussions like this one.
From my POV multiple occurrences are mandatory.
One big use case is parser-like situations, especially where you combine alternate syntax rules in one regexp. The date parsing above is a simple example of this. In my programming life, I have used this a lot, e.g. yyyy-mm-dd vs. mm/dd/yyyy vs. dd.mm.yyyy.
In this parsing use case, you could actually match each parsing rule separately and sequentially. However, you want to combine them into one big regexp (often computed from an array of values), because it matches much faster this way, especially if you have a lot of rules and multiple decision points. The combined regexp is an optimal solution, because it uses the regexp tree to walk through all the rules in parallel (which is the real power of regexps).
In most cases you have subexpressions where you can distinguish the paths, e.g. for git-like commands you usually have a command name and sometimes even subcommand names. You often use a switch on these values and then simply use the other names in that rule to process its parameters.
Another use case is parsing alternate sequences of the same data, like the date example. You often find this in natural languages or human input. In this case you directly use the values.
https://www.regular-expressions.info/named.html has a section about this topic: "Multiple Groups with The Same Name" that is a nice overview. So, all these implementations are different. This is sad...
Though, I mostly used these two variants:
(?<a>expr1)|(?<a>expr2)
the group a contains the value from the branch that matched
The last situation I remember was parsing output lines of compilers and other tools. The expression was built from a collection of expressions that describe output formats, e.g.
{
compiler_a: "(?<file>.+):(?<line>[0-9]+): *(?<message>.*)",
compiler_b: "in (?<file>.+), at (?<line>[0-9]+): *(?<message>.*)",
compiler_c: "(?<message>.*), at (?<line>[0-9]+) in file (?<file>.+)",
...
}
so I could match all with one expression (simply joined with "|" only once and then cached) and use the three groups. Because I didn't have the feature in JavaScript, I was forced to use a workaround: either using a loop (slow) or numbering the names and finding the one that wasn't empty (complicated).
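For illustration, the "numbering the names" workaround might look roughly like this. It is only a sketch: the three formats are the ones quoted above, while `combined`, `parseLine`, and the `_0`/`_1` suffix scheme are hypothetical names, and the rewrite assumes the patterns contain no escaped `\(?<` sequences.

```javascript
// The three output formats quoted above.
const formats = {
  compiler_a: '(?<file>.+):(?<line>[0-9]+): *(?<message>.*)',
  compiler_b: 'in (?<file>.+), at (?<line>[0-9]+): *(?<message>.*)',
  compiler_c: '(?<message>.*), at (?<line>[0-9]+) in file (?<file>.+)',
};

// Append a per-rule suffix (file_0, file_1, ...) so that the combined pattern
// contains no duplicate group names. Built once and reused.
const combined = new RegExp(
  '^(?:' +
    Object.values(formats)
      .map((src, i) => src.replace(/\(\?<(\w+)>/g, `(?<$1_${i}>`))
      .join('|') +
    ')$'
);

function parseLine(line) {
  const m = combined.exec(line);
  if (!m) return null;
  const out = {};
  for (const [key, value] of Object.entries(m.groups)) {
    if (value === undefined) continue;      // group belongs to a branch that did not match
    out[key.replace(/_\d+$/, '')] = value;  // strip the suffix: file_1 -> file
  }
  return out;
}
```

Calling `parseLine('in main.c, at 10: oops')` yields an object with `file`, `line`, and `message` properties regardless of which branch matched, which is what duplicate names would provide directly.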
I'm surprised that other languages give the first group which participated, rather than the last. I would expect later ones to clobber earlier ones, as happens when you hit the same group multiple times (/(.)+/.exec("ab")[1] gives b, not a).
perl -e '"ab" =~ m/(.)+/; print $1' prints b, and my tests on https://regex101.com/ also result in b (not sure if they really use the original code), so the tutorial may be wrong or outdated.
But your example isn't "Multiple Groups with The Same Name", it's a numbered group.
That said, perl -e '"abc" =~ m/(?P<x>.)(?P<x>.)/; print $+{x}' results in a, as described, and regex101 rejects it for all languages.
I know my example is numbered groups. I was drawing an analogy: if you execute a numbered group multiple times, the match object ends up with the value from the last time it is executed. It's surprising, then, that if you execute the groups with the same name on different occasions, the match object ends up with the value from the first match.
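The "last execution wins" behaviour behind this analogy can be checked in JavaScript with a single group, numbered or named, that runs more than once. This is a small sketch; the variable names and the group name `ch` are arbitrary.

```javascript
// A quantified group executes once per character; the match object keeps
// only the capture from the final iteration.
const repeated = /(.)+/.exec('ab');
console.log(repeated[1]);      // the capture from the last iteration

// The same holds when the (single) group is named.
const named = /(?<ch>.)+/.exec('ab');
console.log(named.groups.ch);  // again the last iteration's capture
```

Both lines log b. That is the clobbering one might expect to carry over to same-named groups executed at different points in the pattern.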
now I understand, I had a similar thought... perl is interesting: https://www.regexplanet.com/advanced/perl/index.html
(?P<x>.)(?P<x>.)
->
$-{x}[0]=a
$-{x}[1]=b
.(?P<x>.)|(?P<x>.).
->
$-{x}[0]=b
$-{x}[1] is undef
(?|.(?P<x>.)|(?P<x>.).)
->
$+{x}=b
This proposal is finished and the repo is being archived, so discussion can't continue here. I've created a new repo to discuss this proposal, and I invite further discussion and contributions there: https://github.com/bakkot/proposal-duplicate-named-capturing-groups
Perl, Ruby and .NET all allow multiple named capturing groups to share the same name in the regular expression. As of 07.2018, current implementations of named capturing groups in browsers (I've checked it in Chrome 67 and FF 61) don't allow this. So this regular expression for strict date analysis is invalid:
Would you consider adding support for multiple named capturing groups with the same name? I think it may help a lot. If we started to implement it as it is in the other languages, why not implement it thoroughly, with all the features available?