Options for disambiguating the \k backreferences

littledan commented 7 years ago

Backreferences to named capture groups have the surface syntax \k<foo>, but that already has semantics in non-Unicode RegExps. A few options have been discussed for the issue:

1. Named capture groups only usable in Unicode mode. This was the original proposal, and it's what V8 implements behind a flag. This is one reason why we reserved extra sequences like \k to be a syntax error in Unicode mode: so we could add new features this way. It's the simplest option, and it would give people a carrot to upgrade to using the Unicode flag. Going from 1, we could "always" add 2 or 3 "later". @mathiasbynens and @hashseed have argued for this minimal option.
2. Named capture groups can be used outside of Unicode mode, but named backreferences are only available in Unicode mode. There seemed to be some concern from the committee that this is something of an unexpected cliff in the middle of the feature. Another argument against it is that we shouldn't add new things to non-Unicode RegExps at all, so as to encourage people to flip the flag on. This was my funny idea.
3. Disambiguate by giving \k the new semantics if there are any named capture groups. This is definitely possible, but more complicated than one might think at first. If there are no named capture groups, then \k can be anywhere, but otherwise, it needs to be followed by < IdentifierName >; this complicates the grammar. Another piece of complexity is that an implementation can't determine on-line (in a single pass) whether there are named capture groups if lookbehind is in play, because lookbehind executes the RegExp backwards, and this affects captured groups. For example, /(?<=\k<a>(?<a>.))/ matches a zero-length sequence which is preceded by the same character twice. It's definitely unambiguous, just complicated. This was @bakkot's suggestion.
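To make the three readings of \k concrete, here is a small sketch. It assumes an engine that implements named groups and lookbehind with the option 3 semantics (as later standardized in ES2018); the variable names are only illustrative.

```javascript
// Named group plus named backreference: matches any doubled character.
const doubled = /(?<ch>.)\k<ch>/;
const doubledHit = doubled.test("aa");   // true
const doubledMiss = doubled.test("ab");  // false

// No named groups and no u flag: \k is just an escaped "k",
// so the pattern matches the literal text "k<ch>".
const legacyK = /\k<ch>/.test("k<ch>");  // true

// The lookbehind example from option 3. The lookbehind body runs
// right-to-left, so (?<a>.) captures before \k<a> is evaluated,
// even though \k<a> is written first.
const behind = /(?<=\k<a>(?<a>.))/.exec("abb");
// behind.index is 3 (just after "bb"), behind[0] is "", behind.groups.a is "b"
```

Note how the backreference in the last pattern appears to the left of the group it refers to, which is why a single left-to-right scan cannot settle the meaning of \k.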

At the September TC39 meeting, we seemed to come to consensus on 3; however, this was without incorporating some feedback from people not present in the room, and without a full understanding of the complexity of 3. With the complexity of 3, and the weird cliff of 2, I'm personally leaning back towards 1. OTOH, 3 feels the most "1JS-y" to me. Any thoughts?

mathiasbynens commented 7 years ago

IMHO the u flag is to regular expressions as strict mode is to all of JavaScript — it should be considered the new default. Any new features should be made available to Unicode mode only. Improving non-Unicode mode is not worth even the smallest amount of complexity.

hashseed commented 7 years ago

I'm a bit torn on this. I also have the sentiment that /u is the strict mode of regexp. But then again, you wouldn't require /u for new features such as lookbehind assertions, right? \k outside of /u is unambiguous, so that argument is void.

littledan commented 7 years ago

@mathiasbynens We made the opposite decision with strict mode. Classes, let, destructuring, etc are all available in sloppy mode, based on @dherman 's 1JS policy, at the cost of a lot of design and implementation complexity, but with the benefit that users don't have to think as much about which mode they're in, and it's easier to incrementally adopt new features in code bases that might be difficult to switch to strict mode.

littledan commented 7 years ago

Waldemar Horwat pointed out to me that we could implement @bakkot's disambiguation strategy (#3 above) in the spec by reparsing from the beginning if a named group is encountered when not in named group mode. In a conversation with @schuay, @hashseed and @ErikCorryGoogle, it was noted that V8's RegExp engine already does this internally in some cases, so it seems like we could add another bail-out and reparse-with-a-flag case for when a \k precedes a named group.
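A rough sketch of the idea, written as a pre-scan rather than a true bail-out-and-reparse: look for any named group first, then decide how \k should be read. The helpers hasNamedGroup and meaningOfK are hypothetical names for illustration; this is not V8's code or the spec's algorithm.

```javascript
// Hypothetical helper: does the pattern source contain "(?<name>"?
// Escaped characters and character classes are skipped, and the
// lookbehind forms "(?<=" and "(?<!" are not counted as named groups.
function hasNamedGroup(source) {
  let inClass = false;
  for (let i = 0; i < source.length; i++) {
    const ch = source[i];
    if (ch === "\\") { i++; continue; }  // skip the escaped character
    if (inClass) {
      if (ch === "]") inClass = false;
      continue;
    }
    if (ch === "[") { inClass = true; continue; }
    if (source.startsWith("(?<", i)) {
      const next = source[i + 3];
      if (next !== "=" && next !== "!") return true;
    }
  }
  return false;
}

// Hypothetical helper: how should \k be read under option 3?
function meaningOfK(source, unicode) {
  if (unicode || hasNamedGroup(source)) return "named-backreference";
  return "literal-k";
}

const early = meaningOfK("\\k<a>(?<a>.)", false);  // "named-backreference"
const none = meaningOfK("\\k<a>", false);          // "literal-k"
const lookbehindOnly = hasNamedGroup("(?<=x)y");   // false
```

A real parser would more likely parse optimistically and restart only when it hits a conflict, as described above; the pre-scan just makes the decision rule easy to see.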

Although I changed the explainer and spec text to be Unicode-only, I'm now having second thoughts, since supporting non-Unicode mode seems pretty doable on both the spec and implementation sides. Thoughts?

bakkot commented 7 years ago

@littledan I think you could even do a bail-out and reparse-with-a-flag if you encountered a \k not immediately followed by <...>, or a \k<...> not preceded or followed by a (?<...>). I expect this almost never happens in practice, so I'd hope it would be a negligible performance hit.

Personally I like my proposal (😄) and I think it's the one most likely to satisfy the committee, if engines feel it's implementable.

littledan commented 7 years ago

Any opinions from other engines? cc @jswalden @syg @bterlson @msaboff @kmiller68

jswalden commented 7 years ago

My knowledge of new (-er than ES5!) regular expression syntax/semantics/flags is minimal for lack of time to follow it (and, somewhat, comparative disinterest). So whether you should listen to this at all, I dunno. And I'm probably unlikely to care too strongly if anyone or the spec thinks differently from me. :-)

If we step past that, I mildly tend toward option 1 for simplicity. Option 2 just seems an awkward splitting of the baby to me: a half-working feature, sometimes. Option 3 is better than 2. But reparsing things is always inelegant, and it's IMO much easier to explain that the functionality just isn't available in non-Unicode regular expressions (versus some more complicated explanation that it's available only sometimes, depending on what other functionality is used -- pattern syntax is hoary enough, and we want to introduce even more complexity?).

The one demerit to option 1 is that you can't just tack a u onto your regular expression to use captures, because u changes other stuff. I'd bet lots of regular expression use parses presumed-ASCII sorts of text -- programmatic grammars, say. If someone tacks a u onto a regex that uses \w to use named captures, (say) an HTTP parser could match more things than intended. This feels like not enough hazard to preclude option 1, but it's not a slam-dunk.
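A few illustrations of how tacking on u changes behavior beyond enabling new syntax. These are standard differences between the two modes as I understand them; the \w case only shows up in combination with the i flag. The helper compiles is a hypothetical name.

```javascript
// "." matches a whole code point under u, but a single UTF-16 code unit without it.
const pile = "\u{1F4A9}";                   // one code point, two code units
const dotLegacy = /^.$/.test(pile);         // false
const dotUnicode = /^.$/u.test(pile);       // true

// With both i and u, \w also matches characters that case-fold into
// ASCII word characters, such as U+017F (LATIN SMALL LETTER LONG S).
const longS = "\u017F";
const wordLegacy = /^\w$/i.test(longS);     // false
const wordUnicode = /^\w$/iu.test(longS);   // true

// Hypothetical helper: does the source compile with the given flags?
function compiles(source, flags) {
  try {
    new RegExp(source, flags);
    return true;
  } catch (e) {
    return false;
  }
}

// Unnecessary escapes that are tolerated without u become syntax errors with it.
const dashLegacy = compiles("\\-", "");     // true
const dashUnicode = compiles("\\-", "u");   // false
```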

One final overriding consideration: given that we just embed a minimally-modified fork of V8's irregexp, it's doubtful that any option really has different implementation burden for us. Considering only implementation effort, SpiderMonkey probably doesn't care what's decided.

littledan commented 7 years ago

@jswalden, when you say simplicity and minimalism, do you mean simplicity for users, spec authors, or implementers? It's hard for me to see how 3 creates complexity for users, though it definitely creates additional complexity for spec authors and implementers.

mathiasbynens commented 7 years ago

With option 3, what happens if \k is used without any named capture groups in combination with the u flag? /\k/u throws, currently — IMHO it would be confusing if this proposal made it not throw anymore despite that code sample not really using any new functionality.

littledan commented 7 years ago

In the current spec text, it would continue to throw a syntax error, as \k is required to be followed by a group name in a Unicode RegExp. Also, /\k<foo>/u throws, as there is no GroupSpecifier named foo.
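The cases above can be checked with a small sketch, assuming an engine with the option 3 semantics. RegExp is constructed from strings so that a rejected pattern surfaces as a catchable error; compileError is a hypothetical helper.

```javascript
// Hypothetical helper: name of the error thrown when compiling, or null.
function compileError(source, flags) {
  try {
    new RegExp(source, flags);
    return null;
  } catch (e) {
    return e.name;
  }
}

const bareKUnicode = compileError("\\k", "u");            // "SyntaxError"
const danglingUnicode = compileError("\\k<foo>", "u");    // "SyntaxError"

// Without u and without any named group, \k<foo> is still accepted.
const danglingLegacy = compileError("\\k<foo>", "");      // null

// Once any named group is present, \k is held to the strict rules
// even without the u flag.
const wrongName = compileError("(?<bar>.)\\k<foo>", "");  // "SyntaxError"
const bareKWithGroup = compileError("(?<bar>.)\\k", "");  // "SyntaxError"
```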

bterlson commented 7 years ago

IMO:

2 is no good.

1 and 3, hard to say without seeing how complex things get, but I did hear some argue in TC39 that there are strong use cases for non-Unicode regexps. I don't necessarily agree, but if it's true, option 3 seems better than option 1. If /u is a regexp strict mode then option 1 is better.

ErikCorryGoogle commented 7 years ago

I'm happy with all, really.

re: 1: Unicode regexps have unavoidable complexity that will always make them slower.

re: 2: Bear in mind that backreferences are much rarer than captures, and it is likely that named backreferences will also be much rarer than named captures.

re: 3: Parsing the regexps twice is something you already have to support in non-u mode. See https://hackernoon.com/the-madness-of-parsing-real-world-javascript-regexps-d9ee336df983#.hf2oxfg4f for some details if you didn't already.

hashseed commented 7 years ago

Casting my vote for 3. User ergonomics is more important than implementation details imo.

I remember a while ago, for indexed back references, when we encountered a reference with an index that is not yet assigned to a group, we would parse ahead to figure out whether the index is a valid one, or has to be treated verbatim. This is something we could do here as well.
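For comparison, this is how indexed backreferences already behave in non-Unicode patterns, where the meaning of \N depends on how many groups the whole pattern contains. A small sketch; the variable names are illustrative.

```javascript
// \1 appears before group 1 is defined. It is still a backreference;
// the group has not captured yet, so the reference matches the empty string.
const forwardRef = /^\1(a)$/.test("a");   // true

// No groups at all: \8 cannot be a backreference, so it is read as a literal "8".
const literalEight = /^\8$/.test("8");    // true

// No groups at all: \1 falls back to a legacy octal escape (U+0001).
const octalOne = /^\1$/.test("\x01");     // true

// Under the u flag there is no fallback; an out-of-range index is rejected.
let unicodeError = null;
try {
  new RegExp("\\8", "u");
} catch (e) {
  unicodeError = e.name;                  // "SyntaxError"
}
```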

jswalden commented 7 years ago

@littledan I think it creates complexity for users by giving \k different meanings within flag-based categorization of regular expressions. To analogize with respect to @ErikCorryGoogle's post, option 3 makes \k another \c -- its meaning wackily depends upon other things around it. In a non-Unicode pattern, it's the literal character k if there aren't named capture groups, otherwise it's the captured text. Seems equally wacky. I think it would also create problems for users, for two reasons.

One, regular expression syntax is very deep, and it varies widely across languages making it harder to know. So it's easy to misremember a particular bit of syntax (like that for named capture groups) and fail to produce one, with the result that an intended backreference isn't one. Under option 1, such misrecollection is a syntax error because the necessarily-Unicode flag makes it so. Under option 3, misrecollection isn't a syntax error when not Unicode.

Two, even someone remembering the syntax might mistype. Under option 1, the Unicode flag being less forgiving of typos results in a syntax error. Under option 3, if no Unicode flag, mistyping produces misbehavior that may be silent at runtime.
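A sketch of the kind of slip described here, assuming option 3 semantics. The misremembered group syntax (<q>...) is a made-up example of a mistake, not something taken from another language.

```javascript
// Intended: (?<q>['"]).*\k<q>  (a quote, anything, then the same quote).
// Written with a misremembered group syntax that drops the "?".
const misremembered = "(<q>['\"]).*\\k<q>";

// Without u this compiles: "<q>" and "k<q>" are read as literal text.
const silent = new RegExp(misremembered);
const matchesQuoted = silent.test("'hi'");         // false: silently never matches quotes
const matchesLiteral = silent.test("<q>'hik<q>");  // true: matches the literal text instead

// With u, the same source is rejected up front.
let strictError = null;
try {
  new RegExp(misremembered, "u");
} catch (e) {
  strictError = e.name;                            // "SyntaxError"
}
```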

littledan commented 7 years ago

@jswalden I'm not excited to go into this world of wackiness either, but I think it's mitigated by a couple things:

\k in non-Unicode RegExps is sort of not a concept at all--it's a fallback because \k has no meaning
It's hard for me to picture why someone would type a named backreference without any named groups in the RegExp. Is the scenario that they are refactoring and edit out the last named group?
If someone does make the mistake, it seems relatively easy to debug--the resulting RegExp will fail relatively loudly, by matching nothing.
Even if we go with 1, there is still a possible (though arguably smaller) hazard where a user removes the u from the RegExp (say they want ascii-only case insensitivity for their application, and are doing this as part of a bug fix), sees complaints about the syntax error from having the named groups, so they remove the names as they are being captured, but leave in the \k references. Then they'll encounter a similar problem of, unexpectedly, trying to match the "k<foo>" string.

This feels like a place where @dherman 's 1JS logic makes sense--it's probably easier for users to conceptualize one RegExp language, even if it grows some wonky edge cases, and grow new features in both the old and new side, because it helps us meet users where they are, and create the feeling of one big almost-coherent language rather than multiple divergent languages.

littledan commented 7 years ago

At the January TC39 meeting, a concern was raised about how well the grammar for all this works out. In the end, it seems to have worked out just fine. I'm going to leave the proposal at Option 3 here, unless issues come up in implementation.
