tc39 / proposal-regexp-named-groups

Named capture groups for JavaScript RegExps
https://tc39.github.io/proposal-regexp-named-groups/
222 stars 21 forks source link

GetSubstitution: reference to invalid group name in replacement string #14

Closed schuay closed 7 years ago

schuay commented 7 years ago

GetSubstitution currently does not distinguish between references to capture names that:

All of these cases are replaced with the empty string.

“ab”.replace(/(?<fst>a)|(?<snd>b)/u, “$<42$1>”));     // “b” (Invalid name)
“ab”.replace(/(?<fst>a)|(?<snd>b)/u, “$<snd>”));      // “b” (Valid but unmatched name)
“ab”.replace(/(?<fst>a)|(?<snd>b)/u, “$<thd>”));      // “b” (Valid but non-existent name)
“ab”.replace(/(?<fst>a)|(?<snd>b)/u, “$<fst>”));      // “ab” (Valid matched name)

This is somewhat consistent with the behavior of numbered references, but we could be stricter here and throw a SyntaxError if an invalid name is referenced.

What do you think?

mathiasbynens commented 7 years ago

Throwing on invalid name references sounds good to me. I assume this would be a run-time error — could it reasonably be an early error?

hashseed commented 7 years ago

I had the impression that SyntaxErrors are early errors. Then again it's debatable whether this qualifies as early error, if you compare to eval.

bakkot commented 7 years ago

Not all SyntaxErrors are early errors: RegExp('(') throws one at runtime, e.g.

@mathiasbynens, I don't know how this could be an early error, since the issue is in a normal string parameter to a function.

hashseed commented 7 years ago

I'd classify early errors as ones that happen during parsing. E.g. thrown from JSON.parse, eval, Function constructor, RegExp constructor, etc.

In this case you could argue that the replace string needs to be parsed, and when we find an error at that time, it qualifies as early error, therefore a SyntaxError.

littledan commented 7 years ago

My reading of what an early error is is the same as @hashseed 's (e.g., @bakkot 's example is an early error as it's an eval-like construct), but I don't know what this has to do with SyntaxError. You get a SyntaxError from JSON.parse('abc') currently, for example.

What distinguishes an early error is that it happens before evaluation. In the case of these format strings, we could separate out a distinct time in String.prototype.replace, before the other observable Get calls are done. Would one or the other (throw when reached) be easier for implementers?

I'd imagine that, at the very least, valid and existing but unmatched names would be replaced by the empty string. This would be an analogous decision to creating the groups as properties with the value undefined in the groups object.

Here's a possible semantics to distinguish them: Call [[GetOwnProperty]] on the Groups object. If the property isn't present, throw (this will happen for invalid names too). If it's present with the value undefined, use the empty string. If it's present with another value, call ToString on the value and use that. (I guess if it's a setter, call it and ToString that).

These semantics are more like runtime errors than early errors, but given RegExp subclassing, it's tricky to do much better. Thoughts?

hashseed commented 7 years ago

I guess the only difference between an early error and a late error here would be:

"abc".replace(/(?<x>x)/u, "$<y>")

Does this throw? An early error would figure out that the replace pattern is incompatible with the regexp. A late error would not occur because there is no match.

I don't have a strong opinion on this.

littledan commented 7 years ago

@hashseed I don't see how we could make that work with @allenwb 's RegExp subclassing design that's in ES2015 and later. There's no API to figure out what named groups a RegExp has--all you can do is look at what properties exist on the match object that gets returned by exec.

hashseed commented 7 years ago

That's true. I wonder whether that's an argument to not throw a SyntaxError, but a TypeError instead.

schuay commented 7 years ago

Throwing early (any time before GetSubstition) does not make sense to me. Seems like that would complicate logic without much clear benefit. +1 to throw when reached.

SyntaxError sounds good as that seems to best represent what actually happened, but no strong opinion against TypeError if you think that is a better match.

I'm not sure whether a reference to a valid but non-existent names should throw at all. No strong opinion here either.

I think we agree on

littledan commented 7 years ago

@schuay So the remaining question is whether the following would throw?

“ab”.replace(/(?<fst>a)|(?<snd>b)/u, “$<snd>”));      // “b” (Valid but unmatched name)

The current spec doesn't throw, but I was suggesting here for it to throw. Anyway, I have no strong opinion either.

hashseed commented 7 years ago

I'd follow the behavior of numbered captures for least surprise. I.e. not throw, and replace with empty string instead.

littledan commented 7 years ago

Should we bother throwing for non-IdentifierName names, or just stick with the current spec? An earlier version threw in that case, and I changed it to ignoring the case based on @schuay 's feedback.

schuay commented 7 years ago

@littledan did you mean 'valid but non-existent' in the comment above?

Thinking about this again, to me the main advantage of throwing for invalid or non-existent captures is to catch user errors early:

str.replace(re, "$<unterminated")  // already throws
str.replace(re, "$<invalid_or_nonexistent>")  // doesn't

An argument against throwing for invalid or non-existent captures is that empty replacements could also be used as a feature, to reuse the same replacement string for multiple regexps with different named captures.

So I think we should either go with the current spec (empty string replacement for all cases), or throwing for both invalid and non-existent names. I'd be ok with going either way.