tc39 / proposal-regex-escaping

Proposal for investigating RegExp escaping for the ECMAScript standard
http://tc39.es/proposal-regex-escaping/
Creative Commons Zero v1.0 Universal
368 stars 32 forks source link

Should we allow implementations to add to the escaped set? #27

Closed benjamingr closed 9 years ago

benjamingr commented 9 years ago

While researching Perl today we've found an interesting issue:

As an interesting point, RegExp grammar can add flags (like /x in the future) but RegExp.escape will not be able to react to these changes.

In both cases - this means that RegExp.escape can't guarantee the length of the returned string between versions of ECMAScript that make extensions to the grammar.

To me, this is a huge deal. I'm wondering - if implementations are allowed to extend the regular expressions grammar - should they be allowed to escape additional characters and we should make it clear that there is no length guarantee on the returned string?

(No other language has a length guarantee but some have a fixed character set which implies a length guarantee)

ljharb commented 9 years ago

To me, RegExp.escape's only guarantee is that the string is safe to pass into RegExp. The length of the returned string is basically irrelevant to me.

Would not any additions to the RegExp grammar simply require corresponding changes to the RegExp.escape implementation? (assuming it's worded correctly)

benjamingr commented 9 years ago

Would not any additions to the RegExp grammar simply require corresponding changes to the RegExp.escape implementation? (assuming it's worded correctly)

What concerns me is that flags like /x could be added in the future which would, for example, add comments (with a # syntax).

Moreover from what I understand implementations may choose to extend the RegularExpression syntax. Maybe allowing them to escape additional characters and suggesting that an implementation must ensure that the return result is possible to pass to a RegExp constructor?

On Fri, Jun 19, 2015 at 10:02 PM, Jordan Harband notifications@github.com wrote:

To me, RegExp.escape's only guarantee is that the string is safe to pass into RegExp. The length of the returned string is basically irrelevant to me.

Would not any additions to the RegExp grammar simply require corresponding changes to the RegExp.escape implementation? (assuming it's worded correctly)

— Reply to this email directly or view it on GitHub https://github.com/benjamingr/RegExp.escape/issues/27#issuecomment-113607312 .

ljharb commented 9 years ago

That's exactly the kind of wording I'm thinking of. With that requirement in place, a builtin RegExp.escape would always (theoretically) be safe, and a polyfilled one could feature-detect these new forms easily.

benjamingr commented 9 years ago

So something like?

Note: a valid implementation must ensure that the return result does not result in an abrupt completion when passed as the first and only argument to the RegExp constructor. An implementation may choose to escape additional characters in order to maintain this guarantee.

ljharb commented 9 years ago

That's good, but the concern isn't just an abrupt completion, it's also about semantics changing as a new flag is added, right?

benjamingr commented 9 years ago

@ljharb what do you mean?

domenic commented 9 years ago

NOTEs are non-normative, just remove that prefix.

benjamingr commented 9 years ago

Thanks, what about the actual content?

ljharb commented 9 years ago

Since your text says "first and only" argument, and since I can't imagine anyone making any changes to regexes without requiring a new flag, I think that text is definitely sufficient. However, if someone were to make an existing flag do something different (which would also be a breaking change) then this wouldn't cover it - but that's probably not much of a real concern.

benjamingr commented 9 years ago

Yes, maybe it should say "any flag passed as a second argument". I want to be safe against a future /x mode for comments.

On Sat, Jun 20, 2015 at 8:49 PM, Jordan Harband notifications@github.com wrote:

Since your text says "first and only" argument, and since I can't imagine anyone making any changes to regexes without requiring a new flag, I think that text is definitely sufficient. However, if someone were to make an existing flag do something different (which would also be a breaking change) then this wouldn't cover it - but that's probably not much of a real concern.

— Reply to this email directly or view it on GitHub https://github.com/benjamingr/RegExp.escape/issues/27#issuecomment-113793095 .

domenic commented 9 years ago

I agree with @ljharb that it's not just about abrupt completions but also about semantic equivalence. IIRC Ruby had something like new RegExp(RegExp.escape(s)) ~= s as the invariant, which seems more interesting.

benjamingr commented 9 years ago

Yes, I agree. What about:

A valid escape implementation must ensure that the return result holds the invariant that when passed as the first and only argument to the RegExp constructor the resulting regular expression will test true against the input string. That is new RegExp(RegExp.escape(s)).test(s) === true for any string s. An implementation may choose to escape additional characters in order to maintain this guarantee.

Do you see any downside to allowing implementations to escape additional characters? Are we sure we're ok with not giving any length guarantee?

domenic commented 9 years ago

I guess I'm not clear when the normative spec text as-is would not already guarantee that.

benjamingr commented 9 years ago

I guess I'm not clear when the normative spec text as-is would not already guarantee that.

Regarding implementations and regular expressions:

An implementation may extend the ECMAScript Regular Expression grammar defined in 21.2.1, but it must not extend the RegularExpressionBody and RegularExpressionFlags productions defined below or the productions used by these productions.

Now I'm not 100% sure either, but I think it means that regular expression syntax can be extended to a point where other implementations would need to escape more characters.

domenic commented 9 years ago

I guess specifically tie it to that clause then. Or maybe we should just remove that clause; it seems bad for the web in theory, and unused in practice, as far as I know.

benjamingr commented 9 years ago

I've sent a mail to esdiscuss, to be fair I'm not sure why this is the case either.

cscott commented 9 years ago

I suspect the loose spec around RegExp flags is pre-ES6. Now that we can extend built-ins, I'd expect that any engine which wants to add new RegExp syntax would do so as a subclass (and then extend RegExpSubclass.escape appropriately).

benjamingr commented 9 years ago

@cscott ES3 specifies:

An implementation may extend the regular expression constructor's grammar, but it should not extend the RegularExpressionBody and RegularExpressionFlags productions or the productions used by these productions.

Earlier versions of the spec did not have regular expressions. I wonder if any engines still in use in 2015 use this capability.

benjamingr commented 9 years ago

I've sent an email about extensions to esdiscuss: https://esdiscuss.org/topic/forbid-implementations-from-extending-the-regexp-grammar

If we can forbid implementations from extending the implementation on their own and not through the standard then this addition will not be required and the proposal will be simplified.

domenic commented 9 years ago

We should include some note about this in the TC39 presentation. I think it should be part of the proposal to eliminate that language.

As such, I would update the spec document with an appropriate <del>.

benjamingr commented 9 years ago

@domenic good idea, I want to make sure that I didn't miss anything first (like an implementation that utilizes it. I'm not sure how else I can verify that - I checked current and older engines and asked around.

If this is indeed safe to remove - I agree that it's a good idea to tie it to this proposal and add an appropriate <del> clause (and a slide in the presentation).

domenic commented 9 years ago

Yep. And we can drop it if TC39 objects; it's a separable part of the proposal to delete that. But I hope nobody will.

cscott commented 9 years ago

I'd rather put in language limiting extensions to the de facto (?...) extension mechanism (and flags). That also guarantees that RegExp.escape remains safe (since all proposals escape both ( and ?).

Is there any syntax in perlre which gets past our current proposals for RegExp.escape? If so, it may be worthwhile to future-proofescape`. --scott

benjamingr commented 9 years ago

This isn't really about RegExp.escape as an issue. We can always extend RegExp.escape when we make additions to the RegExp grammar as a standard that would not be a problematic issue since we can keep both parts in sync relatively easily.

On the other hand currently implementations are allowed to diverge in their regular expressions which means there is no guarantee that they'll behave the same in some cases. This is really risky for standard behavior and if an implementation decides to use the clause it could create a big pain point for developers while keeping it compliant.

Luckily, implementors are insightful people and implementations don't do that because of precisely those usability implications for end developers.

What I'm suggesting is that we forbid implementations from extending the regular expressions grammar beyond the language specification and outside the process. Basically, we take the ECMAScript regular expression object as specified (minus allowing extensions) as our base line :)

Cheers, Benji

On Mon, Jul 6, 2015 at 5:03 PM, C. Scott Ananian ecmascript@cscott.net wrote:

I think it would be more worthwhile of we tried to draw a compatibility boundary. Taking perlre as a baseline, for example, are there additional characters we should escape in RegExp.escape so that implementations (and the language itself) could add more perlre features without breaking compatibility? The (?...) syntax (and flags) seems to be the de facto extension point, can we protect that more narrowly? --scott On Jul 6, 2015 1:56 AM, "Benjamin Gruenbaum" benjamingr@gmail.com wrote:

So, following work on RegExp.escape [1] I found out that implementations may extend the regular expression grammar in JavaScript [2]. However, when asking esdiscuss and Stack Overflow about it [2][3] it doesn't look like any implementations currently do so (*).

Can we please forbid implementations from extending the regular expression syntax? It seems like this could cause compatibility issues between implementations anyway. We have subclassable RegExp with hooks and symbols in place and implementations that want to provide an extended RegExp can subclass RegExp or propose an extension to the language itself.

[1] https://github.com/benjamingr/RegExp.escape [2] https://esdiscuss.org/topic/why-are-implementations-allowed-to-extend-the-regular-expressions-syntax [3] http://stackoverflow.com/questions/30958288/what-ecmascript-implementations-extend-the-regexp-syntax

(*) some did it to implement ES2015 features before ES2015.


es-discuss mailing list es-discuss@mozilla.org https://mail.mozilla.org/listinfo/es-discuss

benjamingr commented 9 years ago

Benji -- but you've just specifically mentioned that implementations are already using the flexibility provided by the spec to experiment with and implement ES6 features. Why are you going to foreclose that possibility for ES2016+?

The perl community has managed to compatibly extend their regex engine to an almost absurd degree. There are plenty of ways to add features to your regexp that don't break "standard-compliant" regexps. I think that's the spirit of the existing language in the spec, and it should be honored. I'm fine with narrowing the scope, so we know that new flags and characters after (? are "reserved", for example, but I don't see why we have to ban future change outright. --scott

On Mon, Jul 6, 2015 at 10:13 AM, Benjamin Gruenbaum benjamingr@gmail.com wrote:

This isn't really about RegExp.escape as an issue. We can always extend RegExp.escape when we make additions to the RegExp grammar as a standard that would not be a problematic issue since we can keep both parts in sync relatively easily.

On the other hand currently implementations are allowed to diverge in their regular expressions which means there is no guarantee that they'll behave the same in some cases. This is really risky for standard behavior and if an implementation decides to use the clause it could create a big pain point for developers while keeping it compliant.

Luckily, implementors are insightful people and implementations don't do that because of precisely those usability implications for end developers.

What I'm suggesting is that we forbid implementations from extending the regular expressions grammar beyond the language specification and outside the process. Basically, we take the ECMAScript regular expression object as specified (minus allowing extensions) as our base line :)

Cheers, Benji

On Mon, Jul 6, 2015 at 5:03 PM, C. Scott Ananian ecmascript@cscott.net wrote:

I think it would be more worthwhile of we tried to draw a compatibility boundary. Taking perlre as a baseline, for example, are there additional characters we should escape in RegExp.escape so that implementations (and the language itself) could add more perlre features without breaking compatibility? The (?...) syntax (and flags) seems to be the de facto extension point, can we protect that more narrowly? --scott

On Jul 6, 2015 1:56 AM, "Benjamin Gruenbaum" benjamingr@gmail.com wrote:

So, following work on RegExp.escape [1] I found out that implementations may extend the regular expression grammar in JavaScript [2]. However, when asking esdiscuss and Stack Overflow about it [2][3] it doesn't look like any implementations currently do so (*).

Can we please forbid implementations from extending the regular expression syntax? It seems like this could cause compatibility issues between implementations anyway. We have subclassable RegExp with hooks and symbols in place and implementations that want to provide an extended RegExp can subclass RegExp or propose an extension to the language itself.

[1] https://github.com/benjamingr/RegExp.escape [2] https://esdiscuss.org/topic/why-are-implementations-allowed-to-extend-the-regular-expressions-syntax [3] http://stackoverflow.com/questions/30958288/what-ecmascript-implementations-extend-the-regexp-syntax

(*) some did it to implement ES2015 features before ES2015.


es-discuss mailing list es-discuss@mozilla.org https://mail.mozilla.org/listinfo/es-discuss

benjamingr commented 9 years ago

This doesn't look like a big deal. It seemed like I wasn't able to get the point (forbid 'allowed extension' and ask implementations to go through the spec after they extend RegExp or do it in a subclass) across. A shame but not a big issue for this proposal - we probably shouldn't tie this proposal to it.