tc39 / proposal-regex-escaping

Proposal for investigating RegExp escaping for the ECMAScript standard
http://tc39.es/proposal-regex-escaping/
Creative Commons Zero v1.0 Universal

Path to Stage 4! #58

Open ljharb opened 1 year ago

ljharb commented 1 year ago

Stage 4

~Stage 3~

~Stage 2.7~

~Stage 2~

~Stage 1~

oliverfoster commented 1 year ago

Congratulations on getting to Stage 2. :+1:

benjamingr commented 1 year ago

Hey, need help on any of these? (I'm happy to try and contribute the test262 tests or amend the spec according to the consensus)

ljharb commented 1 year ago

I'm comfortable handling the spec, but the test262 tests would be most helpful. Be warned, though, they'll need to be extremely rigorous.

jridgewell commented 1 year ago

My review is at https://github.com/tc39/proposal-regex-escaping/issues/60

michaelficarra commented 1 year ago

My feedback:

  1. I don't like how this is introducing identity escapes in Unicode mode for these new punctuators. This wasn't a goal of the RegExp escaping proposal. You can accomplish RegExp escaping with hex escapes and without introducing new features to the RegExp grammar.
  2. There's no reason to define the phrase "the ASCII punctuators that need escaping" if it's just used once, in the algorithm that follows. Just inline the string. Also I don't like how the string is denoted since it contains a ". Can you figure out some other way to denote it?

Maybe a table (code point, hex escape string) would solve both of my issues.

ljharb commented 1 year ago

For the second, I have no strong opinions about how to denote the characters; I can certainly inline the string.

For the first, isn't that required, otherwise this proposal won't be able to produce output that works in both u and v mode?

michaelficarra commented 1 year ago

No, I believe hex escapes are sufficient for this purpose. Do you have a specific counterexample?

ljharb commented 1 year ago

cc @bakkot since this was a result of their research

bakkot commented 1 year ago

Hex encoding works, it just makes the output completely unreadable. I can't imagine we're going to make \& mean anything other than & in u-mode regexps, so I don't think there's much cost in making those legal, and I think we ought to pay that cost for the benefit of more readable output.
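
To make the trade-off in this exchange concrete, the following is a small sketch rather than anything taken from the proposal. It assumes an engine that follows the existing `u`-mode grammar, where `\&` is not an accepted identity escape, and all variable names are illustrative.

```javascript
// Hex escape for "&": accepted both without and with the u flag.
const hexSource = "a\\x26b"; // regex source text: a\x26b
const hexMatchesLegacy = new RegExp(hexSource).test("a&b");
const hexMatchesUnicode = new RegExp(hexSource, "u").test("a&b");

// Identity escape for "&": accepted without the u flag (Annex B grammar),
// but assumed to be rejected with the u flag by current engines.
const identitySource = "a\\&b"; // regex source text: a\&b
const identityMatchesLegacy = new RegExp(identitySource).test("a&b");

let identityUnicodeError = null;
try {
  new RegExp(identitySource, "u");
} catch (err) {
  identityUnicodeError = err;
}

console.log({ hexMatchesLegacy, hexMatchesUnicode, identityMatchesLegacy });
console.log(identityUnicodeError ? identityUnicodeError.name : "no error");
```

The hex form is the less readable of the two when logged, which is the cost bakkot describes; the identity form needs a grammar change before it can be used in `u` mode, which is the cost michaelficarra describes.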

michaelficarra commented 1 year ago

I do not value the readability of the output. This function is meant only for composing and then compiling RegExps.

bakkot commented 1 year ago

Sometimes you have to debug compiled regexps.

michaelficarra commented 1 year ago

You can decompile them and represent them however you like. Paste them into a visualiser.

bakkot commented 1 year ago

That's a bunch more work than doing console.log(regex). Why should anyone have to do that work? That's a very concrete cost to your suggestion and I see approximately no offsetting benefit. What benefit do you see to your suggestion?

ljharb commented 11 months ago

@michaelficarra as far as the spec review goes, I've addressed your editorial comment; can I check you off?

The normative one we can certainly discuss further if needed.

michaelficarra commented 11 months ago

The spec text is fine for what you intended. I still have issue with the escaping design though.

ljharb commented 11 months ago

Thanks! I'll check you off, but can you file a new issue to further discuss the escaping design? That seems like the only obstacle to seeking stage 3.

sophiebits commented 11 months ago

Escaping using hex escapes also has the advantage that it's straightforwardly polyfillable in practice, whereas changing the UnicodeMode grammar would seem to not really be.

gibson042 commented 11 months ago

My review:

The first issue is trivial to fix and the second is purely editorial, leaving only the third as something to actually resolve (potentially even by accepting the risk and making no normative change). In my opinion, this is ready for stage 3.

ljharb commented 11 months ago

First two are fixed; we can discuss the third in #66 but iirc the intention was simply to not support using the output of RegExp.escape in contexts like that.

syg commented 9 months ago

Spec draft is fairly short and I didn't see any glaring editorial issues.

I find the alias definition for "the ASCII punctuators that need escaping" as a separate paragraph that precedes the algorithmic steps strange. Why not have a step that says "Let the ASCII punctuators [...] be the String [...]"?

ljharb commented 9 months ago

Sure, I could do that - happy to go with whatever yall want there.

ljharb commented 7 months ago

ok, this has been reworked with #67 - @jridgewell @michaelficarra @gibson042 @syg @bakkot, can you confirm that you're signed off, assuming you are?

jridgewell commented 7 months ago

LGTM

bakkot commented 7 months ago

LGTM

michaelficarra commented 7 months ago

@bakkot Does the string coercion in step 1 align with our new coercion strategy?

bakkot commented 7 months ago

Oh, good point. This should throw a TypeError on any non-string inputs.
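
The difference between coercing and throwing can be sketched as below. `escapeSketch` is a hypothetical helper, not the spec algorithm, and its escaping step is a deliberately simplified placeholder; only the type check illustrates the point made above.

```javascript
// Hypothetical helper: reject non-string input up front instead of
// coercing it with String(value).
function escapeSketch(value) {
  if (typeof value !== "string") {
    throw new TypeError("escapeSketch expects a string");
  }
  // Placeholder escaping (an assumption for illustration): backslash-escape
  // only the classic syntax characters and "/".
  return value.replace(/[\\^$.*+?()[\]{}|\/]/g, "\\$&");
}

const escapedDot = escapeSketch("a.b");

let nonStringError = null;
try {
  escapeSketch(42); // a coercing design would have returned "42" here
} catch (err) {
  nonStringError = err;
}

console.log(escapedDot);
console.log(nonStringError ? nonStringError.name : "no error");
```

Throwing surfaces accidental calls with `undefined`, numbers, or objects immediately, rather than silently escaping their string forms.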

ljharb commented 7 months ago

ah, good catch. updated in 29b08c3f5c8e450430bf8e5fcfea28a4e0d683e2

michaelficarra commented 7 months ago

LGTM

gibson042 commented 7 months ago

I see one normative issue and a handful of editorial issues:

ljharb commented 7 months ago

Regarding the last item, I'm not sure optimizing for brevity helps readability here.

Regarding item 4, I had "the ASCII punctuators that need escaping" as a dfn, but removed it after https://github.com/tc39/proposal-regex-escaping/issues/60#issue-1916402272 and some editor feedback that a dfn wasn't needed.

Will take a look at the item 1 issue (thanks!) and will fix item 2 soon.

michaelficarra commented 7 months ago

@ljharb The point about an unescaped delimiter (") and a possibly-confusing backslash still stands. We can just construct punctuators by concatenating the non-controversial characters as a string with the individual problematic code units. Thankfully order doesn't matter so it's easy to just stick them on the end.

michaelficarra commented 7 months ago

Let punctuators be the string-concatenation of "(){}[]|,.?*+-^$=<>/#&!%:;@~'`", the code unit 0x0022 (QUOTATION MARK), and the code unit 0x005C (REVERSE SOLIDUS).
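
In JavaScript terms, the same construction might look like the sketch below. The variable name is illustrative; the point is that the quotation mark and the backslash are appended by code unit, so the literal itself never contains its own delimiter or an escape sequence.

```javascript
// The non-controversial characters go in a plain string literal; the two
// problematic code units are appended by number.
const punctuators =
  "(){}[]|,.?*+-^$=<>/#&!%:;@~'`" +
  String.fromCharCode(0x0022) + // QUOTATION MARK
  String.fromCharCode(0x005c); // REVERSE SOLIDUS

console.log(punctuators.length);
console.log(punctuators.includes('"'), punctuators.includes("\\"));
```

Because membership is all that matters, the order of the characters in the string has no effect on the result.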

ljharb commented 7 months ago

ah ok, gotcha, thanks.

ljharb commented 7 months ago

k, that's landed in c95db790abb5182897a73c0270adac4ffecc61b5.

ljharb commented 7 months ago

Filed #71 to handle the surrogate pair stuff.

ljharb commented 7 months ago

@gibson042 @syg latest changes are landed; can you confirm if/that you've signed off?

gibson042 commented 7 months ago

Yep, that resolves the normative issues so works for me. I'll open PRs to demonstrate my editorial preferences.

ljharb commented 7 months ago

@syg given your stamp on #73, does that mean you've signed off?

syg commented 7 months ago

> @syg given your stamp on https://github.com/tc39/proposal-regex-escaping/pull/73, does that mean you've signed off?

Yep, editorially lgtm.

ljharb commented 7 months ago

No advancement this meeting:

Once these are resolved, I'll return and re-ask for 2.7.

michaelficarra commented 7 months ago

@ljharb I imagine it means anything in ControlEscape, which doesn't have any alternative conditional on the UnicodeMode flag.

bakkot commented 7 months ago

Presumably also SyntaxCharacter and /, i.e. those characters which can be used in IdentityEscape even in u-mode RegExps.
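
A quick way to check which escapes are already usable in `u` mode is sketched below. The character lists are written out from the `SyntaxCharacter` and `ControlEscape` productions as I understand them, and the variable names are illustrative.

```javascript
// SyntaxCharacter (14 characters) plus "/": each can follow a backslash as
// an identity escape even when the u flag is set.
const syntaxCharacters = "^$\\.*+?()[]{}|";
const identityEscapable = [...syntaxCharacters, "/"];
const allIdentityEscapesWork = identityEscapable.every((ch) =>
  new RegExp("^\\" + ch + "$", "u").test(ch)
);

// ControlEscape: the letter after the backslash maps to a control character,
// with no alternative that depends on the UnicodeMode flag.
const controlEscapes = { t: "\t", n: "\n", v: "\v", f: "\f", r: "\r" };
const allControlEscapesWork = Object.entries(controlEscapes).every(
  ([letter, ch]) => new RegExp("^\\" + letter + "$", "u").test(ch)
);

console.log({ allIdentityEscapesWork, allControlEscapesWork });
```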

bakkot commented 7 months ago

The other thing in CharacterEscape is \0. \0 is an interesting case, since you can't use it if the next character is an ASCII digit. I would favor not doing \0; \x00 is one of the few hex escapes you don't need to memorize.

waldemarhorwat commented 7 months ago

You can't do \0 for the NUL character because bad things could happen if the escaped string ended with a NUL and got concatenated with something that started with a digit. \x00 is fine for NUL.
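
The hazard can be reproduced with a short sketch. Here `shortEscapedNul` stands in for a hypothetical escape output that used `\0`; it is compared with the `\x00` form when the next character of the pattern is a digit.

```javascript
const input = "\u0000" + "1"; // a NUL followed by the digit 1

const shortEscapedNul = "\\0"; // hypothetical escape output for NUL
const hexEscapedNul = "\\x00"; // hex escape output for NUL

// Without the u flag, \0 followed by 1 is read as the legacy octal escape
// \01 (U+0001), so the pattern no longer matches NUL followed by "1".
const shortFormMatches = new RegExp(shortEscapedNul + "1").test(input);

// With the u flag, \01 is not a valid escape at all.
let shortFormUnicodeError = null;
try {
  new RegExp(shortEscapedNul + "1", "u");
} catch (err) {
  shortFormUnicodeError = err;
}

// \x00 always consumes exactly two hex digits, so concatenation is safe.
const hexFormMatches = new RegExp(hexEscapedNul + "1").test(input);
const hexFormMatchesUnicode = new RegExp(hexEscapedNul + "1", "u").test(input);

console.log({ shortFormMatches, hexFormMatches, hexFormMatchesUnicode });
console.log(shortFormUnicodeError ? shortFormUnicodeError.name : "no error");
```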

mevanlc commented 6 months ago

I've been watching the progress on this repo for a long time now. I just want to say I appreciate the hard work and careful thought everyone has put into this (including the upcoming work as well.) It's been fascinating observing the concerns and challenges that the web environment raises for ECMAScript that were not much of a concern for other languages that have long had the same feature. I am wondering if the repo owner(s) would consider opening the GitHub Discussions tab in order to allow more back-and-forth discussion with fewer concerns about constantly pinging the people watching the Issues? I am sure if anything interesting distills out of Discussions, people will be more than happy to filter the info back into Issues (I don't think there is an expectation that the owners will monitor / reply to discussions, besides the unlikely chance that moderation is needed.) Just a thought.

ljharb commented 6 months ago

I'm not sure there's any way to disable issue notifications while enabling discussion notifications, and I think the same expectations exist for both about owner responsiveness.

ljharb commented 5 months ago

@jridgewell @michaelficarra @gibson042 @syg the proposal has been updated, and a fresh review would be appreciated :-)

jridgewell commented 5 months ago

LGTM

gibson042 commented 5 months ago

Comments:

michaelficarra commented 5 months ago

All the same points as @gibson042, plus there's some wording things that I would change, but they wouldn't hold up stages 2.7/3. LGTM otherwise.