tc39 / proposal-regex-escaping

Proposal for investigating RegExp escaping for the ECMAScript standard
http://tc39.es/proposal-regex-escaping/
Creative Commons Zero v1.0 Universal
363 stars 32 forks source link

RegExp.escape escaping SyntaxCharacter alone is insufficient #48

Closed michaelficarra closed 11 months ago

michaelficarra commented 3 years ago

If the idea for RegExp.escape is to allow injection in any context, - needs to be escaped in character class context. - is not part of SyntaxCharacter. This is just the first character I thought of needing escaping, and it wasn't escaped, so there's probably others.

benjamingr commented 3 years ago

@michaelficarra can you demonstrate where escaping - is helpful?

michaelficarra commented 3 years ago

> as well for capturing group names. Though that context requires \uXXXX style escaping of anything that doesn't otherwise match RegExpIdentifierName...

benjamingr commented 3 years ago

https://github.com/tc39-transfer/RegExp.escape/blob/master/EscapedChars.md#safe-with-extra-escape-set-proposal

michaelficarra commented 3 years ago

@benjamingr

new RegExp('[' + RegExp.escape('a-z') + ']')

matches all of the letters instead of a, -, and z alone.

michaelficarra commented 3 years ago

@benjamingr Okay, I would prefer the "Safe with extra escape set" proposal described there, then. It seems like that better meets the goal of this proposal as I understood from what was presented at the TC39 meeting today: RegExp.escape output should be safe in any RegExp context. That means outside any context, each code point is matched literally; within a character class each code point is unioned onto the class; in a named capture, each code point contributes to the capture name; etc.

benjamingr commented 3 years ago

@michaelficarra last time I checked (admittedly 6 years ago, you can see the repos scanned and output in the data directory) this sort of use case was very uncommon.

(Note, if we escape - it probably [also makes sense to escape } and ]](https://github.com/tc39-transfer/proposal-regex-escaping/blob/master/data/other_languages/discussions.md#c))

benjamingr commented 3 years ago

I am not objecting to escaping it or the "safe with extra escape set" proposal - I'm only pointing out how it was removed from the set last time we talked about this. I think it makes sense to audit every character we escape (or not) in RegExp.escape given 5 years have passed since the list of escaped characters was made.

( also on an unrelated note - Thank you for weighing in it is appreciated! 🙏 )

Edit: also see prior discussion.

michaelficarra commented 3 years ago

If this proposal was going to go toward a RegExp.escape-style solution, I think the only way it would be acceptable to some of the delegates following this proposal (including myself) is if we could make it so that usage was (and could continue to be) completely context-independent, for all RegExp contexts. We'll have to figure out what that means when injecting arbitrary values into a context with a limited grammar, such as the capturing group names I mentioned above or the repetition numbers (suffixed {x,y}) mentioned in the documentation here.

ljharb commented 3 years ago

One solution someone suggested was to include syntax that would make it invalid in some of the contexts - like appending {1} to ensure you couldn't combine it with a quantity, or if there's something that would make it invalid in a character class.

domenic commented 3 years ago

the only way it would be acceptable to some of the delegates following this proposal (including myself) is if we could make it so that usage was (and could continue to be) completely context-independent, for all RegExp contexts

Isn't that impossible, because of the even-odd problem? That requirement is what sunk the proposal last time from what I remember; the community is pushing to remove this requirement.

benjamingr commented 3 years ago

I think the only way it would be acceptable to some of the delegates following this proposal (including myself) is if we could make it so that usage was (and could continue to be) completely context-independent, for all RegExp contexts.

Why? Isn't solving the problem for all (very) reasonable cases enough?

michaelficarra commented 3 years ago

@domenic I am personally fine with not considering the context following \. I don't think anybody would expect new RegExp('\\' + RegExp.escape('t')) to match a tab.

bakkot commented 3 years ago

I think as a practical question, ignoring the context following \, "escape gives output which works in any context" would mean including the SyntaxCharacter set, plus - for character classes, , for repetition groups, and > for capturing group names. (We'd want to change u-mode regexes to make the escaped versions of those legal as identity escapes, or specify that they map to their \u form.)

@michaelficarra would that work for you?

benjamingr commented 3 years ago

@michaelficarra

I don't think anybody would expect new RegExp('\' + RegExp.escape('t')) to match a tab.

I agree. I also don't think anyone would expect RegExp.escape to work in context sensitive situations though, would they?

I think that's an assumption PHP, Perl, Python, Ruby, Java and .NET made in their implementation of their respective escaping functions. It's also the assumption userland libraries (like lodash) have made.

I am honestly not sure about this - and I can see good points in both approaches. I do however think the bias should be for the approach other implementations we've looked at are doing naturally (in JS, even the userland implementions we've found of a tag function like XRegExp's don't do context-sensitive escaping).

Note that other languages have been working on reducing the set of characters they escape in order to create more readable regular expressions - a good example of this is Python.

I am personally fine with starting with the "Escape Everything" approach that solves most things (but not everything) - but I would still prefer the "escape as little as needed without creating a user expectation of RegExp.escape working on context sensitive scenarios.

Most use cases I've seen were things like:

const names = getArrayOfNames(); // ["John Smith", "Bob D. Goldman", "Zhang Zhu"];
const matcher = new RegExp('(' + names.map(name => RegExp.escape(name)).join(')|(') + ')');

Where context sensitivity was a non-issue.

domenic commented 3 years ago

The specific idea of escaping - has bad interactions with the u flag; see #52.

ljharb commented 11 months ago

Updated escaping semantics advanced to stage 2 today.