tc39 / proposal-regex-escaping

Proposal for investigating RegExp escaping for the ECMAScript standard
http://tc39.es/proposal-regex-escaping/
Creative Commons Zero v1.0 Universal
368 stars 32 forks source link

Interaction with backreferences / variable-width escape sequences #17

Closed nikic closed 9 years ago

nikic commented 9 years ago

Using something like new RegExp("(foo)\\1" + RegExp.escape(input)), if input were to start with a number, this would extend the backreference \1 to something like \11. Does this need to be accounted for?

benjamingr commented 9 years ago

@nikic good question.

At this moment we are not escaping anything that has any context, similarly ?! isn't escaped, - isn't escaped and so on.

Basically, whenever you need context sensitivity you'd do:

new RegExp("(foo)\1(" + RegExp.escape(input) +")")

Accounting for all context cases sounds problematic.

nikic commented 9 years ago

@benjamingr I think there is a difference between this case and something like ?! though: It could only become an issue if you insert into a malformed regular expression, i.e. you'd have to write "(?" + RegExp.escape(...) + ")", where the syntax (?...) by itself (i.e. not followed by : or any of the other usual group flavors) is not valid.

bwoebi commented 9 years ago

@benjamingr It also is possible to just add a non-capturing group around the whole; that way it is context-independent. At that point it's then also enough to have the default set of escapes which we have currently. … Except that it will bork with [] character classes.

So, generally, I think that's an unavoidable issue and must be handled by the programmer.

benjamingr commented 9 years ago

@bwoebi I don't think we have those (non capturing groups).

@nikic yes, this is more like that - case or other cases where it can change the context of the RegExp.

bwoebi commented 9 years ago

@benjamingr We don't have non capturing groups, but that's just the "proper" way to delimit that… one also could use .{0} as prefix, which would work in JS too. But point is, that'd brok again with character classes.

Issue is that it's not really possible to have real context sensitive escaping… the only idea I have for that is how e.g. prepared statements work with mysql… you basically have placeholders in your query/regex and then can context-sensitively escape them.

benjamingr commented 9 years ago

Can you write a short "proof" illustrating that it is impossible (or hard) to write escape in a way that would allow escaping in a way that would preserve context? (Without non-capturing groups)

Sensitive escaping might be our plan for RegExp.tag in the future (name not final).

On Fri, Jun 19, 2015 at 3:45 PM, Bob Weinand notifications@github.com wrote:

@benjamingr https://github.com/benjamingr We don't have non capturing groups, but that's just the "proper" way to delimit that… one also could use .{0} as prefix, which would work in JS too. But point is, that'd brok again with character classes.

Issue is that it's not really possible to have real context sensitive escaping… the only idea I have for that is how e.g. prepared statements work with mysql… you basically have placeholders in your query/regex and then can context-sensitively escape them.

— Reply to this email directly or view it on GitHub https://github.com/benjamingr/RegExp.escape/issues/17#issuecomment-113504953 .

nikic commented 9 years ago

I don't see what would be hard about it, just escape - and digits additionally (at least if occurring as first character). I don't think the problem here is technical, it's more a question of whether or not such cases should be supported.

(Where "escaping" for digits would mean encoding them.)

bwoebi commented 9 years ago

@benjamingr Point is that we cannot reliably prevent that very case @nikic did mention in his initial post. If you insert any characters before the string (note that we cannot escape numbers, that'd mean a backreference), it won't be compatible with a character class as it counts as a symbol there.

@nikic How do you want to escape numbers? By their octal representation? Then RegExp.escape("0") +"0" will end up as "\300" … character number 192 then. Putting a backslash in front would mean a back-reference… so, what now?

EDIT:

Hmm... okay, we can hex-escape that then… that will work. So, turns out I was wrong a bit and it's indeed possible to properly escape it

nikic commented 9 years ago

@bwoebi As said, by encoding them. I am not terribly familiar with details of JS regex flavor, but presumably it has some kind of fixed-width escape sequence. Probably \xHH or \uHHHH.

bwoebi commented 9 years ago

@nikic yeah, see my EDIT, noticed that just now.

benjamingr commented 9 years ago

Numbers like 1 can be escaped like [1] but I don't see a way to reliably escape things.

Adding a capturing group will mess with back-references.

Other solutions seem to have issues with quantifiers and context sensitive inserts. I don't see a way we can solve this from the library level - at least without parsing the RegExp ourselves first.

domenic commented 9 years ago

I'm not sure I fully understand all the issues here, but the sense I'm getting is that maybe just going for an escape-everything approach (#15) would solve things? (EDIT: I saw the doc about how other languages don't like that idea, interesting.)

domenic commented 9 years ago

In any case, I think the important thing to distill out of this thread is a list of failure scenarios for the current spec, and put those into the readme. Then we can judge how problematic those are by asking if they're ridiculous edge cases or actually quite believable.

benjamingr commented 9 years ago

Python is moving to a escape only metacharacters approach in its new regex module (replacing re) by the way. https://bitbucket.org/mrabarnett/mrab-regex/src/6193ea4246da272cf18a190c46aa116737067780/regex_3/Python/regex.py?at=default#cl-409

(Courtesy of Martijn Pieters)

What this issue points out is that escaping everything wouldn't work and that escaping needs to be context aware so it is the programmer who has to handle escaping in these cases.

The two other participants are PHP core contributors who I invited to participate in the discussion by the way :)

On Fri, Jun 19, 2015 at 6:13 PM, Domenic Denicola notifications@github.com wrote:

In any case, I think the important thing to distill out of this thread is a list of failure scenarios for the current spec, and put those into the readme. Then we can judge how problematic those are by asking if they're ridiculous edge cases or actually quite believable.

— Reply to this email directly or view it on GitHub https://github.com/benjamingr/RegExp.escape/issues/17#issuecomment-113545168 .

nikic commented 9 years ago

Speaking in formal, the suggestion here is basically this:

REPLACE

  1. If c is matched by SyntaxCharacter then do:
    1. Append code unit 0x005C (REVERSE SOLIDUS) to cuList.
  2. Append the elements of the UTF16Encoding (10.1.1) of c to cuList.

WITH

  1. If c is matched by SyntaxCharacter then do:
    1. Append code unit 0x005C (REVERSE SOLIDUS) to cuList.
    2. Append the elements of the UTF16Encoding (10.1.1) of c to cuList.
  2. Otherwise, if c is matched by HexDigit (*) then do:
    1. Append code unit 0x005C (REVERSE SOLIDUS) to cuList.
    2. Append code unit 0x0078 (LATIN SMALL LETTER X) to cuList.
    3. Append two code units that are the result of converting the numeric code point value of c to a String of two hexadecimal digits. Alphabetic hexadecimal digits are presented as lowercase Latin letters.
  3. Otherwise append the elements of the UTF16Encoding (10.1.1) of c to cuList.

(*) Potential addition: "And c is the first element of cpList"

Btw, just for the record PHP also ignores the issue with backreferences and escapes only meta characters (albeit a larger number). Doing that has a lot of precedent in other languages.

benjamingr commented 9 years ago

In any case, I think the important thing to distill out of this thread is a list of failure scenarios for the current spec, and put those into the readme.

Definitely @domenic , I'm not sure the discussion has exhausted itself yet though.

@nikic this would help with backreferences but would not help for example in matching sets or in other context sensitive cases - for example "[" + escaped +"]"

It would also have the effect of making the regexp created less readable. Wouldn't it make more sense to just insert an empty capturing group into the RegExp if the first literal matches HexDigit?

ljharb commented 9 years ago

@benjamingr @bwoebi Isn't ?: how you denote a non-capturing group in JS regex? See RegExp.prototype.source as an example.

benjamingr commented 9 years ago

@ljharb oh wow yeah, it is. I have no idea how I totally missed that while I used it last week at least twice or when I read the spec for .source today.

Just prefixing and postfixing the RegExp with (?:) would not solve all context sensitivity issues. For example, it would not work if you're inside a set match ([ ]). See any way around it?

nikic commented 9 years ago

@nikic this would help with backreferences but would not help for example in matching sets or in other context sensitive cases - for example "[" + escaped +"]"

I'm not sure I see the issue you're referring to here, could you elaborate? Apart from use of - in there, but that can be escaped as well.

benjamingr commented 9 years ago

@nikic as an expert, do you think the readability impact of the resulting regular expression is an important factor? If so - do you think this guarantee is worth it?

benjamingr commented 9 years ago

Superseding this with https://github.com/benjamingr/RegExp.escape/issues/29