Closed nikic closed 9 years ago
@nikic good question.
At this moment we are not escaping anything that has any context, similarly ?!
isn't escaped, -
isn't escaped and so on.
Basically, whenever you need context sensitivity you'd do:
new RegExp("(foo)\1(" + RegExp.escape(input) +")")
Accounting for all context cases sounds problematic.
@benjamingr I think there is a difference between this case and something like ?!
though: It could only become an issue if you insert into a malformed regular expression, i.e. you'd have to write "(?" + RegExp.escape(...) + ")"
, where the syntax (?...)
by itself (i.e. not followed by :
or any of the other usual group flavors) is not valid.
@benjamingr It also is possible to just add a non-capturing group around the whole; that way it is context-independent. At that point it's then also enough to have the default set of escapes which we have currently. … Except that it will bork with []
character classes.
So, generally, I think that's an unavoidable issue and must be handled by the programmer.
@bwoebi I don't think we have those (non capturing groups).
@nikic yes, this is more like that -
case or other cases where it can change the context of the RegExp.
@benjamingr We don't have non capturing groups, but that's just the "proper" way to delimit that… one also could use .{0}
as prefix, which would work in JS too. But point is, that'd brok again with character classes.
Issue is that it's not really possible to have real context sensitive escaping… the only idea I have for that is how e.g. prepared statements work with mysql… you basically have placeholders in your query/regex and then can context-sensitively escape them.
Can you write a short "proof" illustrating that it is impossible (or hard)
to write escape
in a way that would allow escaping in a way that would
preserve context? (Without non-capturing groups)
Sensitive escaping might be our plan for RegExp.tag in the future (name not final).
On Fri, Jun 19, 2015 at 3:45 PM, Bob Weinand notifications@github.com wrote:
@benjamingr https://github.com/benjamingr We don't have non capturing groups, but that's just the "proper" way to delimit that… one also could use .{0} as prefix, which would work in JS too. But point is, that'd brok again with character classes.
Issue is that it's not really possible to have real context sensitive escaping… the only idea I have for that is how e.g. prepared statements work with mysql… you basically have placeholders in your query/regex and then can context-sensitively escape them.
— Reply to this email directly or view it on GitHub https://github.com/benjamingr/RegExp.escape/issues/17#issuecomment-113504953 .
I don't see what would be hard about it, just escape -
and digits additionally (at least if occurring as first character). I don't think the problem here is technical, it's more a question of whether or not such cases should be supported.
(Where "escaping" for digits would mean encoding them.)
@benjamingr Point is that we cannot reliably prevent that very case @nikic did mention in his initial post. If you insert any characters before the string (note that we cannot escape numbers, that'd mean a backreference), it won't be compatible with a character class as it counts as a symbol there.
@nikic How do you want to escape numbers? By their octal representation? Then RegExp.escape("0") +"0"
will end up as "\300" … character number 192 then. Putting a backslash in front would mean a back-reference… so, what now?
EDIT:
Hmm... okay, we can hex-escape that then… that will work. So, turns out I was wrong a bit and it's indeed possible to properly escape it
@bwoebi As said, by encoding them. I am not terribly familiar with details of JS regex flavor, but presumably it has some kind of fixed-width escape sequence. Probably \xHH
or \uHHHH
.
@nikic yeah, see my EDIT, noticed that just now.
Numbers like 1
can be escaped like [1]
but I don't see a way to reliably escape things.
Adding a capturing group will mess with back-references.
Other solutions seem to have issues with quantifiers and context sensitive inserts. I don't see a way we can solve this from the library level - at least without parsing the RegExp ourselves first.
I'm not sure I fully understand all the issues here, but the sense I'm getting is that maybe just going for an escape-everything approach (#15) would solve things? (EDIT: I saw the doc about how other languages don't like that idea, interesting.)
In any case, I think the important thing to distill out of this thread is a list of failure scenarios for the current spec, and put those into the readme. Then we can judge how problematic those are by asking if they're ridiculous edge cases or actually quite believable.
Python is moving to a escape only metacharacters approach in its new
regex
module (replacing re
) by the way.
https://bitbucket.org/mrabarnett/mrab-regex/src/6193ea4246da272cf18a190c46aa116737067780/regex_3/Python/regex.py?at=default#cl-409
(Courtesy of Martijn Pieters)
What this issue points out is that escaping everything wouldn't work and that escaping needs to be context aware so it is the programmer who has to handle escaping in these cases.
The two other participants are PHP core contributors who I invited to participate in the discussion by the way :)
On Fri, Jun 19, 2015 at 6:13 PM, Domenic Denicola notifications@github.com wrote:
In any case, I think the important thing to distill out of this thread is a list of failure scenarios for the current spec, and put those into the readme. Then we can judge how problematic those are by asking if they're ridiculous edge cases or actually quite believable.
— Reply to this email directly or view it on GitHub https://github.com/benjamingr/RegExp.escape/issues/17#issuecomment-113545168 .
Speaking in formal, the suggestion here is basically this:
REPLACE
WITH
(*) Potential addition: "And c is the first element of cpList"
Btw, just for the record PHP also ignores the issue with backreferences and escapes only meta characters (albeit a larger number). Doing that has a lot of precedent in other languages.
In any case, I think the important thing to distill out of this thread is a list of failure scenarios for the current spec, and put those into the readme.
Definitely @domenic , I'm not sure the discussion has exhausted itself yet though.
@nikic this would help with backreferences but would not help for example in matching sets or in other context sensitive cases - for example "[" + escaped +"]"
It would also have the effect of making the regexp created less readable. Wouldn't it make more sense to just insert an empty capturing group into the RegExp if the first literal matches HexDigit?
@benjamingr @bwoebi Isn't ?:
how you denote a non-capturing group in JS regex? See RegExp.prototype.source
as an example.
@ljharb oh wow yeah, it is. I have no idea how I totally missed that while I used it last week at least twice or when I read the spec for .source
today.
Just prefixing and postfixing the RegExp with (?:)
would not solve all context sensitivity issues. For example, it would not work if you're inside a set match ([ ])
. See any way around it?
@nikic this would help with backreferences but would not help for example in matching sets or in other context sensitive cases - for example "[" + escaped +"]"
I'm not sure I see the issue you're referring to here, could you elaborate? Apart from use of -
in there, but that can be escaped as well.
@nikic as an expert, do you think the readability impact of the resulting regular expression is an important factor? If so - do you think this guarantee is worth it?
Superseding this with https://github.com/benjamingr/RegExp.escape/issues/29
Using something like
new RegExp("(foo)\\1" + RegExp.escape(input))
, ifinput
were to start with a number, this would extend the backreference\1
to something like\11
. Does this need to be accounted for?