tc39 / proposal-regex-escaping

Proposal for investigating RegExp escaping for the ECMAScript standard
http://tc39.es/proposal-regex-escaping/
Creative Commons Zero v1.0 Universal
363 stars 32 forks source link

Results of TC39 presentation #37

Closed domenic closed 3 years ago

domenic commented 9 years ago

I'm sorry to say that the committee declined to accept this proposal as-is. In the end, the concern (largely driven by @erights, although others were sympathetic) was that escaping cannot be done in a way that is totally safe in all cases, even with the extended safe set. For example,

new RegExp("\\" + RegExp.escape("w"))

is a hazard. (It does not matter that "\\" by itself is not a valid regex fragment. The above does not error to indicate that; it just silently creates a bug.)

Note that even if you attempted to correct this by escaping all initial characters, you then have

new RegExp("\\\\" + RegExp.escape("w"))

as a bug. @erights called this the "even-odd problem."

The general feeling was that to be completely safe you need a context-dependent join operation. The feeling was then that if author code wants to do unsafe escaping, the function is easy to write, but if something is going to be standardized, it must be completely safe. The idea that other languages are not held to this standard did not convince them, I'm sorry to say.

The committee recognized that you might not be willing to do work on a different, more complicated proposal. But, if you were interested, they think that a template string tag solution would be the best to investigate. Such a solution would be able to use the context of each insertion to do a safe join operation between any two fragments, instead of depending on string concatenation. Template strings can also be twisted to work in dynamic situations (e.g. those that this proposal would cover via new RegExp(pieces.map(RegExp.escape).join("/"))) by directly calling the tag function, probably with an adapter to work through the awkwardness of the parameter form for template tags. So this would be strictly more powerful. This was also preferred (for reasons I don't really remember) to a building-block approach of e.g. RegExp.concat plus RegExp.escape (used as RegExp.concat("\\", RegExp.escape(x))).

I'm pretty disappointed by this, and am sorry you and others sunk so much work into it with such an outcome. But, what can we do.

jdalton commented 9 years ago

I think most regexp escape methods are pretty simple. At least the ones in Closure Lib, Dojo, Lodash, Prototype, MooTools, Mout, YUI, & escape-string-regexp are.

This looks like a case of taking something with clear utility, boxing it into a context it wasn't intended for, then dumping it because it didn't fit the box.

WebReflection commented 9 years ago

But, what can we do.

we can ignore this decision and agree on a single standard method we, developers, can use, pointing at the possible footgun in a comment within the shipped code.

We put such utility in a CDN and we live happily ever after.

matthewrobb commented 9 years ago

This is certainly disappointing but I'm glad the results were posted here in this way. I love reading meeting notes but it would be great to have this kind of update on each proposal after discussions.

I'm okay with a template tag but still, this proposal would end up needed within that implementation so why not bring it in?

Doing URL escaping in the browser doesn't save a developer from SQL injection if they did something like the above but no one questions the usefulness of those existing escape functions.

WebReflection commented 9 years ago

Doing URL escaping in the browser doesn't save a developer from SQL injection

exactly, same as escaping HTML, or using even eval ... if you want to hurt yourself feel free to do it.

String concatenation is a reality, RegExp can be created on the fly within a Function too ... what the hack is the point of avoiding methods focused on increasing security?

Between totally unsafe and "maybe some noob will do something silly like that" there is quite an abyss.

Disappointing to say the least. Thanks gosh I don't care about these decisions

stuartpb commented 9 years ago

...if something is going to be standardized, it must be completely safe.

Uh, then why is the RegExp constructor itself standardized?

RegExp.escape is a function that reduces the attack surface of an existing component - if you're savvy enough to know that you need to use it, you're savvy enough to know it has its limits. The alternative to this is going to be more users rolling their own broken versions, thinking that [(*.)] are the only characters they need to escape (because it's the only characters they've ever seen used).

Saying that "avoiding a false sense of security" is going to help anything here is just plain wrong.

There has to be somewhere the general public can complain directly to the people who voted against this. Right?

More discussion at http://discourse.wicg.io/t/regexp-escape-str/832

stuartpb commented 9 years ago

The trouble I see with this is that it all comes down to arguments based on the behaviors of a hypothetical "average" developer nobody involved in the discussion would have any direct experience with, almost by definition.

The point that I would make is, what real code do they see ending up broken, that would not have been broken without its presence? The TC's argument, I feel, is that not having a RegExp.escape provided by the language would somehow drive developers to look up the documentation once they encounter this problem, and become enlightened - but, from my experience, for the type of developer who would think on sight that RegExp.escape is some kind of panacea, that's not what would happen. I find "average developers" (the type who would not realize RegExp.escape would have corner cases) tend to operate more in one of two modes:

And, without being given a RegExp.escape, they would not become enlightened as to how to do it correctly. They would (and do) try to fix it using the exact problem in front of them, using only what they've seen before, with as little thought into other cases where it would be a problem as possible - and that's going to generate a hell of a lot more corner cases than RegExp.escape ever would.

benjamingr commented 9 years ago

Thanks for the update. This is disappointing, I'll read the meeting notes and figure out how to go forward.

I definitely don't think that a completely safe tag is that way forward though.

erights commented 9 years ago

Historical note: When Domenic first proposed this, we all thought that the maximally escaped form could be used reliably. While we thought that, I was in favor of it. The even-odd example took us all by surprise. I changed my mind only once I understood the unfixability of the even-odd problem.

erights commented 9 years ago

I definitely don't think that a completely safe tag is that way forward though.

Hi Benjamin, why not?

RegExp.escape = (str) => RegExp.tag`${str}`;

I emphasize the word "help" above because the claim that we insist on something completely safe is not the way I would put it, because it leads to the misunderstandings seen earlier in this thread. Nothing will prevent developers from writing code unsafe in these regards. We're trying to make the safe way the easy way, so that a set of fairly straightforward design rules generally suffices to avoid these hazards.

The issue is simple. RegExp source text is a little mini programming language with its own escaping rules. We've already heard the following song before: Having functions that provide local escaping fit for some contexts combined with use of string append to compose together source text from fragments. This has led to the nightmare that pervades the net, of escaping errors and injection attacks in html, css, sql, and more. We took a memory safe language, JavaScript, and by use of these manual-escape-and-append patterns, we reiterated the buffer-overflow-like vulnerability at a higher level of abstraction.

Lisp, when embedded languages are composed of S-Expressions, is largely free of this entire class of vulnerability because they compose structure to create structure. All the relevant source text is first passed through a reliable reader and low-level universal parser before this composition happens. Lisp quasi-quote makes this especially pleasant. Of course, we do not have a universal textual syntax like S-Expressions that we're starting from. We created the template string mechanism (and E's earlier quali-literal mechanism) as a universal framework for solving the context-dependent escaping problem for embedded languages, each with their own surface syntax.

erights commented 9 years ago

Btw, in case it wasn't clear, I do not propose that the tag be named RegExp.tag. This is merely a standin for whatever the tag would actually be named.

stuartpb commented 9 years ago

Historical note: When Domenic first proposed this, we all thought that the maximally escaped form could be used reliably. While we thought that, I was in favor of it. The even-odd example took us all by surprise. I changed my mind only once I understood the unfixability of the even-odd problem.

I feel like you've let this flaw blindside you, then. The nature of this kind of channeling attack has been a known issue with escapement for years - while I'm not going to argue against the value of a better model that checks individual RegExp components for validity, I will argue that it's short-sighted to say such a solution is the only adequate one. The more an end-developer (particularly one who's busy and only wants lo learn what's necessary to do their job) has to learn and adapt to make their solution "correct", the more likely it is that they're going to disregard that "correctness" in favor of doing it the maybe-sometimes-wrong way they already understand.

RegExp.escape is something we already get. It's something we already want. It's something we already use, and already works with all the code we already wrote. It's something we already know how to use correctly to prevent channeling attacks - without it in the language, I'm more likely to implement a broken solution, because there's a lot more effort involved in going out and looking up how to polyfill this than in taking half-a-second, deciding "yeah, I'm pretty sure it doesn't matter", and not escaping the string at all. If I had a RegExp.escape at hand that I could take, check I'm not concatenating with anything that might include uncontrolled backslashes, and drop in with 13 keystrokes, this would be a non-issue.

benjamingr commented 9 years ago

@erights

Well, the thing is that RegExp is not the same sort of language as SQL or HTML in this regard, there have been zero "RegExp injection" attacks (in this sense), there is no security guarantee to uphold here. I've been very in favour of contextual escaping for languages where not doing so presents a security risk.

The problem is that making a tag severely complicates this proposal. I mind but am willing to do the extra work on my side, but this means it will much less likely get implemented in browsers, will take a lot more of the TC's time and so on.

So, while a safe RegExp.tag method is interesting and I'm willing to continue exploring it - I feel like it would require a lot of effort to solve a problem a tiny bit better and it would create a less simple API. There is no security risk here like SQLi or XSS - just the lack of support of truly esoteric use cases.

We have scanned the top million websites, have gone through tens of thousands of NPM packages and explored a total of over a billion lines of code. I did not see a single reasonable case that was not caught by the original "SyntaxCharacter" proposal.

This trade between pragmatism and theoretical soundness is something I think ES needs to do here, just like C#, Python, Java, Perl, Ruby and pretty much every other language did before it. Just because a theoretical edge case exists doesn't mean we have to entertain it or focus our efforts based on it.

erights commented 9 years ago

I see here two interesting arguments for advancing RegExp.escape

I understand the force of the first argument and agree that it needs to be addressed. But I do not accept the second argument. A short term fix that become a long term hazard is a common development strategy and often appears in libraries that become de facto standards. When no good long term solution is known at the time, they often become de jure standards as well.

However, when a better long term solution is known at the time, and subsumes all the functionality and convenience of the short term fix, I think the committee would be failing the community to admit the short term fix merely because the good long term solution is more work. Admitting it into the de jure standard means we're likely to be stuck with it forever. We're forever stuck with enough of these already.

On the convenience point, compare

RegExp.escape(str)

and

RegExp.tag`${str}`

Once both exist, is there any reason whatsoever to prefer writing the first to the second?

So, while a safe RegExp.tag method is interesting and I'm willing to continue exploring it

Thank you. I'll be happy to help.

WebReflection commented 9 years ago

I am really having hard time trying to understand what's the tag supposed to solve compared to escape ... this is valid ES6, right ?

new RegExp("\\\\" + RegExp.tag`${str}`)

How does the user that would apparently fail with the initial example, avoid failing with the tag too?

Thanks for any sort of clarification, I know TC39 has always reasons but this time I really can't figure out the what, the where, the when, and the how.

Best Regards

benjamingr commented 9 years ago

@erights

So, while a safe RegExp.tag method is interesting and I'm willing to continue exploring it Thank you. I'll be happy to help.

I'd love that, I think it would be interesting to apply the same formal analysis techniques you use in your work to show that the escaping is sound. I'm not sure it's something I'd put in the world's most popular programming language or that it would make a good API for people - but if you could point me in the right direction and tell me where to start this sounds like an interesting thing to do regardless of whether or not we build a language API out of it.


The arguments for RegExp.escape as I see it are:

An idea was brought up here before (I think by @cscott) to support a second argument in RegExp.escape allowing more escaped characters (this is the standard API in some other languages, namely PHP). This means that the RegExp.escaped (tag) proposal can be easily built on top of it. Would this meet your concerns about contextual escaping?

benjamingr commented 9 years ago

@WebReflection a tag solution would presumably look like:

new RegExp(RegExp.escaped`\\\\${"w"}`);

A cursor would walk through the RegExp, keep track of its state and choose not to escape the string in this case because it is aware there is no need for an escape. This fails for dynamic length regular expressions which are actually one of the primary use cases here (Presuming we don't introduce a whole context-free escaping language here with iteration :D).

bergus commented 9 years ago

@WebReflection the proposed solution would look like

 RegExp.tag `\\${str}`

instead of new RegExp("\\\\" + RegExp.tag${str})

matthewrobb commented 9 years ago

We created the template string mechanism (and E's earlier quali-literal mechanism) as a universal framework for solving the context-dependent escaping problem for embedded languages, each with their own surface syntax.

So why isn't regex literal syntax being deprecated? I understand and agree with the idea but this is cause diverging paths for how to use a feature with baked in syntax support. Rather than a tag (or in addition to) add an escaped-interpolation syntax to regex literals.

erights commented 9 years ago

So why isn't regex literal syntax being deprecated?

As far as I can tell, in the entire history of JavaScript after 1999 we have only succeeded at deprecating one bit of API: function.{caller,arguments} and arguments.{caller,callee}. And they are still dying a very slow death. I hope to outlive them, but we'll see.

Our inability to effectively deprecate is exactly why we should be careful about admitting short term fixes into the standard.

erights commented 9 years ago

I say "after 1999" because that was the year ES3 was ratified. Whether the statement is still true from earlier, I have no idea.

stuartpb commented 9 years ago

If the TC is looking to build something safe and Lisp-like into the language to work around the problems with RegExp I suggest adding a module for constructing a parsing expression grammar.

WebReflection commented 9 years ago

so we are all presuming that if it's a tag no user will ever do the mistake we are supposing the same user would do with escape ?

If the concern is a user that does this:

new RegExp("\\" + RegExp.escape(str))

why there is no concern on a user that could just do this instead?

new RegExp("\\" + RegExp.tag`${str}`)

It is a silly thing to do in the first case, I don't see why the second couldn't be as silly as the first one.

Whoever would write the first case intentionally will eventually write the second one based on tags too.

Whoever would move the text inside the tag could eventually do the same with RegExp.escape

So instead of new RegExp("\\" + RegExp.escape(str)) a user could do new RegExp(RegExp.escape('\\' + str))

I think the concern is a bit overrated in this thread, but I also believe TC39 won't change its mind so ... I will keep reading curious, and I will remain disappointed a part.

erights commented 9 years ago

If the TC is looking to build something safe and Lisp-like into the language to work around the problems with [...] I suggest adding a module for constructing a parsing expression grammar.

Funny you should mention that. I replaced your "RegExp" with "[...]" above because the issue is more general. We need to make it easier to define tags for parsing new template languages. See https://github.com/erights/quasiParserGenerator . But please keep in mind that this is currently just an experiment not yet ready for serious use.

erights commented 9 years ago

so we are all presuming that if it's a tag no user will ever do the mistake we are supposing the same user would do with escape ?

No. Earlier I wrote:

Nothing will prevent developers from writing code unsafe in these regards. We're trying to make the safe way the easy way, so that a set of fairly straightforward design rules generally suffices to avoid these hazards.

erights commented 9 years ago

I think it would be interesting to apply the same formal analysis techniques you use in your work to show that the escaping is sound.

Actually, all the formal work on these issues is @mikesamuel 's. Mike, care to comment?

benjamingr commented 9 years ago

@erights addressing the other issues raised in that post would be nice:

  • It's a de-facto standard, as @jdalton pointed out there are libraries with tens of millions of downloads that do this (with the less-strict escape set).
  • It's a real problem programmers have been having for years, and we can't up with non-contrived cases it does not solve (and boy we tried). Moreover exploring a huge number of code bases does not show anyone doing anything remotely close to "\\" + escape("w") - so it's a problem that exists in theory.
  • It's a simpler API for users, a fixed escape set is easy to reason about and we're only claiming that escape(s) matches s which matches the vast majority of use cases.
  • It's shimmable very easily, so everyone using non ES2015 code bases can use it. I get this might not a hold of weight for the TC but it means a lot to a lot of developers.
    • Composing a dynamic number of strings. This is probably the most common use case for RegExp.escape actually - building a routing table or a search pattern for multiple strings.

And the two parameter solution:

An idea was brought up here before (I think by @cscott) to support a second argument in RegExp.escape allowing more escaped characters (this is the standard API in some other languages, namely PHP). This means that the RegExp.escaped (tag) proposal can be easily built on top of it. Would this meet your concerns about contextual escaping?

erights commented 9 years ago

I disagree with

It's a simpler API for users

It is simpler to implement, but is no simpler for users.

From all the other points, it seems the de-facto process is successfully doing what you want. Given that, why do you need TC39 de-jure blessing?

I don't understand the two argument solution. Perhaps if you explained how the template tag could be easily built on top I would understand it better. As to whether this would change my mind, I'll wait until I understand before reevaluating.

mikesamuel commented 9 years ago

Sorry I'm late to this thread.

If people don't mind, I'd like to try to summarize the discussion so far to make sure we're on the same page.

  1. Some language support for dynamically created RegExp would help avoid errors.
  2. Some of the dynamic parts of RegExps come from untrusted inputs.
  3. There is no single RegExp.escape function we could define that would make composing regexs significantly less error-prone.
  4. A string template for RegExpr concatenation could take that into account.

It seems there are a number of unique contexts in regexps.

  1. Character sets: /[...]/
  2. After a \ which we could assume starts a back-reference: /(#+)bar\.../
  3. Possible, in a flag set. /foo/g...
  4. Inside a count: a{1,...}
  5. Where a non-capturing group like (?:xyz) can appear

so we might escape x differently in RegExpfoo${x}bardifferently from `RegExp`[a-zA-Z${x}].

And a number of kinds of inputs. A. Integral or decimal string B. String that is not an unsigned decimal number C. A RegExp instance

For example, RegExpfoo${ /[abc]/ }barmight be different from `RegExp`foo${ "[abc]" }bar and since \b means something different in string literals vs regex literals, we might treat RegExpfoo$( "\b" }differently from `RegExp`foo${ /\b/ }.

I've probably missed some, but trying to fill in a matrix might help see what feels dodgy as far as the principle of least surprise.

MATRIX String Integral RegExp
[...] individual characters as string form if it's a range inline it
\... consume \ and escape back-reference consume \ and inline
flags append to flags error error
count error inline error
inline escape within (?:...) as string inline within (?:...)
benjamingr commented 9 years ago

@erights

From all the other points, it seems the de-facto process is successfully doing what you want. Given that, why do you need TC39 de-jure blessing?

Because that's a big part of what a specification body does - spec de-facto solutions de-jura. Why specify String.prototype.includes or Array.from or promises? Most of the specified API has good user-land solutions - that doesn't mean we shouldn't standardise them. Even if they're not "interesting" they are still immensely useful to developers.

I don't understand the two argument solution.

It's pretty simple, RegExp.escape(string, additionalEscapes) where additionalEscapes is either a string or a RegExp of additionally escaped code points. That has the benefit of standardising the de-facto standard and we can later build the RegExp.escaped tag on top of it.

benjamingr commented 9 years ago

@mikesamuel

Thanks for that. I prefer we do not use the terminology "untrusted", there is no issue of trust or security being discussed here. Support for dynamic RegExp is already in the language and has been for more than 15 years, we're just discussing escaping strings to be taken literally as a RegExp.

In any case, I'd be interested in a PoC for a safe RegExp.escaped tag, I'm not convinced it's a good API for JS since it's solving a much bigger use case than people need. The fact no one has created anything like it in any other language while escape is omnipresent in other programming languages is a strong indication it has no strong practical use case. It would be really cool to work on regardless but I'm very skeptic about the inclusion of something like it in the standard.

Also, we can't escape things with capturing groups since that'd interfere with a lot of things like numbering, if we could we would not have the edge case Mark describes as the even-odd problem.

cscott commented 9 years ago

Having read through the discussion this far, I'll say I'm sympathetic to the idea that a strong set of regexp composition operators could/should be the underlying primitive: I proposed RegExp#alt in https://github.com/benjamingr/RegExp.escape/issues/29#issuecomment-117765112; for a complete set you'd also need #concat, #mult (with multiplicity), and #group (with options); see https://esdiscuss.org/topic/regexp-escape#content-36 .

If these operators allowed both RegExps and strings, then you could get your escape function implicitly without exposing it in an unsafe manner -- each result would be a fully-formed RegExp without dangling backslashes, etc. I think you'd probably need a way to construct a character class regexp from a string as well, which is a use case covered by our original RegExp.escape but which could be done "safer" by a dedicated method. (If you added RegExp.class(String), you'd also need methods to add/remove/invert character classes; this is trivial (but "unsafe") using RegExp#escape but you'd need dedicated methods and perhaps a new RegExpCharClass subclass of RegExp in the safe version.)

I'm not as convinced by the tag argument. I like the tag mechanism (and the es-discuss thread mentioned above even provides sample code), but I don't yet see how it handles certain use cases discussed for RegExp.escape (and which were solved by some of the larger escape sets).

Some use cases:

Further typical uses presuppose we have named groups at some point, otherwise pulling out the results of a complicated regexp operation could be error-prone:

function matchTrailingId(str, r1) {
  // Using tag syntax, but the same problem holds for all three proposed methods
  var result = re`${r1} ([a-z][a-z0-9]*)`.exec(str);
  return result[1]; // oops! what if r1 has a group?

Perhaps we need named backreferences, too:

r = re`${r1} (apples|oranges) \1`;

...although this particular case could probably be handled by a sufficiently sophisticated concatenation implementation.

Ok. So having explored the solution space for a bit, let me advocate for spec'ing out the concat/alt/mult/etc composition methods first as fundamental. (If RegExpCharClass is a subclass of RegExp this would be an interesting first attempt at using the new subclassability and thinking through the implications.) The re tag (whatever it's called) can be easily built on top of the fundamental composition methods, and composition methods taking string arguments seem to handle all the use cases for escape in a safe manner... albeit at a cost of >= 4 new API methods (exact count depending on how you handle character classes) instead of 1.

jdalton commented 9 years ago

I feel like things are scope creeping again.

The de facto method escapes ~14 characters. That's it. It's simple, it's popular, it has an understood use case. Other stuff could be considered extraneous.

benjamingr commented 9 years ago

In case it wasn't obvious from my comment here I'm with @jdalton . Completely safe escaping is interesting and I'd love to research it. It's not what I believe developers need right now in JavaScript and the fact we've investigated over a million code bases and the most minimal spec still covers all of them is a strong indication of it.

domenic commented 9 years ago

@matthewrobb

I'm okay with a template tag but still, this proposal would end up needed within that implementation so why not bring it in?

It's not clear that's true. The escaping at the edges of such strings would be quite different than that produced by this proposal. At some point the amount of reused code is very little.

And besides, TC39 does not want to expose building blocks if they are footguns.

Doing URL escaping in the browser doesn't save a developer from SQL injection if they did something like the above but no one questions the usefulness of those existing escape functions.

Actually, they do. Part of the reason tagged template strings were introduced was precisely because the existing functions are hazardous. It's the same reason you don't use mysql_escape and string concatenation; you instead use parametrized queries.

WebReflection commented 9 years ago

It's the same reason you don't use mysql_escape and string concatenation

sice we talk security, the right one would eventually be mysql_real_escape_string but parameters over multiple calls and concatenation are actually simply handy and easier to read, plus the result is usually less error prone so as side effect safer.

Here I see the whole thing ending up with libraries and utilities creating what Mark mentioned already

RegExp.escape = (str) => RegExp.tag`${str}`;

so I am not sure at the end of the story what would be the benefit for all of us: developers and TC39 effort to complete and standardize the tagged version ( while developers meanwhile would have learned that this or that utility do "the same" )

I guess it's a matter of pragmatism. Again I usually get TC39 intent, this time in my opinion went probably a bit too far. It could have been simple, and eventually improved if cases of footguns would start coming out.

Meanwhile the footgun is the only solution we have to escape safely, and through non native, or even naive, implementations.

matthewrobb commented 9 years ago

I guess it's a matter of pragmatism.

It seems as though there are two forces always in play when it comes to this stuff

  1. What should the future look like? (idealistic)
  2. Where do we go from where we are today? (pragmatic)

It appears as though this decision was made favoring 1 over 2.

Meanwhile the footgun is the only solution we have to escape safely, and through non native, or even naive, implementations.

^yes

cscott commented 9 years ago

I still don't see how the tag-based solution handles "create a regexp matching this array of strings", except perhaps via repeated concatenation:

r = arr.reduce((p,c) => RegExp.tag`${p}|${c}`, RegExp.prototype);

That implementation seems very likely to be $O(n^2)$, for an array of $n$ strings. Am I wrong?

domenic commented 9 years ago

You assemble the arguments list dynamically and call RegExp.tag as a function instead of using it with backticks.

cscott commented 9 years ago

@domenic you mean:

r = RegExp.tag(arr.map((_,i) => i ? '|' : ''), ...arr);

I guess that works.... but yuck.

I much prefer:

r = new RegExp(arr.map((s)=>RegExp.escape(s)).join('|');

but I guess that's a matter of taste.

(Well, arr.reduce((p,c) => p.alt(c), RegExp.prototype); is nice, too, but I worry it might have the same $O(n^2)$ problem.)

ps. I wonder whether RegExp.escape = (x) => RegExp.tag('', x).source; might become common? And if so -- what should it return? Aren't we just kicking the "set of escaped characters" bucket down the road?

ghost commented 6 years ago

What I don't understand about this discussion:

Shouldn't it be super obvious you can't concatenate RegEx.escape strings or insert parameters into them willy-nilly? It could also be easily added into the documentation to point this out, just to make it super obvious. I've never seen software attempt to do that, and as long as this corner case is documented, I don't understand how RegEx.escape is insecure at all. It simply isn't suitable for all complex use cases, but neither is it in other languages - and as long as it is used inside its limitations (which satisfy 99% of all use cases) it's completely fine.

As for the tagging solution, I think that would be nice as an additional(!) mechanism. However, I think the need for parameters in such escaped regex strings is quite overestimated in this discussion here. I've never seen or encountered a case where this was ever needed - either it's usually a prebaked safe expression with custom parameters, or the expression is user-provided and needs to be RegEx.escape and can't have parameters put into it as well. I've personally never seen a case that requires both, which is all this discussion appears to be about.

TL;DR: I think this is a nice theoretical issue, but it seems to me this discussion is severely over-complicating things, assuming a use case I've never seen. Also, IMHO developers should be trusted to understand the limitations of the API if those are well-documented and seen in all other languages similarly.

nicolashenry commented 5 years ago

I am not sure if it has already been discussed here but it could maybe easier to introduce parameterized regexps instead of trying to escape it manually with string concatenation?

For example by introducing a new kind of non-capturing group (the following would give an invalid group error currently):

const stringParam = 'myString';
const myParameterizedRegexp = /.*(?$stringParam).*/;

which could be done like this through the constructor:

const stringParam = 'myString';
const myParameterizedRegexp = new Regexp('.*(?$stringParam).*', { stringParam });

And it would maybe also enable regexp compositing which would be really convenient to reuse regexp in other regexps:

const regexpParam = /mySubRegexp/;
const myParameterizedRegexp = /.*(?$regexpParam).*/;

I guess that this is not so easy to implement but this is something which I really miss in the "RegExp world".

rugk commented 5 years ago

@nicolashenry As nobody replied to your proposal, maybe it's a better idea to open a new issue?

Ediz: Or wait, this here is still open…

ghost commented 5 years ago

Well that would be very useful, but it is still way more complicated to understand than a RegExp.escape, @nicolashenry .

So I think the people involved here into this decision really should do some reflection and possibly reopen the process after readjusting the unattainable goals they have given themselves, which overcomplicated this way beyond a reasonable to implement or use API. (which, no offense, is also kind of the problem with your proposal regarding this. Nice idea, some people certainly will use it, but way overshooting a simple & understandable escaping function)

If the involved people are going to keep not dealing with the reasonable arguments, at least consider this one: these functions exist already, and they are already being used and always will be, manually written, which means they will always either be massively underperforming due to conservatively over-escaping for all javascript implementations, or have big security holes due to missing something. Is this really where we want to be at?

Edit: sorry, scrolling up I realize this may be somewhat in the wrong place to complain about the original decision process in a way that actually reaches the involved people. Does someone know where this feedback could be directed?

arcanis commented 5 years ago

I understand the force of the first argument and agree that it needs to be addressed. But I do not accept [the extra work and delay for getting RegExp.tag specced and implemented]. A short term fix that become a long term hazard.

@erights In the four years since this discussion happened, how many new codebases have started to implement their own unsafe escape, or perhaps no escape at all? How many new ones will start doing it until a new proposal is done or this one resumed? Where do you put the limit?

I'm all for finding the right solution, but advocating against one of them on the basis that a theoretical better one might exist without ensuring that the resources to reach this improvement are there seems much more problematic to me.

This is particularly frustrating to me that I see no reason why RegExp.escape would be conflicting with a tag approach. The tag approach, if it ever comes to life (and it seemingly won't), could have leveraged RegExp.escape to escape its variables, and applied an extra logic on the boundaries. You conflated the two, and now we have none.

l-cornelius-dol commented 3 years ago

I came here from MDN which provides an ad-hoc escape function and mentions this defeated proposal. All I can say is I'm stunned at TC39's decision on this and, in my opinion and considerable experience, @erights is just plain wrong. More specifically, Mark does raise some valid concerns, but is wrong on the merits.

I do agree that the committee wants to avoid standardizing bad proposals, but I think that in this case they are wrong in the reasoning as to why this is a bad proposal. No matter how idiot-proof the API, the world will produce idiots capable of misusing it -- and that kind of idiot is I think is what is wrongly being protected here at the expense of developers around the world.

In summary, the affirmative arguments offered here greatly outweigh @erights arguments against them.

LazaroOnline commented 3 years ago

The responsibility of RegExp.escape() has been misunderstood to a point that the committee gave it so many responsibilities that it is almost impossible to implement convincingly, when in reality what we developers need is quite simple.

The REAL RESPONSIBILITY should be this:

Given any string, it should return a pattern that matches exactly those characters, ESCAPING ALL special characters.

The main unit test required to validate the implementation would be:

RegExp(RegExp.escape(inputString)).test(inputString)  // true

For any text in inputString the result should be true.

Apart from that, the only foreseeable room for discussion would be whether or not to escape some special characters that only have special meaning inside of wider patterns, like "]" or "-". Given that a string is often regarded as a bag of individual characters, it makes sense to also aim to pass other unit-test for common scenarios like this:

RegExp("(["+RegExp.escape(inputString)+"]+)").test(anySingleCharFromTheInputString)  // true

That is basically what MDN successfully did in their example here. As a reference for other options see this stack-overflow.

Regarding the even-odd issue, It falls OUTSIDE of the responsibility of a function whatever MODIFICATIONS you do to the result. Preventing users from accidentally misusing it is a NICE-TO-HAVE feature, not a requirement. If they want to make a function that is aware of how you are concatenating or what hat you wear to the office that day by creating a possibly over-engineered solution, then call it something else and do it or don't do it independently and after implementing the long awaited simple Regex.escape() proposal.

The committee should re-consider the proposal in favor of the entire world, to make javascript a safer language by making it easier to write safe code.

benjamingr commented 3 years ago

@LazaroOnline see the discussion in https://github.com/benjamingr/RegExp.escape/issues/43

benjamingr commented 3 years ago

I think this issue is fine to close now :]

l-cornelius-dol commented 3 years ago

I think this issue is fine to close now :]

Could not disagree more.

benjamingr commented 3 years ago

@lawrence-dol please enlighten me why oh why the "result of the TC39 presentation from 5 years ago" should remain open in my repo and telling people to instead check the issue pointing to a new TC39 champion wanting to pick this up and an alternative path of how we're going to get this (or something better anyway) is somehow bad?