tc39 / proposal-regex-escaping

Proposal for investigating RegExp escaping for the ECMAScript standard
http://tc39.es/proposal-regex-escaping/
Creative Commons Zero v1.0 Universal
368 stars 32 forks source link

Returning escaped input as a string #51

Closed Alhadis closed 1 year ago

Alhadis commented 3 years ago

It should be possible to escape metacharacters without instantiating a RegExp. Writing RegExp.escape(…).source isn't sufficient, because it assumes the result is a well-formed ECMAScript regular expression. Users wanting to construct regexes piecemeal, or those forced to use string values to accommodate foreign regex syntaxes (e.g., Oniguruma/TextMate-compatible grammars), are left with two partial workarounds:

  1. Do something hacky like [...string].map(char => RegExp.escape(char).source).join("")
  2. Write a helper function that reimplements logic defined by the ECMAScript standard, and keep it in-sync with future revisions

Nobody wants to resort to either of those things, but there doesn't appear to be a better way (that won't throw a SyntaxError, that is).

(I originally brought this up in #50, but probably put too much emphasis on the need to escape individual characters, as opposed to having more control over the escaped result. I've rewritten this to be clearer)

/cc @ljharb

ljharb commented 3 years ago

The RegExp.escape solution would produce a string, not a regular expression. Only the RegExp.tag solution would produce an entire regex.

Can you elaborate on the use cases for needing a string?

Alhadis commented 3 years ago

Can you elaborate on the use cases for needing a string?

I listed two use-cases in the OP (underlined below):

It should be possible to escape metacharacters without instantiating a RegExp. Writing RegExp.escape(…).source isn't sufficient, because it assumes the result is a well-formed ECMAScript regular expression. Users wanting to construct regexes piecemeal, or those forced to use string values to accommodate foreign regex syntaxes (e.g., Oniguruma/TextMate-compatible grammars), are left with two partial workarounds:

Are you pressing for actual, working code samples exemplifying the aforementioned use cases?

ljharb commented 3 years ago

The proposal currently has a lot of pushback against the form that returns a string, so it's particularly valuable to have justifications to prefer the string form.

Why would you want to construct a regex piecemeal? (i don't think RegExp would need to concern itself whatsoever with nonstandard grammars).

Alhadis commented 3 years ago

Nowhere did I imply that the proposed semantics ought to change, only that

It should be possible to escape metacharacters without instantiating a RegExp

A solution could take the form of

ljharb commented 3 years ago

Right, but "it should" is a claim, and I'm asking for justification of that claim. Why should it be possible?

Alhadis commented 3 years ago

Why should it be possible?

Imagine constructing several similar (but distinct) regular expressions that share common subpatterns. Some of these patterns are generated from user-supplied input, but also contain actual RegExp syntax. E.g.,

// User-supplied arguments
let titleArgument = "foo";
let searchQuery   = "Something that appears inside a search result";

// Program logic
const title = `(?<title>${ RegExp.escape(titleArgument) })`;
const desc  = `(?<description>.*?${ RegExp.escape(searchQuery) }.*?)`;
const dash  = `\\s*[-–—⸻]\\s*|\\s+`;
const shortSearchResult = new RegExp(`^\\s*${title}${dash}${desc}$`, "gm");
const longSearchResult  = new RegExp(`^\\s*${title}\n${desc}\n{2,}`, "g");
…

If this still doesn't strike you as ample justification, I invite you to rewrite the above using code you feel is more appropriate.

Alhadis commented 3 years ago

@benjamingr Any feedback?

ljharb commented 3 years ago

Hmm. With the tag approach, you'd do the escaping once for each regex, like this:

const shortSearchResult = RegExp.tag`/^\\s*(?<title>${titleArgument})\\s*[-–—⸻]\\s*|\\s+(?<description>.*?${searchQuery}.*?)$/gm`;
const longSearchResult  = RegExp.tag`/^\\s*(?<title>${titleArgument})\n(?<description>.*?${searchQuery}.*?)\n{2,}/g`;

I'm not sure how strong an argument it is that some of the pieces of two different regexes wouldn't have to be repeated.

benjamingr commented 3 years ago

@Alhadis This issue has been open for less than a day. I want to think about this :]

(Thank you for opening an issue and bringing up a concern!)

Intuitively: what you are describing is the "current" RegExp.escape proposal (it returns a string). I am personally a fan of it and think it's the simplest API as well as the one every other programming language I checked picked.

That said, I am honestly very open to alternatives - the fact other languages picked this API doesn't mean it's the best one - so I want to think about this more.


@ljharb the issue is composing these, notice how you ended up inlining title and desc and dash? I think that's the point @Alhadis is trying to make. In order for the code to be similar in terms of composition - .source would have to be used - which would make RegExp.tag potentially unsafe. Intuitively this is actually a pretty important point.

benjamingr commented 3 years ago

Also to clarify @Alhadis - while I work on this proposal the opinion that counts here in terms of what ends up being done is the committee's. I just wanted to make that clear.

I intend to keep working on this and am open to suggestions/ideas but am currently strongly in favour of RegExp.escape (and not tag) but I don't call the shots here.

The person championing this proposal from TC39 is @ljharb (and in the past Domenic who helped a bunch). You can read more on how features get added to JavaScript here.

ljharb commented 3 years ago

I agree, but I think the hard part would be persuasively arguing that it's appropriate to build regular expressions via composition.

Alhadis commented 3 years ago

persuasively arguing that it's appropriate to build regular expressions via composition.

@ljharb Maybe for you, but I doubt you work with regular expressions as frequently and heavily as I do. Trust me on this.

That said, I am honestly very open to alternatives

@benjamingr What about RegExp.from()? If passed a string, it could return a RegExp that matches the argument verbatim. I remember seeing .from() mentioned in another proposal somewhere, so it might be out-of-scope for this proposal.

ljharb commented 3 years ago

@Alhadis it's not just about my trust; it's that all of the TC39 delegates have to be convinced of this too. Either way, a persuasive argument is one that can be conveyed to folks that aren't as familiar with the subject - if it requires the familiarity to understand the argument, it's a much harder thing to convince folks of the necessity of.

Alhadis commented 3 years ago

Alright, but you mentioned here that

I'm not sure how strong an argument it is that some of the pieces of two different regexes wouldn't have to be repeated.

which, yes, is a pretty weak argument for the contrived example I slapped together. 2 patterns isn't the sort of problem I had in mind, though.

Try to imagine 5 or 10 patterns, possibly returned from different class methods, which nobody in their right-mind wants to keep in-sync manually with copy+pasta (D.R.Y. and all that…)

Alhadis commented 3 years ago

Moreover, the term "escape" has a pretty lucid meaning among coders, who probably don't expect a function of that name to also instantiate a RegExp as well. It'd be a bit like encodeURI() returning a URL object instead of an encoded string: it's doing more than what it advertises, which strikes me as a poor design decision.

ljharb commented 3 years ago

Sure, which is why the tag function form wouldn't be named "escape". RegExp.escape would only ever produce a string.

Alhadis commented 3 years ago

IMHO, that's an even worse approach. You're naming a method after its input, not its function. Even with an awareness of template-tagging, I'd read RegExp.tag and think of something related to HTML tags (or IETF language tags, or... really anything that describes what (and not how) the method is used.

It also makes no sense to mandate template literals over other types of strings. Sure, it might be more expressive, but that's no reason to make it so specific.

Alhadis commented 3 years ago

Personally, I'd call it this:

RegExp.for(characters);

Which can be read as:

RegExp.for(/* matching these */ characters);

My second choice is RegExp.from(), but only because it's consistent with other class methods whose names begin with from… (Object.fromEntries, Array.from, et cetera). There's also also the vague connection that each of these methods takes a different object type, transforms or interprets it in some fashion, and returns instance of itself. Though I admit this is a pretty weak rationale—hence I'm leaning strongly in favour of RegExp.for().

benjamingr commented 3 years ago

I would prefer that we focus on the compelling use case for composing strings as regular expressions rather than bikeshedding the name of RegExp.build/tag or RegExp.escape.

I think proposals for alternatives are welcome but should probably go in a separate thread.

msikma commented 3 years ago

I'm a bit confused, maybe someone could clarify. As I understand the proposal, RegExp.escape() does return a string already, right? It's a static method, so you're not instantiating a RegExp to do this.

Is there a change to the proposal where the result of RegExp.escape() is a RegExp object?

oliverfoster commented 3 years ago

The readme is pretty self explanatory. It is a string https://github.com/tc39/proposal-regex-escaping#regexpescape-function

ljharb commented 1 year ago

RegExp.escape, which returns a string, advanced to stage 2 today.