tc39 / proposal-regex-escaping

Proposal for investigating RegExp escaping for the ECMAScript standard
http://tc39.es/proposal-regex-escaping/
Creative Commons Zero v1.0 Universal

Complement with an instance method (RegExp.prototype.escape) #50

Closed Alhadis closed 3 years ago

Alhadis commented 3 years ago

(Sorry if this is written like an essay; I developed tunnel-vision halfway through writing it…)

Overview

Authors may need to escape a string for piecewise construction. RegExp.escape(…).source is insufficient, because input may not necessarily be a complete, syntactically valid regular expression. Ergo, I suggest providing an instance method that returns an escaped string following the same logic as RegExp.escape:

/regex/.escape(".") === "\\.";

Rationale

The reason I suggest adding an instance method (as opposed to another class method) is so authors can fine-tune how/where characters are escaped (possibly influenced by a well-known @@escape symbol, à la @@replace).

The definition of RegExp.prototype.escape is more-or-less along the lines of:

RegExp.escape = function(input){
    return new RegExp(this.prototype.escape(...arguments));
};

RegExp.prototype.escape = function(input){
    if(this && "function" === typeof this[Symbol.escape])
        return this[Symbol.escape](...arguments);
    return String(input).replace(/[/\\^$*+?{}[\]().|]/g, "\\$&");
};
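
For readers who want to try the fallback branch on its own, here is a minimal standalone sketch. The name `escapeForRegExp` is hypothetical and not part of the proposal; it only reuses the metacharacter list from the definition above and shows the piecewise construction mentioned in the overview.

```javascript
// Hypothetical helper (not part of the proposal); mirrors the fallback
// branch of the sketch above.
function escapeForRegExp(input){
    // Prefix every character in the issue's metacharacter list with a backslash.
    return String(input).replace(/[/\\^$*+?{}[\]().|]/g, "\\$&");
}

// Piecewise construction: the escaped fragment is only one part of the pattern,
// so it could not be produced as a complete regular expression by itself.
const version = "1.0";
const pattern = new RegExp("^v" + escapeForRegExp(version) + "(-\\w+)?$");

console.log(escapeForRegExp("."));       // \.
console.log(pattern.test("v1.0-beta"));  // true
console.log(pattern.test("v1x0"));       // false (the dot is matched literally)
```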

Motivation

Subclasses of RegExp may have different expectations about what characters need escaping (and where). A realistic example is a third-party regular expression library imported as a set of functions, which are wrapped inside a subclass for more idiomatic (object-oriented) use.

Some actual code might make this clearer…

Example 1: Oniguruma

Oniguruma uses `&&[…]` to denote an [intersection range][RE] within a character class, meaning that `[a-z&&[aeiou]]` has two different interpretations depending on the engine that's parsing it.

~~~js
class OnigurumaExpr extends RegExp {
    escape(input){
        input = RegExp.prototype.escape(input);
        return input.replaceAll("&&", "\\&&");
    }
}

/** Return true if input contains an alphabetic character. */
function hasAlphaChars(input, additionalLetters = ""){
    return new OnigurumaExpr(`[A-Z${
        OnigurumaExpr.escape(additionalLetters)
    }a-z]+`).test(input);
}

hasAlphaChars("Café", "éñøüğȟ");   // Harmless
hasAlphaChars("Café", "&&[^a-z]"); // Problematic
~~~

Example 2: Basic POSIX regular expressions (BREs)

In legacy POSIX syntax, `\(…\)` and `\{…\}` have *opposite* meanings to `(…)` and `{…}`, respectively.

~~~js
class BRE extends RegExp {
    [Symbol.escape](input){
        return input.replace(/\\[({})\\1-9]/g, "\\$&");
    }
}
BRE.prototype.escape("\\(A\\)-(Z)+?") === String.raw `\\(A\\)-(Z)+?`;
~~~

ljharb commented 3 years ago

I don't have any interest in expanding the problematic design of RegExp by adding another Symbol lookup, and I suspect that is an opinion shared by many on the committee.

Separately, it wouldn't make any sense to me to have an instance method that doesn't actually care about the instance except to look something up on the constructor.

A static method - whether a template tag or a .escape function - allows for the same customizability with RegExpSubclass.escape or RegExpSubclass.tag, or similar. Code wishing to support the exceedingly rare design pattern of regex subclasses can regex.constructor.escape as needed.
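
A rough sketch of what that static-method approach could look like follows. The class names `EscapableRegExp` and `OnigurumaLikeExpr` and the helper `escapeLike` are hypothetical, and the base class defines its own `escape` so the example does not assume the proposal has shipped in the runtime.

```javascript
// Hypothetical base class standing in for a RegExp with a static escape method.
class EscapableRegExp extends RegExp {
    static escape(input){
        return String(input).replace(/[/\\^$*+?{}[\]().|]/g, "\\$&");
    }
}

// Hypothetical subclass for an engine where "&&" is special in character classes.
class OnigurumaLikeExpr extends EscapableRegExp {
    static escape(input){
        // Reuse the parent's escaping, then neutralise the intersection operator.
        return super.escape(input).replaceAll("&&", "\\&&");
    }
}

// Generic code picks the escaper from whichever class built the regex instance.
function escapeLike(regex, input){
    return regex.constructor.escape(input);
}

console.log(escapeLike(new EscapableRegExp("x"), "a&&b."));   // a&&b\.
console.log(escapeLike(new OnigurumaLikeExpr("x"), "a&&b.")); // a\&&b\.
```

No instance method or Symbol lookup is involved here: the customization lives entirely on the constructor, which is the design being argued for in this comment.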

Alhadis commented 3 years ago

> I don't have any interest in expanding the problematic design of RegExp by adding another Symbol lookup, and I suspect that is an opinion shared by many on the committee.

Forget about Symbol lookups, then. What about simply returning a string with escaped metacharacters? Bindings to a third-party library typically take strings as arguments, and their syntax is rarely compatible with standard regular expressions (think of TextMate grammars, which are commonly powered by Oniguruma).

> exceedingly rare design pattern of regex subclasses

Needing to escape a subset or superset of "special" regex characters isn't "exceedingly rare". Subclasses were only used as an example.

The very least you can do is add an optional parameter to specify characters to exclude from escaping.
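
As an illustration of that request, an escaping function with an exclusion parameter might look something like the sketch below. The name `escapeExcept` and its `exclude` parameter are hypothetical; nothing like this signature appears in the proposal.

```javascript
// Hypothetical signature: `exclude` is a string of characters to leave unescaped.
function escapeExcept(input, exclude = ""){
    return String(input).replace(/[/\\^$*+?{}[\]().|]/g, ch =>
        // Keep excluded characters as-is; backslash-escape everything else.
        exclude.includes(ch) ? ch : "\\" + ch);
}

console.log(escapeExcept("a.b*"));      // a\.b\*
console.log(escapeExcept("a.b*", "*")); // a\.b*
```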

ljharb commented 3 years ago

Are there existing userland patterns in JS you could point to where there's been a need to customize the escaped character list?

Alhadis commented 3 years ago

What are you referring to by "userland", exactly?

The crux of the issue is there's no way to safely return a string that's escaped consistently with RegExp.escape. Unescaping certain sequences can always come afterwards, I suppose.
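
The "unescape afterwards" idea could be sketched roughly as follows, assuming a standard escaper is available as a plain string function. Both `escapeStandard` and `unescapeChars` are hypothetical names used only for illustration.

```javascript
// Hypothetical stand-in for a string-returning escape with the issue's character list.
function escapeStandard(input){
    return String(input).replace(/[/\\^$*+?{}[\]().|]/g, "\\$&");
}

// Hypothetical post-processing step: drop the backslash before selected characters.
function unescapeChars(escaped, chars){
    // Consume escape pairs left to right, so an escaped backslash followed by
    // an escaped "(" is not misread as a single sequence.
    return escaped.replace(/\\([\s\S])/g, (pair, ch) =>
        chars.includes(ch) ? ch : pair);
}

// Target syntax where "(" and ")" are literal when unescaped (as in POSIX BREs).
const escaped = escapeStandard("f(x)+1");
console.log(escaped);                      // f\(x\)\+1
console.log(unescapeChars(escaped, "()")); // f(x)\+1
```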

ljharb commented 3 years ago

I mean, outside the language - typically, a package on npm in common usage.

Alhadis commented 3 years ago

I wouldn't know. It's been years since I've used NPM (or any other package manager) for anything other than globally-installing a command-line tool, so whatever flavour-of-the-month is doing its rounds in the ecosystem at the moment is completely unknown to me.

ljharb commented 3 years ago

Then in the absence of any demonstrated need for this pattern, and given the reasons I've outlined above, I'll close this for now.