tc39 / proposal-regex-escaping

Proposal for investigating RegExp escaping for the ECMAScript standard
http://tc39.es/proposal-regex-escaping/
Creative Commons Zero v1.0 Universal
368 stars 32 forks source link

RegExp Escaping Proposal

This ECMAScript proposal seeks to investigate the problem area of escaping a string for use inside a Regular Expression.

Formal specification

Champions:

Status

This proposal is a stage 3 proposal and is awaiting implementation and more input. Please see the issues to get involved.

Motivation

It is often the case when we want to build a regular expression out of a string without treating special characters from the string as special regular expression tokens. For example, if we want to replace all occurrences of the the string let text = "Hello." which we got from the user, we might be tempted to do ourLongText.replace(new RegExp(text, "g")). However, this would match . against any character rather than matching it against a dot.

This is commonly-desired functionality, as can be seen from this years-old es-discuss thread. Standardizing it would be very useful to developers, and avoid subpar implementations they might create that could miss edge cases.

Chosen solutions:

RegExp.escape function

This would be a RegExp.escape static function, such that strings can be escaped in order to be used inside regular expressions:

const str = prompt("Please enter a string");
const escaped = RegExp.escape(str);
const re = new RegExp(escaped, 'g'); // handles reg exp special tokens with the replacement.
console.log(ourLongText.replace(re));

Note the double backslashes in the example string contents, which render as a single backslash.

RegExp.escape("The Quick Brown Fox"); // "The\\ Quick\\ Brown\\ Fox"
RegExp.escape("Buy it. use it. break it. fix it.") // "Buy\\ it\\.\\ use it\\.\\ break\\ it\\.\\ fix\\ it\\."
RegExp.escape("(*.*)"); // "\\(\\*\\.\\*\\)"
RegExp.escape("。^・ェ・^。") // "。\\^・ェ・\\^。"
RegExp.escape("😊 *_* +_+ ... 👍"); // "😊\\ \\*_\\*\\ \\+_\\+\\ \\.\\.\\.\\ 👍"
RegExp.escape("\\d \\D (?:)"); // "\\\\d \\\\D \\(\\?\\:\\)"

Cross-cutting concerns

Per https://gist.github.com/bakkot/5a22c8c13ce269f6da46c7f7e56d3c3f, we now escape anything that could possible cause a “context escape”.

This would be a commitment to only entering/exiting new contexts using whitespace or ASCII punctuators. That seems like it will not be a significant impediment to language evolution.

Other solutions considered:

Template tag function

This would be, for example, a template tag function RegExp.tag, used to produce a complete regular expression instead of potentially a piece of one:

const str = prompt("Please enter a string");
const re = RegExp.tag`/${str}/g`;
console.log(ourLongText.replace(re));

In other languages

Note that the languages differ in what they do (e.g. Perl does something different from C#), but they all have the same goal.

We've had a meeting about this subject, whose notes include a more detailed writeup of what other languages do, and the pros and cons thereof.

FAQ

Polyfills