Analysis of how implementations handle the required escaping in RegExp#source

claudepache commented 8 years ago

This is a followup of https://bugs.ecmascript.org/show_bug.cgi?id=1470

I’ve made a quick first analysis of how major web browsers implement the not-exactly-specified Step 2 of EscapeRegExpPattern. Recall that, for a regexp rx, we have approximately rx.source = EscapeRegExpPattern(rx.[[OriginalSource]]). That transformation must not change the semantics of the pattern, but it is required so that

    eval("/" + rx.source + "/" + rx.flags)

produces a regexp that is functionally equivalent to rx.
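To make the requirement concrete, here is a minimal sketch of the round trip. The helper name `roundTrip` is mine, not something from the spec, and it returns `null` when the rebuilt literal does not parse, which is what one would expect from an engine that leaves line terminators unescaped.

```javascript
// Hypothetical helper: rebuild a regexp from its own source and flags,
// exactly as in the eval expression above.
// Returns null if the reconstructed literal is not valid syntax.
function roundTrip(rx) {
  try {
    return eval("/" + rx.source + "/" + rx.flags);
  } catch (e) {
    if (e instanceof SyntaxError) return null;
    throw e;
  }
}

// Patterns that exercise each rule; results for "\n" depend on the engine.
const samples = [new RegExp("a/b", "g"), new RegExp(""), new RegExp("\n"), /[/]/];
for (const rx of samples) {
  const copy = roundTrip(rx);
  console.log(JSON.stringify(rx.source), copy === null ? "does not round-trip" : "ok");
}
```

Note that this only checks that the literal parses; functional equivalence still has to be checked by matching against test strings.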

Analysing the grammar that is used to determine the limits of a regexp literal, one can show that it suffices to:

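- escape the four line terminators in all positions;
- escape the character / outside RegularExpressionClass; and
- replace the empty pattern with an equivalent non-empty one such as (?:), because // would otherwise start a single-line comment.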

(Note that, although /* is parsed as the beginning of a multiline comment rather than of a regular expression, this is not a problem, because a regexp can never begin with *.)
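The three rules above can be sketched as follows. This is only an illustration of the minimal escaping, assuming the input is already a valid pattern; the function name `escapeRegExpPattern` and the `LINE_TERMINATORS` table are my own, and no engine is claimed to implement it this way. It tracks classes the way the regexp literal grammar does (no nesting), and leaves / inside a class unescaped.

```javascript
// Replacement text for each of the four line terminators.
const LINE_TERMINATORS = new Map([
  ["\n", "\\n"],
  ["\r", "\\r"],
  ["\u2028", "\\u2028"],
  ["\u2029", "\\u2029"],
]);

// Illustrative sketch of the minimal escaping (assumes a valid pattern).
function escapeRegExpPattern(pattern) {
  // Rule 3: the empty pattern would yield "//", which starts a comment.
  if (pattern === "") return "(?:)";
  let out = "";
  let inClass = false; // inside [...] as seen by the literal grammar
  for (let i = 0; i < pattern.length; i++) {
    const ch = pattern[i];
    if (ch === "\\" && i + 1 < pattern.length) {
      // Existing escape: copy it, but rewrite an escaped line terminator.
      const next = pattern[++i];
      out += LINE_TERMINATORS.has(next) ? LINE_TERMINATORS.get(next) : ch + next;
    } else if (LINE_TERMINATORS.has(ch)) {
      // Rule 1: escape line terminators in all positions.
      out += LINE_TERMINATORS.get(ch);
    } else if (ch === "/" && !inClass) {
      // Rule 2: escape / only outside a class.
      out += "\\/";
    } else {
      if (ch === "[") inClass = true;
      else if (ch === "]") inClass = false;
      out += ch;
    }
  }
  return out;
}

console.log(escapeRegExpPattern("a/b"));  // a\/b
console.log(escapeRegExpPattern("[/]"));  // [/]
console.log(escapeRegExpPattern(""));     // (?:)
```

Already-escaped slashes are copied through untouched, so applying the function to its own output changes nothing.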

The transformations used by the major browsers are detailed below, except that the line terminators are currently not escaped by Chrome (V8 Issue 1982).

| Original source | Transformed into |
| --- | --- |
| `<LF>`, `\<LF>` | `\n` |
| `<CR>`, `\<CR>` | `\r` |
| `<LS>`, `\<LS>` | `\u2028` |
| `<PS>`, `\<PS>` | `\u2029` |
| `/` (outside RegularExpressionClass) | `\/` |
| `/` (inside RegularExpressionClass) | `/` (Firefox, Safari); `\/` (Chrome, Edge) |
| empty pattern | `(?:)` |

It does not seem to me that implementations perform any other transformations, but that needs confirmation.
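For anyone who wants to confirm or update these observations, a small probe along the following lines can be run in each engine. The `probes` list and the `probeSource` helper are hypothetical names of mine; the output is deliberately not shown, since it is exactly what differs between implementations.

```javascript
// Each entry: [label, pattern passed to the RegExp constructor].
const probes = [
  ["<LF>", "\n"],
  ["\\<LF>", "\\\n"],
  ["<CR>", "\r"],
  ["<LS>", "\u2028"],
  ["<PS>", "\u2029"],
  ["/ outside class", "/"],
  ["/ inside class", "[/]"],
  ["empty pattern", ""],
];

// Hypothetical helper: report what the current engine returns for .source.
function probeSource(list) {
  return list.map(([label, pattern]) => [label, new RegExp(pattern).source]);
}

for (const [label, source] of probeSource(probes)) {
  // JSON.stringify makes raw line terminators visible in the output.
  console.log(label.padEnd(16), JSON.stringify(source));
}
```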

In conclusion, the only major difference between implementations seems to be whether / is escaped everywhere or only outside RegularExpressionClass.

claudepache commented 8 years ago

Personally, I am in favour of not escaping / inside RegularExpressionClass, because that preserves the source text exactly when it originated from a regexp literal.

ljharb commented 6 years ago

@claudepache could you perhaps prepare a PR for this?

tc39 / ecma262

Analysis of how implementations handle the required escaping in RegExp#source #578