tc39 / ecma262

Status, process, and documents for ECMA-262
https://tc39.es/ecma262/

Analysis of how implementations handle the required escaping in RegExp#source #578

Open claudepache opened 8 years ago

claudepache commented 8 years ago

This is a followup of https://bugs.ecmascript.org/show_bug.cgi?id=1470

I’ve made a first quick analysis of how major web browsers implement the not-exactly-specified Step 2 of EscapeRegExpPattern. Recall that, for a regexp rx, we have approximately rx.source = EscapeRegExpPattern(rx.[[OriginalSource]]). That transformation must not change the semantics of the pattern, but it is required so that

    eval("/" + rx.source + "/" + rx.flags)

produces a regexp that is functionally equivalent to rx.
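
To make the requirement concrete, here is a minimal sketch of that round trip. The helper name roundTrip is hypothetical, and indirect eval is used only so that the literal is evaluated in the global scope; the example assumes an engine that escapes / outside a class, which all the browsers examined here do.

```javascript
// Hypothetical helper: rebuild a regexp by evaluating the text of a
// regexp literal assembled from rx.source and rx.flags.
function roundTrip(rx) {
  return (0, eval)("/" + rx.source + "/" + rx.flags);
}

// The pattern contains a "/", which would end the literal early
// if rx.source did not escape it.
const original = new RegExp("a/b", "i");
const rebuilt = roundTrip(original);

console.log(rebuilt.test("A/B"));             // true
console.log(rebuilt.flags === original.flags); // true
```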

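Analysing the grammar that is used to determine the limits of a regexp literal, one can show that it suffices to:

- escape every line terminator (<LF>, <CR>, <LS>, <PS>);
- escape every / that occurs outside a RegularExpressionClass;
- replace the empty pattern with a nonempty equivalent such as (?:).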

(Note that, although /* is parsed as the beginning of a multiline comment rather than of a regular expression, this is not a problem because a regexp can never begin with *.)
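
The claim that a pattern can never begin with * can be checked quickly. This is only an illustrative sketch, and the variable name is an assumption of the example: the RegExp constructor rejects such a pattern with a SyntaxError, so no regexp object can exist whose source starts with *.

```javascript
// A quantifier with nothing before it is a syntax error, so rx.source
// can never start with "*" and "/" + rx.source never starts with "/*".
let leadingStarRejected = false;
try {
  new RegExp("*a");
} catch (e) {
  leadingStarRejected = e instanceof SyntaxError;
}

console.log(leadingStarRejected); // true
```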

The transformations used by the major browsers are detailed below, except that the line terminators are currently not escaped by Chrome (V8 Issue 1982).

| Original source | Transformed into |
| --- | --- |
| <LF> or \<LF> | \n |
| <CR> or \<CR> | \r |
| <LS> or \<LS> | \u2028 |
| <PS> or \<PS> | \u2029 |
| / (outside RegularExpressionClass) | \/ |
| / (inside RegularExpressionClass) | / (Firefox, Safari); \/ (Chrome, Edge) |
| empty pattern | (?:) |
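
The transformations in the table can be sketched as a small scanner. This is a rough illustration, not the spec algorithm or any engine's actual code: the function name escapeRegExpPattern and the escapeSlashInClass option are hypothetical, and class tracking uses a single boolean, which assumes classes are not nested (so it does not cover the v flag).

```javascript
// Escapes for the four line terminators.
const TERMINATOR_ESCAPES = new Map([
  ["\n", "\\n"],
  ["\r", "\\r"],
  ["\u2028", "\\u2028"],
  ["\u2029", "\\u2029"],
]);

// Hypothetical sketch. escapeSlashInClass = false mimics the
// Firefox/Safari column, true mimics the Chrome/Edge column.
function escapeRegExpPattern(pattern, escapeSlashInClass = false) {
  if (pattern === "") return "(?:)"; // "//" would start a comment
  let out = "";
  let inClass = false;
  for (let i = 0; i < pattern.length; i++) {
    const ch = pattern[i];
    if (ch === "\\" && i + 1 < pattern.length) {
      const next = pattern[i + 1];
      // A backslash followed by a line terminator becomes the plain
      // escape; any other escaped character is copied through unchanged.
      out += TERMINATOR_ESCAPES.has(next)
        ? TERMINATOR_ESCAPES.get(next)
        : ch + next;
      i++; // skip the escaped character
      continue;
    }
    if (TERMINATOR_ESCAPES.has(ch)) {
      out += TERMINATOR_ESCAPES.get(ch);
      continue;
    }
    if (ch === "[") inClass = true;
    else if (ch === "]") inClass = false;
    if (ch === "/" && (!inClass || escapeSlashInClass)) {
      out += "\\/";
      continue;
    }
    out += ch;
  }
  return out;
}

console.log(escapeRegExpPattern("a/b[/]"));       // a\/b[/]
console.log(escapeRegExpPattern("a/b[/]", true)); // a\/b[\/]
console.log(escapeRegExpPattern(""));             // (?:)
```

The flag makes the one observed difference between implementations explicit; everything else in the table is the same in both modes.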

It does not seem to me that implementations perform other transformations, but that needs confirmation.


In conclusion, the only major difference between implementations seems to be whether / is escaped everywhere or only outside RegularExpressionClass.

claudepache commented 8 years ago

Personally, I am for not escaping / inside RegularExpressionClass, because that has the property of preserving exactly the source text when it originated from a regexp literal.
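
A small sketch of the property I mean; the variable names are only for illustration, and the result depends on the engine, so no particular output is claimed here.

```javascript
// Text between the slashes of a regexp literal, with "/" inside a class.
const literalText = "[/]";

// Evaluate the literal /[/]/ and read its source back.
const fromLiteral = (0, eval)("/" + literalText + "/");
const literalSource = fromLiteral.source;

// true when the engine leaves "/" inside a class unescaped, i.e. when
// the source text of the literal is preserved exactly.
const preservedExactly = literalSource === literalText;

console.log(literalSource, preservedExactly);
```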

ljharb commented 6 years ago

@claudepache could you perhaps prepare a PR for this?