tc39 / proposal-regexp-x-mode

BSD 3-Clause "New" or "Revised" License
24 stars 2 forks source link

Restrictions on what can go in inline comments #1

Open bakkot opened 2 years ago

bakkot commented 2 years ago

There is a base RegularExpressionLiteral grammar which we almost certainly can't change, and I think that will impose some limitations on what characters can be legal unescaped in comments. In particular, I think [ and] might need to be escaped, since the lexer thinks those represent character classes which must balance.

rbuckton commented 2 years ago

I imagine that comments are more likely to be used via the RegExp constructor than in a RegularExpressionLiteral, though I would like to support (?#) in literals. This might require [] be escaped in addition to ()/ in a literal, though escaping for anything other than () is unnecessary in a raw string passed to the RegExp constructor.

It also doesn't seem like it would be difficult to modify RegularExpressionBody to have an additional balancing construct for (?# and ), though it seemed there was reticence to such a change that I don't fully understand.

bakkot commented 2 years ago

I suspect @waldemarhorwat and @erights are the best people to speak to any potential issues with modifying RegularExpressionBody.

erights commented 2 years ago

Every time there is ambiguity in how to lex/parse JS, we have seen that enable attacks. For example, the ambiguity around html-like comments enabled an attack on Caja. Some approaches to filtering code for some safety relies on an accurate tokenization without requiring a parse. For example, Caja did a conservative scan for all identifiers that might be a use occurrence of a variable. Given accurate tokenization, this is guaranteed to include all use occurrences of free variables. (Caja was actually even more conservative, picking up all identifiers, even in comments and literal strings. But this was expensive. If we had a tokenizers we were confident in, we could have used that.)

There is other old software that relies on being able to tokenize accurately, even if it can no longer parse. For example, some simple syntax highlighters. For example, there is an old macro library for JS that relies on being able to tokenize, followed by being able to bracket match. Bracket matching is a parsing level constraint that has also not been broken by the continuing evolution of JS. In fact, tokenizing template literals requires bracket matching, so stability of tokenization does seem to require the bracket-matching level of parsing as well.

rbuckton commented 2 years ago

In plenary today I mistakenly said, "we should ban x-mode for RegExp literals". However, that would not satisfy the constraint here as x-mode does not affect the behavior of (?#). We could merely require that [ and ] be escaped when in a comment in RegularExpressionBody as is proposed in the OP, but they don't necessarily need to be escaped when passed as a string to the RegExp constructor, just as / must be escaped in a RegularExpressionLiteral but does not need to be escaped when passed as a string to the constructor.

waldemarhorwat commented 2 years ago

Having the RegularExpressionBody grammar not be a cover grammar (modulo /) for the real regular expression grammar would be terrible.

The invariant is that any valid regular expression according to the RegExp grammar that does not contain an internal / must also be a valid RegularExpressionBody expansion. We carefully engineered the v mode to retain this invariant.

Introducing (?#) comments without the requirement of escaping [ and ] will break the invariant. In /ab(?#[)/i]/m, the regular expression should end at the first / but RegularExpressionBody will also swallow up some of the following characters.