tc39 / proposal-regex-escaping

Proposal for investigating RegExp escaping for the ECMAScript standard

Creative Commons Zero v1.0 Universal

363 stars 32 forks source link

Which leading characters should be escaped? #66

Closed gibson042 closed 4 months ago

gibson042 commented 9 months ago

As noted in https://github.com/tc39/proposal-regex-escaping/issues/58#issuecomment-1827185269 , an unescaped non-digit leading character could still be interpreted as part of an escape sequence spanning concatenated RegExp.escape output.

consider e.g. new RegExp("\\c" + RegExp.escape("J")) in a web browser implementing |ExtendedAtom|, where the result should match "\\cJ" and [unlike /\cJ/] fail to match "\n"

bergus commented 9 months ago

The explainer argues that the output of RegExp.escape is not put "in a place where it would obviously mean something else". See also #17 and #29.

gibson042 commented 9 months ago

29#issuecomment-118876667 is a concise explanation of affected scenarios. However, I don't see a clear decision on this question in either that issue or #37.

ljharb commented 9 months ago

cc @erights and @bakkot for thoughts?

bakkot commented 9 months ago

I'm inclined to say that the context after \c is similar to the context after \x, in that if you see either \x${v} or \c${v} you should know that the output of v is at least potentially going to be used as part of the escape sequence. There is no other reason to write those strings.

It's not like \1${v} where the \1 is a coherent thing to write on its own, such that you'd be surprised for the ${v} to become part of the escape sequence.

ljharb commented 9 months ago

That matches my intuition here.

gibson042 commented 9 months ago

I'm inclined to say that the context after \c is similar to the context after \x, in that if you see either \x${v} or \c${v} you should know that the output of v is at least potentially going to be used as part of the escape sequence. There is no other reason to write those strings.

It's not like \1${v} where the \1 is a coherent thing to write on its own, such that you'd be surprised for the ${v} to become part of the escape sequence.

\x and \c are in fact coherent on their own or whenever not followed by content that completes an escape, although I agree that the intent of either is generally (and pretty much exclusively) to express such an escape. But I would nonetheless be surprised if whether \x${v} or \c${v} is an escape depends upon the value of v (and surprised in the same was as \x41 vs. \x4x and \cJ vs. \c=, plus one level of indirection).

bakkot commented 9 months ago

The question is whether that surprise is a problem we should solve. I don't think it is. There is no reasonable expectation for what behavior you're going to get if you write \c${RegExp.escape(v)}.

I think it's fine to say that the output of RegExp.escape is not safe to use in a place where it would obviously mean something else, and that "immediately after \x" and "immediately after \c" are such places.

ljharb commented 9 months ago

Sounds like we're in agreement here, so I'll close this, but will reopen if that's inaccurate.

ljharb commented 5 months ago

Reopening pending confirmation from @gibson042 and @erights.

erights commented 5 months ago

To clarify for those watching this thread. Template tags would enable safe context sensitive escaping. I originally objected to RegExp.escape because it does not. I knew that there was an unsolvable even-vs-odd backslash problem, and thought there were others. After @bakkot clarified that the only such hazard was even-vs-odd backslash, I felt this was safe enough, because the exceptional unsafe case was narrow, easy to state, easy to remember, and easy to check by eyeball. To get there, as far as I am concerned, we gave up on the goal that the output of RegExp.escape be readable.

Later on in the current tc39 meeting, we'll discuss Array.isTemplateObject. Whatever my opinion on this specifically is (which is complicated), I strongly support the motivation for the proposal: To enable programmers (authors and reviewers) to reason clearly about the distinction between the literal parts that were authored as part of the program, vs attacker-controlled less-trusted data handled by that more-trusted program.

Therefore, I feel strongly that the first character needs to be adequately escaped to restore that simple safety property. My concern does not hinge on whether one would write \c${v} on purpose. It might be written by accident, in which case the RegExp may be buggy, in the sense that it does not mean what its author thought it meant. But even that case should not compromise this safety property. Reviewers who don't otherwise care about what the RegExp means should still be able, by eyeball and without an unrealistic memory burden, to reason about the impact of attacker-controlled less-trusted data, simply.