tc39 / proposal-regex-escaping

Proposal for investigating RegExp escaping for the ECMAScript standard
http://tc39.es/proposal-regex-escaping/
Creative Commons Zero v1.0 Universal
368 stars 32 forks source link

Consider RegExp template tag instead #45

Closed littledan closed 1 year ago

littledan commented 3 years ago

I would prefer if, rather than pursuing a low-level RegExp.escape feature, we would work on a high-level RegExp templating feature. This version would solve the user-facing problem more directly, avoiding the need to concatenate RegExps in the result, and can help engines avoid re-parsing (by just parsing the shared part once). If any part of RegExp syntax ends up requiring context-dependent escaping, a template constructor could resolve that in a way that context-independent RegExp.escape cannot.

Such a feature could look like this, to find the value of the variable x followed by the value of y with arbitrary whitespace in between:

RegExp.build`${x}\s+${y}/m`

Here, flags are provided after a / within the RegExp (which is of course optional).

ljharb commented 3 years ago

That was the expressed preference 5 years ago, and in that time, no delegate nor anybody in userland seems to have developed a template tag, as a proposal or a library, and userland seems quite convinced that the existing solution does already solve the user-facing problem.

Do you have an example of a library that attempts to solve this? If there are none, I think that does serve as evidence that it would not, in fact, add any value. If there are some, it'd be great to learn about them.

littledan commented 3 years ago

Well, I think we should encourage people who want to do work in TC39 to develop this. I think the only evidence we have right now is that it hasn't been popularized yet; we're allowed to go beyond what popular libraries do.

ljharb commented 3 years ago

Indeed, but the problem that needs solving - and has needed solving for much longer than the 5 years since this proposal was rejected - has been consistently and completely solved by the low-level feature contained in this proposal. No need has manifested for anything more.

littledan commented 3 years ago

Yes, I agree that this problem is extremely important to solve.

oliverfoster commented 3 years ago

History of the tag vs escape quarrel: Present the proposal again?, Results of TC39 presentation, specifically comments such as this, this, this, this, this, this

For what it's worth from me: We currently have a working solution. One in the hand vs two in the bush.

benjamingr commented 3 years ago

I would prefer if, rather than pursuing a low-level RegExp.escape feature, we would work on a high-level RegExp templating feature.

Talking to users, we have found that almost everyone wants a RegExp.escape for the simple cases and no one asked for RegExp.tag, users I've spoken with found .tag very confusing as well.

littledan commented 3 years ago

It would be great if the proposal's README summarized these explanations. I was unaware of some of this context. I would be especially interested in hearing about developers who are confused by RegExp.tag--this is news to me.

To be clear, I'm not opposed to RegExp.escape. I was trying to motivate people to work on a higher-level construct. You've given good context for explaining why that's not a good idea. It's more important to solve this problem one way or another, than in the way that I aesthetically prefer.

ljharb commented 3 years ago

@littledan this proposal is 5 years old. The agenda item is not a proposal, it’s a discussion topic, during which context will be provided. Once this proposal is being actually proposed, then it will contain sufficient context in its documents.

littledan commented 3 years ago

I suggest you add an FAQ topic to explain the context here as well, to help people get prepared for the meeting and avoid repeated questions along the lines I made.

benjamingr commented 3 years ago

I've made a PR here: https://github.com/benjamingr/RegExp.escape/pull/46 just to start a FAQ, contributions welcome.

gibson042 commented 3 years ago

Do you have an example of a library that attempts to solve this? If there are none, I think that does serve as evidence that it would not, in fact, add any value. If there are some, it'd be great to learn about them.

I know of at least XRegExp.build.

benjamingr commented 3 years ago

Wow 9 years ago https://github.com/slevithan/xregexp/commit/47b338da4a03eba60c4b6db22a1adde9d14900bf - also CC @LeaVerou

Edit: also relevant: https://github.com/slevithan/xregexp/issues/103

Edit2: Note that XRegExp.tag does not actually perform context sensitive escaping - it just calls .escape on every substitution.

ljharb commented 3 years ago

It would be really interesting (but likely impossible) to know how much of xregexp usage is .build.

leo60228 commented 3 years ago

I think that having both a RegExp.escape method that tries to handle as many escaping scenarios as possible and a RegExp.literal method providing a simple interface for interpolation would be the best route.

RegExp.escape could possibly be done by having a second argument specifying the regex up until the string being escaped, and if that argument is null or missing it would escape for all possible contexts.

benjamingr commented 3 years ago

@ljharb just a note - xregexp.build doesn't actually do what the suggested thing here does.

@leo60228 yeah and to be clear an .escape proposal doesn't stop any future .tag/.build efforts at all.

phpnode commented 3 years ago

This well-intentioned but entirely wrong idea derailed this proposal 5 years ago and we ended up without a solution at all. Please let's not do this again.

benjamingr commented 3 years ago

@phpnode the way the stage process works is that the committee considers alternatives. I think the best path forward is to make a good faith effort to consider alternatives.

The more I talk to users and language committees of other languages the more I lean towards .escape and not a tagged-template - but I am one (relatively inexperienced in language design) developer. By the way note that the two APIs are not mutually exclusive and nothing prevents .tag to land at some point in the future anyway.

I am frustrated by how things went 5 years ago, given the amount of work I put into this proposal back then. I think a better approach back then would have been a longer and more patient dialog with TC39 to understand how to push this* forward.

Worst case (for people who need RegExp.escape as the API) - if the committee decides against .escape then WHATWG members I spoke with were very optimistic about our ability to add this to the web at which point I will also add it to Node.js. It would be unfortunate to get another cross-platform web API like AbortController rather than something that's part of the language - so I feel strongly that we should explore TC39 standardisation first and do the hard work of getting buy-in from delegates.

*this - an API for escaping regular expressions, not necessarily RegExp.escape or a tagged template.

gibson042 commented 3 years ago

Do you have an example of a library that attempts to solve this? If there are none, I think that does serve as evidence that it would not, in fact, add any value. If there are some, it'd be great to learn about them.

Another example: https://runkit.com/tolmasky/templated-regular-expression-example

const { queryString } = templateRegExp(
{
    key: /[^=#]+/,
    value: /[^&#]+/,
    parameter: /${key}=${value}/,
    queryString: /\?(${parameter})(?:&(${parameter}))*/
});

queryString.test("?a=b&c=d")
benjamingr commented 3 years ago

I'm not sure how that's a demonstration of escaping? That appears to be just a way to create a RegExp from objects (I also have one from ±6 years ago I don't maintain here: https://github.com/benjamingr/js-restructure :)).

Is this it? https://www.npmjs.com/package/templated-regular-expression

gibson042 commented 3 years ago

Yes, that's the package. And it's not intended as a demonstration of escaping, but rather another response to the "Do you have an example of a library that attempts to solve [high-level RegExp templating]?" question above regarding this specific issue, which @littledan raised to "solve the user-facing problem more directly, avoiding the need to concatenate RegExps".

pygy commented 2 years ago

Letting users compose RegExps would be awesome. The RegExp syntax are optimized for small searchers, but the regular formalism makes it perfect for lexers, and that use case is very poorly served by literals. Likewise, backtracking makes sense for small searchers, but you can easily run into ReDOS when writing non-trivial parsers.

For previous art, there's alsoThere's also https://github.com/trusktr/regexr by @trusktr.

I have a personnal preference for JS-based combinators, as seen in compose-regexp. The compose-regexp functions accepts either string or RegExps as arguments, and strings are escaped. For completeness, it would require a charset builder with set operations.

The ${...} syntax for splicing in values doesn't cut it in RegExp context IMO. as both $ and {} already have meanings in RegExp context. Also, how would one interpret splicing values in character class context? That use case would be better served by a dedicated JS API IMO.

Alternatively one could imagine a syntax extension in RegExps for splicing in full patterns (the point of the regular formalism is composition, after all, that is at the core of the definition). Something like

const p1 = RegExp.quantifier("{4}", /\w/)
const p2 = RegExp.intersect(/\p{sc=Latin}/u, /\p{Uppercaser}/u)

const combined = /^(?:(?>p1)|(?>p2))$/

// compare with

const tCombined = RegExp.from`^$(?:{p1}|${p2})$`

// or with combinators

const cCombined = sequence(/^/, either(p1,p2), /$/)

Such a syntax would prevent splicing in character class and {} quantifier context at parse time. These use cases would be better served by a dedicated JS API as shown in the example above.

Edit, a bit rusty :-)

rauschma commented 2 years ago
cben commented 1 year ago

another template tag library: https://github.com/mikesamuel/regexp-make-js

pygy commented 1 year ago

Syntax aside, when composing, one has to take what's spliced in into account, or risk getting nonsensical results. Here's an example:

const ab = /a|b/
const c = /c/

import {re} from 're-template-tag';
console.log(re`${ab}${c}`) // /a|bc/

import {sequence} from "compose-regexp"
console.log(sequence(ab, c)) // /(?:a|b)c/
erights commented 1 year ago

I will object to any proposal that creates a new vulnerability to injection attacks. AFAICT, the means the only acceptable proposal is one based on tagged template literals.

@mikesamuel and I wrote and championed the tagged template literal proposal. We got them added to JS primarily to support writing context aware SAFE escapers of embedded languages. This is exactly the situation we're talking about here. RegExp is precisely such an embedded language, and safe escaping cannot be done in this language without context sensitivity.

At https://github.com/tc39/proposal-regex-escaping/issues/45#issuecomment-1379085585 , @cben links to

https://github.com/mikesamuel/regexp-make-js

This is the one I want to call everyone's attention to. It was written by @mikesamuel with substantial input from me, explicitly for the purpose of doing SAFE context aware escaping. It also accepts RegExp instances as substitution values, expressing nested matchers, not just nested data. This enables powerful composition of RegExps.

I would support a proposal based on this implementation, that preserves both its safety and its compositionality. I would also support a proposal that rejects in more odd contexts, rather than bothering to do context aware escaping for those odd contexts.

pygy commented 1 year ago

@erights could you give an example of an injection attack enabled by a non-ttl solution ?

E.g. how would you craft an attack on compose-regexp?

Edit: looking at the issues over at https://github.com/mikesamuel/regexp-make-js, I see a problem with sequence(/a{/, 1, /}/, which currently returns /a{1}/ rather than /a{(?:)1}/. It can be patched easily (edit2: done as of v0.6.31... | edit3: as of v0.7.0 lone quantifier brackets in input are rejected as syntax errors).

Edit4: @erights beside quantifier brackets and back references, do you have other "weird" scenarios in mind?

ljharb commented 1 year ago

This proposal advanced to stage 2 with RegExp.escape semantics. https://github.com/tc39/proposals/commit/cc2b37dcdbe378539d54b6901577ec5c2888fd3d