tc39 / proposal-regex-escaping

Proposal for investigating RegExp escaping for the ECMAScript standard
http://tc39.es/proposal-regex-escaping/
Creative Commons Zero v1.0 Universal
363 stars 32 forks source link

Consider leading flags instead of `/${string}/flags` for RegExp.tag #54

Closed rbuckton closed 11 months ago

rbuckton commented 2 years ago

The RegExp modifiers proposal originally included an unbounded (?ims-ims) operator, but that was recently dropped from the proposal. Some RegExp engines support a version of (?ims-ims) that can only appear at the start of a regular expression and applies to the entire pattern.

I've been considering re-introducing this prefix-only form in a follow-on proposal, and it could be helpful here as well:

RegExp.tag`(?ux)
  # allows x-mode comments
  \u{000a} # and unicode-mode code points
`;

const pattern = "(?i)test";
const re = new RegExp(pattern);
re.ignoreCase; // true
re.test("TeST"); // true

The prefix-flags form could theoretically allow all RegExp flags, not just the restricted subset in the Modifiers proposal. It would also remove the necessity for RegExp.tag to require leading and trailing / characters, and could even improve composability:

RegExp.tag`(?x)
  # escaped, case sensitive
  ${string} 

  # nested RegExp is *not* escaped. Supported flags are preserved. Unsupported flags are ignored.
  ${caseSensitive ? /Z/ : /Z/i} 
`;

In this case, flags from nested RegExp objects such as i, m, and s could be preserved in the resulting RegExp using a modified group (i.e., (?-i:Z) or (?i:Z) based on the condition above).

ljharb commented 2 years ago

To be clear, the ecosystem isn't interested in a tag function, it's just a few delegates that are.

With a tag function, I think it would be much clearer to mimic regex literal syntax as much as possible - which includes flags going last, like in the RegExp constructor.

rbuckton commented 1 year ago

To be clear, the ecosystem isn't interested in a tag function, it's just a few delegates that are.

With a tag function, I think it would be much clearer to mimix regex literal syntax as much as possible - which includes flags going last, like in the RegExp constructor.

I'm not sure I agree. If both new RegExp("/") and new RegExp(String.raw`/`) match the literal string "/", then RegExp.tag`/` should as well or it would be a refactoring hazard. It also adds unnecessary complexity to essentially require double-delimiting a regular expression: RegExp.tag`/foo/`.

I think RegExp.tag`(?u)foo` is not only far clearer, but also allows the reader to know what flags are in play at the start of a potentially multi-line RegExp, rather than needing to scan to the end to find the flags.

gibson042 commented 1 year ago

I agree with @rbuckton, and there are also other possibilities in the design space that avoid embedding what looks like a full and essentially double-delimited regular expression literal rather than just a body. For example,

RegExp.tag("u")`^\p{ID_Start}\p{ID_Continue}*$`

(although my preference is for the simpler body-only RegExp.tag`(?u)^\p{ID_Start}\p{ID_Continue}*$`)

ljharb commented 1 year ago

I feel pretty strongly here that we need to optimize for readability, not writability - the more different it looks from a regex literal, the harder it will be to explain to newcomers how to understand code that uses it.

rbuckton commented 1 year ago

I feel pretty strongly here that we need to optimize for readability, not writability - the more different it looks from a regex literal, the harder it will be to explain to newcomers how to understand code that uses it.

My previous comment specifically pointed out that I think a leading (?imsuvx) improves readability, since you don't need to scan to the end to find the flags. I also think double-delimiting via `/ ... /` does not improve readability, but instead is more likely to add confusion. @gibson042's suggestion of RegExp.tag(flags)`pattern` is reasonable, but has the downside of allocating a new function closure for each invocation.

I'd also be concerned that an example like

RegExp.tag`
/foo/
`

would lead people to mistakenly believe that removing the tag is safe, which would not be the case for

RegExp.tag`
/
foo
/x
`

I'd also note that several other languages, including Perl, allow you to use alternate delimiters for regular expressions, including <> and '', so I don't think using `` as the delimiter on its own is too wild of a notion.

/foo/ today means RegExp literal. A tagged template is not a RegExp literal, it is just a string that contains RegExp pattern characters, not unlike new RegExp(""). RegExp literals have the downside that you must escape / to use it in a pattern. A tagged template will already require that you escape ` to use it in a pattern. Requiring you also escape / as well would not be ideal.

ljharb commented 1 year ago

To clarify, I think readability necessitates maximal consistency with regex literals, as well as the RegExp constructor, both of which have flags last.

bathos commented 1 year ago

I find the composability argument very compelling; it would address real pain points I’ve encountered when trying to assemble grammars from common components. I also think flags-first is a readability improvement (and even a kind of hazard reduction), personally, but appreciate that the inconsistency would be unfortunate.

cben commented 1 year ago

[EDIT: this comment was a long rant I wrote before getting enough context, sorry. See my next comment instead.]

Is there a security argument for accepting [some] flags only out-of-bounds and not inside the regexp? [I am constantly getting snyk notifications about packages with "ReDOS" regexp denial-of-service vulnerabilities, so yes people use or build regexps using external inputs a lot.] * It seems to me `g`, `y` and maybe `d` are unlike other flags, in that they modify the actions performed with the regexp, rather than what the regular expression means. Enabling/disabling these against expectations of the code operating on a RegExp can lead to crashes or bugs. For other flags modifying the regexp syntax, a scoped flags syntax e.g. `(?i:subpattern)` syntax sounds safer to me. ~~Not sure if any regexp engines support such syntax; the composable-with-flags prior arts I'm familiar with are inherently structured LISP notations, not flat strings.~~ [EDIT: https://github.com/tc39/proposal-regexp-modifiers introduces such scoped identifiers] But a global `(?i)` is risky if people may interpolate a sub-regexp containing such flag into a larger regexp. [Some will continue doing naive interpolation whether you provide safer composable APIs or not.] If the result is `external (?i)internal external`, all interpretations are risky! Will it affect the whole thing? Will it be ignored for the internal portion? Will it raise an exception? An exception if (?flags) appear >0 position is safest and follows "principle of least surprise", but note that doesn't prevent bugs when interpolating a sub-regexp right at the start e.g. `${dir}/${fileName}`! The above is a strawman attack if the proposal is to support a leading LITERAL flags only in `` RegExp.tag`(?flags)...` `` and NOT in general regexp syntax. By the power of tagged literals, this would know the difference from `` RegExp.tag`${dir}/${fileName}` `` and if `dir` is a RegExp with its own flags it would be able to safely scope them to only that portion :sparkles: But giving RegExp special composition powers begs the question of external representation / round-tripping. If you can have case-insensitivity `i` only in a subpattern, what would .toString() return on the compound regexp? Would it convert each char `x` to `[xX]`? OK. What about `m`, `s` and `u`? Can they all be sanely reduced to a single canonical flag set where all regexp features are expressible (and at a reasonable length)?? Even if yes, it feels to me that translating tagged-literal composition to a composition-friendly scoped flags syntax like `(?i:...)` would be much more understandable to humans than an `[xX][yY]` soup. ~~And IMHO introducing a limited global `(?i)` syntax, even if it has Perl precendent, would be a missed oportunity.~~ [EDIT: I missed the whole context here] [I'm a random JS programmer, no status on any committee]
cben commented 1 year ago

Oops I should have read more before writing so much :flushed: I'll hide my previous comment.
Oh I see https://github.com/tc39/proposal-regexp-modifiers proposes a scoped syntax!

What's the benefit you see for RegExp.tag`(?x)...` over `RegExp.tag`(?x:...)`?

For the example given IMHO this works just as well:

RegExp.tag`/(?x
  # escaped, case sensitive
  ${string} 

  # nested RegExp is *not* escaped.
  ${caseSensitive ? /Z/ : /Z/i} 

  # And same style allows for locally scoped flags without need of nested RegExps!
  # e.g. if that case-insensitivity wasn't dynamic, you can just write:
  (?i:Z)
)/`;

IIUC the only use case for leading global syntax is expressing behavior tags (?gdy) where scoping doesn't make sense, and for which RegExp.tag needs some way to express. And that would allow dropping the slashes.

Does the current proposal allow RegExp.tag`/regexp/with/slashes/g` or do the inner slashes have to be escaped? If they must be escaped, I have to say the argument to drop slashes and pass global flags somehow else becomes compelling.
Especially since / vs. \/ is not meaningful at the layer of tagged literals, you need \\/ to get \/ in the template. So unlike /regexp\/with\/slashes/ you'd have to write ` RegExp.tag`/regexp\\/with\\/slashes/g`. So if that's necessary, "familiarity" is undermined.

Hmm, that escaping issue is much wider. All distinctions regexps make between metacharacters like + and literals like \+ need double backslash for the backslash to get to the template function :cry:! Am I missing anything?

>> console.log`foo+bar`
Array [ "foo+bar" ]

>> console.log`foo\+bar`
Array [ "foo+bar" ]

>> console.log`foo\\+bar`
Array [ "foo\\+bar" ]

On one hand, this means \\/ problem is not special.
OTOH, it means any attempt to explain it as "usual /regexp/ syntax wrapped with RegExp.tag`...`" promotes a wrong mental model! :warning: Because the central benefit of builtin /.../ syntax is backslashes inside have the regexp meaning directly, with no double-escaping nonsense.

It's much better to think of these as similar to new RegExp("...") with explicit awareness that it's a JS string syntax, with JS delimiters & escape processing, then undergoing regexp parsing.

OK, I'm sold. Ignore my previous objections to (?i)..., I don't care strongly if you do that RegExp.tag('i')...`, just don't mislead people with slashes.

ljharb commented 11 months ago

Closing, since the non-tag function is what advanced to stage 2 today.