tc39 / proposal-regexp-x-mode

BSD 3-Clause "New" or "Revised" License

Add helper method for multi-line regular expressions #3

Closed theScottyJam closed 2 years ago

theScottyJam commented 2 years ago

The idea of having an x flag for regular expressions was also discussed on the TC39 forums. One of the ideas there was to include a new Regex.from utility function with the proposal, which would make multi-line regular expressions easier to create.

Regex.from would actually be a function that takes flags as a parameter, and returns another function that can be used as a template tag, as follows:

const re = Regex.from('x')`
    # match ASCII alpha-numerics
    [a-zA-Z0-9]
`;
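A minimal sketch of how such a flag-taking function could return a template tag. The name regexFrom is a hypothetical stand-in (Regex.from does not exist), and the comment and whitespace stripping below is a naive assumption about what the x flag would do, not the proposal's actual semantics:

```javascript
// Hypothetical stand-in for the proposed Regex.from; this is not an existing API.
function regexFrom(flags = '') {
  const extended = flags.includes('x');
  // The built-in RegExp constructor rejects an unknown 'x' flag, so remove it.
  const nativeFlags = flags.replace(/x/g, '');

  return function tag(strings, ...values) {
    // String.raw reads strings.raw, so backslashes reach the regex engine as written.
    let source = String.raw(strings, ...values);
    if (extended) {
      source = source
        .split('\n')
        // Naive comment stripping: drop everything from an unescaped '#' onward.
        .map((line) => line.replace(/(?<!\\)#.*$/, ''))
        .join('')
        // Naive whitespace stripping: removes all whitespace, even inside [...].
        .replace(/\s+/g, '');
    }
    return new RegExp(source, nativeFlags);
  };
}

const re = regexFrom('xi')`
    # match ASCII alpha-numerics
    [a-z0-9]+
`;
console.log(re.source, re.flags);
```

A real implementation would need to leave `#` and whitespace alone inside character classes, which this sketch does not attempt.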

Thoughts?

h-h-h-h commented 2 years ago

This article from 2016 mentions three String.raw limitations. The first seems to be fixed by now, but I could still reproduce the others:

  1. There is no easy way to include the character ` itself in the string.

[...]

console.log(String.raw`с:\John`s.js`);

[...]

  1. There is no easy way to create a backslash string at the very end.

[...]

console.log(String.raw`с:\`);

Or, more paradoxically (first \ allowed, second not):

console.log(String.raw`с:\foo\`);

And how would the returned function of RegExp.from() handle interpolation with ${}, which even String.raw handles? Actually, embedding content in a regex would come in handy. But the content would need to be regex-ready (think quote() function), which could theoretically be handled in the template function that RegExp.from() returns.

These limitations and maybe more would always have to be considered when writing regexes, increasing cognitive load. However, I don't think there's a way around this.
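As a rough illustration of quoting inside the template function, here is a sketch in which every interpolated value is escaped before it is joined with the raw template text. Both quote and quotingTag are hypothetical names, and the escaping is the common "escape all regex syntax characters" approach, which ignores the context the value lands in:

```javascript
// Hypothetical helper: backslash-escape characters that have special meaning in a regex.
const quote = (text) => String(text).replace(/[.*+?^${}()|[\]\\]/g, '\\$&');

// Hypothetical tag: template text is taken raw, interpolated values are quoted.
function quotingTag(strings, ...values) {
  const source = String.raw(strings, ...values.map(quote));
  return new RegExp(source);
}

const userInput = 'a.b*c';
const quoted = quotingTag`^\d+:${userInput}$`;

console.log(quoted.source);           // ^\d+:a\.b\*c$
console.log(quoted.test('42:a.b*c')); // true
console.log(quoted.test('42:aXbbc')); // false
```

Note that this helper does not escape `-`, so a value interpolated inside a character class could still form a range.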

In Markdown, you encase the code text in more backticks and spaces: ``` `` `foo` `` ```. In Rust, you use more # characters: r##"foo #"# bar"## yields foo #"# bar. Getting language-level special treatment to be able to achieve actual raw strings would be great.

theScottyJam commented 2 years ago

To be honest, I don't feel like it's that important to support more sophisticated escaping with template literals. If there's not a ton of backslashes in your string, then don't use String.raw, use a normal template literal. If there is, use String.raw. If you're in one of those rarer scenarios where even String.raw is causing issues, use string concatenation, combining raw strings with unraw strings. Yes, string concatenation isn't the prettiest, but it's a relatively small price to pay in relation to how often these edge cases come up.
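For instance, the two failing cases from the previous comment can be written by concatenating raw and normal strings. The variable names are only for illustration, and an ASCII c is used in the paths:

```javascript
// Trailing backslash: the raw part keeps single backslashes readable,
// and the final backslash comes from a normal (escaped) string.
const dirPath = String.raw`c:\foo` + '\\';

// Embedded backtick: supply it from a normal string between two raw parts.
const filePath = String.raw`c:\John` + '`' + String.raw`s.js`;

console.log(dirPath);
console.log(filePath);
```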

Focusing our attention more on the specific Regex.from() tagged template literal, you are bringing up some interesting thoughts.

The first is whether the Regex.from() template tag should treat the template contents as raw strings or as normal, escapable strings. I'm inclined to say it should always treat the template contents as a raw string, considering how common it is to use a backslash in regular expressions, and how nasty it would be to have to double every backslash you want to use.
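To make the difference concrete, a tag function receives both views of the template text: the cooked strings (escapes processed) and strings.raw (backslashes kept). A small sketch, with showBoth as a hypothetical helper name:

```javascript
// Hypothetical helper: return both views of the first template segment.
function showBoth(strings) {
  return { cooked: strings[0], raw: strings.raw[0] };
}

const { cooked, raw } = showBoth`\d+\.\d+`;

console.log(cooked); // d+.d+      (the backslashes were consumed as string escapes)
console.log(raw);    // \d+\.\d+   (the backslashes survive for the regex engine)

console.log(new RegExp(raw).test('3.14'));    // true
console.log(new RegExp(cooked).test('3.14')); // false
```

With cooked strings, the same pattern would have to be written with every backslash doubled.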

Second, should we provide special quoting behaviors when values are interpolated into the template? Originally, I just assumed we'd make interpolation behave like normal string concatenation, the same way it does with an untagged template literal, but the idea of auto-quoting the value being interpolated is interesting. I guess we'd need to ask ourselves what kinds of quoting behaviors should be expected when you interpolate in different locations. Here's a handful I can think of.

The one hesitation I have with this is that it'll make it impossible to construct larger regular expressions by combining strings that each contain a chunk of the regular expression, because interpolating a string would cause everything in it to get escaped. There's also the issue that it's not actually intuitive how the quoting behavior should work, as seen in my second point. One person might expect quoting to just prevent you from leaking out of a character class with a ] character, and another might expect all characters to be interpreted as those literal characters. It's never good when there are ambiguous expectations in these sorts of things, as it'll heighten the learning curve needed to use it. And there's also the fact that you could replicate all the useful behavior of escaping by writing a simple escape function like this:

const escape = string => [...string].map(char => '\\' + char).join('');

const dynamicCharacterClass = '`^$\\';
const pattern = new RegExp(`xx[${escape(dynamicCharacterClass)}]*xx`);
'xx$^$^\\xx'.match(pattern); // matches

But still, it's certainly an idea worth discussing.

Third, is there a way to handle any kind of string literal, including trailing backslashes or back ticks? Well, first, I believe a trailing backslash in a regular expression is always going to be a syntax error, so that's actually a moot point. And back-ticks can simply be handled by putting a \ before it, which might not be the most perfect solution out there, but it's probably good enough. If it does become too much of a problem, the solution would be to cook up an entirely different proposal that provides new template-literal syntax, that lets us put back-ticks inside a template literal without escaping them, and once that proposal goes through, this multi-line regex proposal will automatically benefit from it.

h-h-h-h commented 2 years ago

I find it annoying that we don't seem to be able to get a proper once-and-for-all solution like Rust's raw strings. So, we always have to think hard about edge cases, of which we might not find all. But you're right:


Are you sure all these context-aware handlings of interpolations would even be possible? It sounds good, but if there's no way out, it'd indeed be doubtful that it would be widely accepted without hesitation. We could have both ways (with and without auto-escaping) by either providing two template functions (a second besides from()) or having an argument(/regex flag?) that specifies the behavior. Although one could always go the way of new RegExp(String.raw`^my` + String.raw`regex$`).

theScottyJam commented 2 years ago

> I find it annoying that we don't seem to be able to get a proper once-and-for-all solution like Rust's raw strings

I do agree it's a little awkward to work around some of these edge cases when we want to make the template tag use raw strings by default. But I'm also hesitant to introduce a fourth way to construct strings in JavaScript just to avoid these issues. Though, you know what, it might be worth starting a conversation on the TC39 forums about ways to deal with the edge cases of string escaping. Normally, using a combination of raw strings and normal strings would cover all use cases, but when it comes to tagged template literals, you're much more limited in what you can do escaping-wise, which is a shame. Just a thought.

> Even though String.raw`\`` === "\\`", console.log(new RegExp(String.raw`\``).test("`")); logs true, because \` is an escape handled by the regex engine.

Oh... you're right. I had just assumed that if the backslash was being used to escape a back-tick, it would also get removed from the raw string. This behavior surprises me. But, yeah, I guess it would still work, but it might not for other tagged-template contexts. Weird...
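The behavior can be checked directly. The sketch below also tries the u flag, where identity escapes are restricted to syntax characters, so the same source is rejected; the variable names are only for illustration:

```javascript
// The raw value keeps the backslash that was needed to escape the backtick.
const rawBacktick = String.raw`\``;
console.log(rawBacktick.length); // 2: a backslash followed by a backtick

// Without the u flag, the regex engine treats \` as an escaped literal backtick.
console.log(new RegExp(rawBacktick).test('`')); // true

// With the u flag, \` is not a permitted escape and construction throws.
let unicodeModeError = null;
try {
  new RegExp(rawBacktick, 'u');
} catch (err) {
  unicodeModeError = err;
}
console.log(unicodeModeError instanceof SyntaxError); // true
```

So the "just put a \ before it" approach works for non-unicode regular expressions, but a raw-string tag would need extra handling if it also had to support the u flag.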

An exception would be something like:

  • RegExp.from()\\ # Match \

Yes, it's true that the ending \ could be part of a comment, and would therefore be valid. Though, in practice, if you're going to be using comments in regular expressions, it's probably because you're also sprawling it across multiple lines and you want to comment an individual line, so I don't feel like we're losing any functionality here.

> Are you sure all these context-aware handlings of interpolations would even be possible?

If not, it's an issue with tagged template literals themselves, and should be fixed at the language level, not with this specific proposal. Though, I would mention a couple of other workarounds that will always work:

h-h-h-h commented 2 years ago

> it might be worth starting a conversation on the TC39 forms about ways to deal with the edge cases of string escaping.

I wouldn't participate, though.

That was all I wanted to add to the discussion.

rbuckton commented 2 years ago

There is already an existing proposal for RegExp escaping that includes a RegExp.tag function, so I don't believe it's necessary to duplicate that effort in this proposal.