slevithan / regex

Context-aware regex template tag with advanced features and best practices built-in
MIT License

Clarify the complete absence of numbered capturing groups with the n flag #1

Closed benblank closed 1 month ago

benblank commented 1 month ago

I quite like the n flag, but having no previous experience with it, I still had questions after reading that section of the README, which I was ultimately able to answer only by reading the source. Specifically, I was unclear on whether numbered / unnamed capturing groups were simply no longer possible or could still be created via some other syntax.

I'm probably picking nits here, but even as a fairly experienced user of regular expressions, I was left wondering whether the phrasing of that section implied that numbered capturing groups could be created in some other way. 🙂

Changing the name of the flag to something like "no numbered capture" might be more clear, but adding another implementation-specific name for the flag might not be great. Tweaking the description could also clear things up considerably, with or without a name change. Perhaps even just changing the first sentence to something like:

Flag n gives you no auto capture mode, which disables numbered capturing groups (so a plain (…) group is always non-capturing) but preserves named capture.

And thanks for creating this library! It looks like it adds some very helpful features and safe defaults to a regex engine which could use them.

slevithan commented 1 month ago

Thanks!! I will definitely try to improve that section in the readme based on your feedback here.

A few comments:

> (After all, there's no way to create a numbered capturing group, so they can't refer to anything, anyway.)

That's not true in JavaScript, although it is true in C++. If Regex.make only changed (…) to (?:…) but didn't additionally impose the restriction of preventing numbered backreferences to named groups, then something like (?<name>…)\1 would be valid. In JavaScript, the backreference numbers go left to right for all captures (named and unnamed), but as called out in the readme, the way that named captures are numbered varies across regex flavors.
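The numbering behavior described here can be checked with native JavaScript regexes alone. The snippet below is a small illustration, not library code; the patterns and variable names are made up for the example.

```javascript
// Native JavaScript only; nothing here uses the regex library.
// Capture numbers run left to right across named and unnamed groups alike.
const mixed = /(a)(?<second>b)(c)/.exec('abc');
const numbered = [mixed[1], mixed[2], mixed[3]]; // ['a', 'b', 'c']
const named = mixed.groups.second;               // 'b', the same capture as mixed[2]

// So a numbered backreference can point at a named group in a native regex.
const byNumber = /(?<ch>\w)\1/.test('aa');   // true: \1 refers to the group named "ch"
const byName = /(?<ch>\w)\k<ch>/.test('aa'); // true: the named form of the same reference
const noRepeat = /(?<ch>\w)\1/.test('ab');   // false: 'b' does not repeat 'a'
```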

> numbered backreferences are entirely disallowed

They're allowed in one context: within interpolated regexes. And there is specific handling to make sure they work, by adjusting numbered backreferences to work within the overall pattern (which requires accounting for both named and unnamed captures since they both affect the numbering).
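A rough sketch of why that adjustment has to count both kinds of capture. This is not the library's implementation: `countCaptures` and `shiftBackrefs` are hypothetical helpers written for this illustration, and they assume patterns without character classes or other edge cases that real code would have to handle.

```javascript
// Hypothetical helpers, NOT the library's implementation.
// Simplification: assumes the patterns contain no character classes.

// Count capturing groups (named and unnamed) in a pattern source.
function countCaptures(source) {
  let count = 0;
  // Escaped pairs are consumed first so that "\(" is never mistaken for a group.
  // "(?<" followed by "=" or "!" is a lookbehind, not a named group.
  for (const m of source.matchAll(/\\[\s\S]|\((?!\?)|\(\?<(?![=!])/g)) {
    if (m[0].startsWith('(')) count++;
  }
  return count;
}

// Shift every numbered backreference in `source` by `offset`.
function shiftBackrefs(source, offset) {
  return source.replace(/\\(?:([1-9]\d*)|[\s\S])/g, (escape, num) =>
    num === undefined ? escape : `\\${Number(num) + offset}`
  );
}

const outer = String.raw`(?<year>\d{4})-(\d{2})-`; // two captures: one named, one unnamed
const inner = String.raw`(\w)\1`;                  // \1 is correct only in isolation
const combined = outer + shiftBackrefs(inner, countCaptures(outer));
// combined is now (?<year>\d{4})-(\d{2})-(\w)\3
```

Without the shift, `\1` in the combined pattern would point at the `year` group rather than at `(\w)`, which is why the named capture in `outer` has to be counted along with the unnamed one.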

> Changing the name of the flag to something like "no numbered capture" might be more clear, but adding another implementation-specific name for the flag might not be great.

That's a good suggestion. Another option I considered is "named capture only". I'll think about it, and I'm open to other suggestions. I'm okay with using a unique name for Regex.make if it meaningfully improves clarity. Two potential downsides to "no numbered capture" are (1) it might imply (…) doesn't work (but it does, it just doesn't capture), and (2) there still are numbers for named captures, you just can't use them within Regex.make. But you'll find them e.g. on match result arrays, although obviously Regex.make is encouraging you to ignore them in favor of the groups object.
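Point (2) can be seen with a native regex. In this sketch, `(?:…)` stands in for what a plain `(…)` would become under flag n; the pattern and variable names are illustrative only.

```javascript
// Native JavaScript only; nothing here uses the regex library.
const match = /(?<year>\d{4})-(?:\d{2})/.exec('2024-05');

const yearByName = match.groups.year; // '2024'
const yearByIndex = match[1];         // '2024' as well: the named capture still occupies slot 1
const slots = match.length;           // 2: the full match plus one capture; (?:…) adds none
```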

benblank commented 1 month ago

I hadn't thought about interpolated regexes; wouldn't the accompanying backreferences also need to be in the interpolated regex? In other words, wouldn't numbered backreferences still be "entirely disallowed" in the outer regex?

If you do decide to change the name, I think you've nailed it with "named capture only". Very simple and explicit about what is and isn't possible.

Okay, so if I have it right…

…with all but the first two basically being "non-effects" caused by either the sandboxing of interpolated RegExps or the fact that make() returns an ordinary RegExp.

Whew! Who would have thought regular expressions could be complex? 😉

slevithan commented 1 month ago

Yes, all of your details are correct and well stated. 🙂

> Whew! Who would have thought regular expressions could be complex? 😉

OMG, this is not even the start of it. Recall where the readme describes four different parsing modes (which Regex.make reduces down to just one). And the work required to reliably sandbox and atomize any value for all three supported forms of interpolation in all regex syntax contexts while still outputting native regexes is kind of insane. You can get a sense of it by reading the readme top to bottom and expanding all the collapsed sections, but the text doesn't try to capture everything. (Thanks for the opportunity to rant.)

slevithan commented 1 month ago

@benblank A question while you're here:

Would it make a meaningful difference to you to shorten Regex.make`.` to just Regex`.`? This is a serious question for me, since although Regex.make`\w` is already dramatically shorter than new RegExp(String.raw`\w`, 'v') for dynamic regexes, it's longer than /\w/v for literals. My hope is that at least some people will start using Regex.make for all of their regexes (to benefit from its best practices, protections, and features); not just their most complex.
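For context on why the dynamic form is so long: the `String.raw` wrapper is what keeps the backslash intact. A minimal native-JavaScript illustration follows. Flag v is left out here only so the snippet also runs on engines without support for it.

```javascript
// In an ordinary string literal, "\w" collapses to "w"; String.raw keeps the backslash.
const cooked = '\w';        // 'w'
const raw = String.raw`\w`; // backslash followed by "w"

const dynamic = new RegExp(raw);                   // equivalent to the literal /\w/
const sameSource = dynamic.source === /\w/.source; // true
```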

A downside to putting the tag on Regex rather than Regex.make is it makes it harder to import just the parts you want, should there be more exports in the future than the current make and partial.

benblank commented 1 month ago

> (Thanks for the opportunity to rant.)

I've been following your blog for years largely because of what you have to say about regular expressions, so no worries! 👍

With regard to Regex.make, I personally tend to avoid default exports, so I wouldn't get any benefit from Regex`.`; I'd still be importing make.

The main reason I prefer named exports isn't even related to tree shaking — it's the name itself. A named export gives you the opportunity to "suggest" a local name to use for the imported value. The author of the client code can always choose to ignore the suggested name using import { foo as bar } … syntax, but the conventional name is at least established in the code. For default exports, lacking that suggested name in the code, the convention tends to only be in the documentation.

From that perspective, I think I'd not only leave the tag where it is, but even rename/alias make to regex, simply because it's more clear. Just as partial creates "a partial", regex would create "a regex" (even though the former is a library class and the latter built-in). Right now, the name make is generic enough that if I were using just the named export in client code, I'd be worried about its meaning being unclear, but regex doesn't have that problem.

Besides, even though import { regex } from 'regex' is slightly longer than import Regex from 'regex', you only type it once per file. And regex`\w` is even shorter than Regex.make`\w`! 😎

slevithan commented 1 month ago

Aliasing make as regex is a cool idea. I'll take some time to think about it. Thanks for the detailed thoughts!

slevithan commented 1 month ago

Okay, I've (hopefully) improved the description of flag n (and renamed it as named capture only mode) while trying to keep it concise. Feel free to suggest additional improvements!

I've adopted your idea of renaming make as regex, and kept make as an alias (that might be removed down the line in v2). Following the implications of this change, I've also renamed the overall library from Regex.make to just regex.

Thanks, @benblank! 😊

slevithan commented 1 month ago

Published as v1.1.0.