slevithan / regex

Context-aware regex template tag with advanced features and best practices built-in
MIT License
271 stars 4 forks source link

Operations to avoid `.lastIndex` hacks? #5

Open rauschma opened 1 week ago

rauschma commented 1 week ago

Are there any ideas for operations so that we don’t need .lastIndex hacks anymore (I use them to tokenize efficiently)?

Maybe (with better names...):

slevithan commented 1 week ago

This library's implementation also liberally uses manual lastIndex adjustments. 😊

Do you envision these methods being top-level exports? I'm very open to discussion, but right now regex is exclusively focused on being a better regex constructor, and doesn't introduce utility functions.

The XRegExp library (which regex is a more focused and lightweight spiritual successor to) includes optional pos and sticky arguments for the XRegExp.test and XRegExp.exec methods, for just this purpose. (Its sticky argument predates ES6 adding flag /y, but uses /y when available.) And the hackiness of lastIndex is something I've extensively ranted about, e.g. here (from 2010). So I'm very interested in your ideas, but I'm not sure if it's the right fit for this library.

slevithan commented 1 week ago

Your implied proposal for new JS String.prototype.matchAt and RegExp.prototype.testAt methods is a cool idea, since a pos argument probably can't be added to the existing match and test methods due to its overlap with lastIndex handling. By creating new methods, they could always ignore lastIndex, like XRegExp's corresponding methods do. A couple thoughts though:

rauschma commented 1 week ago

Good points!

Below is my preliminary implementation. execAt() adds a property .nextIndex (as you suggested, but a different name). I’ll update it if anyone comes up with improvements.

// Updated: 2024-07-11
function execAt(regExp, str, index) {
  setLastIndex(regExp, index);
  return regExp.exec(str);
}
function testAt(regExp, str, index) {
  setLastIndex(regExp, index);
  return regExp.test(str);
}

function setLastIndex(regExp, lastIndex) {
  if (!regExp.sticky && !regExp.global) {
    throw new TypeError(
      `Either flag /g or flag /y must be set for matching at (or after) a particular index: ${regExp}`
    );
  }
  regExp.lastIndex = lastIndex;
}
slevithan commented 1 week ago
rauschma commented 1 week ago

Your implementation does not add + 1 to nextIndex after zero-length matches

I defer to your expertise:

I'm not understanding the purpose of regExp.lastIndex = savedLastIndex at the end.

I’d like to pretend that RegExp doesn’t store any state. I’d expect an actual built-in method to completely bypass .lastIndex and to be usable even with frozen RegExps.

There are two predecents:

My approach feels the cleanest to me but maybe we should follow the precedent of .replaceAll().

There are significant downsides and no user benefits I can think of from throwing if flag /g or /y is not set.

I see two options:

  1. Let /y dictate if matching is sticky or not. Then flags matter and, IMO, we should complain if they are not right.
    • That is similar to how .matchAll() and .replaceAll() work.
    • It makes the polyfill a tiny bit more efficient. But that shouldn’t really influence the design – if we expect this operation to eventually be built in.
  2. Provide non-sticky and sticky versions – e.g.:
    • execAt and execAtSticky (etc.)
    • execAtOrLater and execAt (etc.)

In case # 2, we can ignore /g and /y and use your approach.

(I have updated the code, but not yet switched to # 2.)

slevithan commented 6 days ago

I made nextIndexAfterLength an option. Then some people could use (e.g.) null. Not sure if that is useful though.

IMO it's probably overly complicated. One way to remove the complexity would be to go further than removing this and just not provide any kind of nextIndex property. It's not needed since its derivable, and it creates questions without obvious answers, like this one.

I’d like to pretend that RegExp doesn’t store any state. I’d expect an actual built-in method to completely bypass .lastIndex and to be usable even with frozen RegExps.

IMO it's better to maintain the state (if not frozen) and all the expectations from precedent, for the sake of all the legacy methods that rely on lastIndex (most of which are not actually legacy), but then abstract away the need to interact with these bits, via new methods and arguments. So bypass lastIndex for handling, but still set it based on precedent whenever possible (with the obvious precedents being the behavior of exec and test).

There are two predecents:

  • .matchAll() considers .lastIndex but doesn’t change it. (I’d rather it didn’t consider .lastIndex).
  • .replaceAll() ignores .lastIndex and resets it to 0.

I'd be curious to hear what TC39 thinks, but I think these might not be the right precedents.

I agree with you on wishing that matchAll didn't consider lastIndex, since the precedent from other string methods (as opposed to RegExp methods, and I consider non-global match to effectively also not be a string method since it is an alias for the canonical RegExp method exec) is that lastIndex doesn't set the search start position. But, perhaps, it can be justified, since matchAll essentially just repeatedly calls the RegExp method exec, and RegExp methods do consider lastIndex.

[...] Then flags matter and, IMO, we should complain if they are not right.

matchAll and replaceAll throw for regexes without /g, but that is because of the "All" in their names. I don't see anything incorrect about allowing both /g and non-/g with execAt/testAt, since exec/test already allow /g and non-/g (as do string methods search/split). I think the only impact of /g with execAt/testAt should be on whether lastIndex is updated. And I don't think separate sticky versions are needed. IMO /y should be respected with execAt/testAt, just like it's respected (or ignored if irrelevant, as with split) by every other method.


Aside: My high-level view is that /g, /y, and lastIndex were all API design mistakes. Flag /g would have been better handled as an argument (and now, after the introduction of replaceAll, even such an argument is not needed anywhere, if match didn't have a non-/g mode that was an alias for exec, which again I think was a mistake). lastIndex would have been better as a pos argument for exec and test, or handled via separate execAt/testAt methods that we're now discussing. And instead of /y, it would have been preferable to follow the prior art of the more flexible \G assertion from many other regex flavors, rather than standardizing Firefox v3's (at the time) non-standard flag.

ES3's flag /g copied Perl, but didn't copy Perl's system for tracking position on the search target rather than on the regex. Some other flavors don't use /g, and instead accept an argument for pos. JS's features for this can be worked with/around, but it's needlessly complicated, and (as you obviously know!) stateful regexes are a frequent source of bugs and surprise.

slevithan commented 3 days ago

Another aside:

There are a few different approaches that could be used to simplify, clean up, and improve JS's regex APIs. And I'll encourage and be happy to help anyone who wants to build their own version, based on whatever they think is best. Especially if it's someone like the awesome @rauschma who is well informed by all the existing complexity and legacy! 😊 Unfortunately, I don't think many people will use it if it's not native (added through new TC39 proposals), since JS's native APIs are good enough, and most people's usage of regexes is very simple.

The idea that there's not much appetite for non-native solutions to this stems from my experience with XRegExp. XRegExp already provided the APIs to totally disregard lastIndex, /g, and /y (making regexes act as if they were stateless), via its own XRegExp.exec / test / forEach / match / replace / replaceEach / split. But from what I can tell, almost no one used their pos, sticky, and scope arguments. So I didn't have much incentive to further improve them.

rauschma commented 3 days ago

IMO it's probably overly complicated. One way to remove the complexity would be to go further than removing this and just not provide any kind of nextIndex property. It's not needed since its derivable, and it creates questions without obvious answers, like this one.

Good point.

IMO it's better to maintain the state (if not frozen) and all the expectations from precedent, for the sake of all the legacy methods that rely on lastIndex (most of which are not actually legacy), but then abstract away the need to interact with these bits, via new methods and arguments. So bypass lastIndex for handling, but still set it based on precedent whenever possible (with the obvious precedents being the behavior of exec and test).

Makes sense.

I don't see anything incorrect about allowing both /g and non-/g with execAt/testAt

I would argue:

Aside: My high-level view is that /g, /y, and lastIndex were all API design mistakes.

I very much agree! I used \G via Rust’s fancy-regex and it was lovely.

I’m wondering if the upcoming flag /x would give us the opportunity to overhaul RegExp. My thoughts (based on your ideas) are here: https://gist.github.com/rauschma/8db338a96f9ead8df0714d233e81fed4

slevithan commented 3 days ago

I would argue:

  • If we stay close to the existing API (as we did with .lastIndex) then it makes more sense not to work around the constraints imposed by /g and /y.

This is a good explanation for your PoV--thanks! I think I still favor the flag-agnostic approach I described, but this helps me understand where you're coming from, and it's a totally reasonable perspective.

I’m wondering if the upcoming flag /x would give us the opportunity to overhaul RegExp. My thoughts (based on your ideas) are here: https://gist.github.com/rauschma/8db338a96f9ead8df0714d233e81fed4

You're very generous to credit me for inspiring this. This is a very ambitious proposal that essentially deprecates a whole swath of things (including recently added features like /d and matchAll) and introduces a lot of features simultaneously. I think I love it, but I'm curious how receptive TC39 would be. I think you'd have to make your case as strong as can be, and I think they'd want you to split it up regardless (I'll add some thoughts about this on this on the gist). If you get traction, I'd be happy to co-champion the proposal if you want help. 😊

A couple things that might be relevant/helpful: