
proposal: regexp: add iterator forms of matching methods #61902

Open rsc opened 1 year ago

rsc commented 1 year ago

We propose to add methods to regexp that allow iterating over matches instead of having to accumulate all the matches into a slice.

This is one of a collection of proposals updating the standard library for the new 'range over function' feature (#61405). It would only be accepted if that proposal is accepted. See #61897 for a list of related proposals.

Regexp has a lot of methods that return slices of all matches (the “FindAll*” methods). Each should have an iterator equivalent that doesn’t build the slice. They can be named by removing the “Find” prefix. The docs would change as follows. (Plain text is unchanged; strikethrough is removed, bold is added):

There are ~~16~~ **24** methods of Regexp that match a regular expression and identify the matched text. Their names are matched by this regular expression:

    (Find|All|FindAll)(String)?(Submatch)?(Index)?

If 'All' is present, the routine matches successive non-overlapping matches of the entire expression. **The ‘Find’ form returns the first match. The ‘All’ form returns an iterator over all matches.** Empty matches abutting a preceding match are ignored. ~~The~~ **For ‘FindAll’, the** return value is a slice containing the successive return values of the corresponding ~~non-'All'~~ **non-‘Find’** routine. ~~These~~ **The ‘FindAll’** routines take an extra integer argument, ...

Instead of enumerating all eight methods here, let’s just show one example. FindAllString currently reads:

// FindAllString is the 'All' version of FindString; it returns a slice of all
// successive matches of the expression, as defined by the 'All' description in
// the package comment. A return value of nil indicates no match.
func (re *Regexp) FindAllString(s string, n int) []string

This would change to become a pair of methods:

// FindAllString is the ~~'All'~~ **'FindAll'** version of FindString; it returns a slice of all
// successive matches of the expression, as defined by the ~~'All'~~ **'FindAll'** description in
// the package comment. A return value of nil indicates no match.
func (re *Regexp) FindAllString(s string, n int) []string

**// AllString is the ‘All’ version of ‘FindString’; it returns an iterator over all
// successive matches of the expression, as defined by the ‘All’ description in
// the package comment.
`func (re *Regexp) AllString(s string) iter.Seq[[]string]`**

The full list is:

// All is the ‘All’ version of ‘Find’: it returns an iterator over all ...
func (re *Regexp) All(b []byte) iter.Seq[[]byte]

// AllIndex is the ‘All’ version of ‘FindIndex’: it returns an iterator over all ...
func (re *Regexp) AllIndex(b []byte) iter.Seq[[]int]

// AllString is the ‘All’ version of ‘FindString’: it returns an iterator over all ...
func (re *Regexp) AllString(s string) iter.Seq[string]

// AllStringIndex is the ‘All’ version of ‘FindStringIndex’: it returns an iterator over all ...
func (re *Regexp) AllStringIndex(s string) iter.Seq[[]int]

// AllStringSubmatch is the ‘All’ version of ‘FindStringSubmatch’: it returns an iterator ...
func (re *Regexp) AllStringSubmatch(s string) iter.Seq[[]string]

// AllStringSubmatchIndex is the ‘All’ version of ‘FindStringSubmatchIndex’: it returns ...
func (re *Regexp) AllStringSubmatchIndex(s string) iter.Seq[[]int]

// AllSubmatch is the ‘All’ version of ‘FindSubmatch’: it returns an iterator over all ...
func (re *Regexp) AllSubmatch(b []byte) iter.Seq[[][]byte]

// AllSubmatchIndex is the ‘All’ version of ‘FindSubmatchIndex’: it returns an iterator ...
func (re *Regexp) AllSubmatchIndex(b []byte) iter.Seq[[]int]

There would also be a new SplitSeq method alongside regexp.Regexp.Split, completing the analogy with strings.Split and strings.SplitSeq.

// SplitSeq returns an iterator over substrings of s separated by the expression.
func (re *Regexp) SplitSeq(s string) iter.Seq[string]
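For illustration, here is how these might be used if the proposal were accepted as written (a sketch only: AllString and SplitSeq are the proposed names and do not exist yet, and the inputs are made up):

    package main

    import (
        "fmt"
        "regexp"
    )

    func main() {
        re := regexp.MustCompile(`[a-z]+`)

        // Proposed: consume matches one at a time, without building a slice.
        for m := range re.AllString("a man a plan a canal") {
            fmt.Println(m)
        }

        // Proposed: split lazily, analogous to strings.SplitSeq.
        sep := regexp.MustCompile(`,\s*`)
        for part := range sep.SplitSeq("one, two, three") {
            fmt.Println(part)
        }
    }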

cespare commented 1 year ago

@rsc under "The full list is" the actual function signatures need to be fixed to match the correct names in the doc comments.

magical commented 1 year ago

It seems like you are trying to establish a general pattern in the stdlib that the word All in a method name means that it returns an iterator. However, the regexp package has already established a pattern that All means a method that returns a slice. I don't think dropping the Find prefix is a particularly good or memorable compromise; it would be better in this case to find a different word to replace All (say, Iter).

Which is to say, I think these names would be a lot clearer:

    func (re *Regexp) FindIter(b []byte) iter.Seq[[]byte]
    func (re *Regexp) FindIterIndex(b []byte) iter.Seq[[]int]
    func (re *Regexp) FindIterString(s string) iter.Seq[string]
    func (re *Regexp) FindIterStringIndex(s string) iter.Seq[[]int]
    func (re *Regexp) FindIterStringSubmatch(s string) iter.Seq[[]string]
    func (re *Regexp) FindIterStringSubmatchIndex(s string) iter.Seq[[]int]
    func (re *Regexp) FindIterSubmatch(b []byte) iter.Seq[[][]byte]
    func (re *Regexp) FindIterSubmatchIndex(b []byte) iter.Seq[[]int]

Merovius commented 1 year ago

@magical I think Iter as a name component reads very badly. I'd be fine with it in regexp (given how rarely it's used), but definitely not as a general convention. I'd prefer essentially anything over changing the All convention to an Iter convention, including accepting this slight ambiguity in the regexp package, deviating from the All convention in the regexp package only, or not adding iterator versions of methods to regexp at all - in particular given how rarely the regexp package is used, in my experience.

rsc commented 1 year ago

This proposal has been added to the active column of the proposals project and will now be reviewed at the weekly proposal review meetings. — rsc for the proposal review group

benhoyt commented 1 year ago

I have found it quite confusing in the past to figure out which of the 16 Find* methods to use for a particular task. Determining String vs non-String is quite straightforward, but I always have to re-read several paragraphs of docs and play around to determine which of the remaining 8 to use. It's cute, even elegant, that the method names are matched by a regex (Find(All)?(String)?(Submatch)?(Index)?), but in my experience that doesn't make it easier to choose the one you need. This proposal is going to increase that cognitive burden. I don't have a concrete suggestion, but I wonder if there's a better way. From this perspective I do think having Iter (or Seq?) in the name could help. Or maybe there needs to be a "regex method chooser wizard" in the docs? Only semi-serious, but maybe this can be solved by documentation improvements.

extemporalgenome commented 1 year ago

In addition to the correction @cespare proposes, I believe your AllString example has the wrong returned iterator element type ([]string). It should be string:

func (re *Regexp) AllString(s string) iter.Seq[string]

extemporalgenome commented 1 year ago

I believe it would be beneficial to document in the relevant methods that any yielded slices ([]byte, []int, [][]byte, []string) cannot be retained beyond the current iteration step, thus allowing the regex implementation to reuse these slices. There is precedent for this kind of restriction in bytes.Buffer.Bytes and elsewhere.

I suspect most use cases will be unaffected by this restriction, other than to benefit from reduced allocations (which could potentially be helped further by internal use of sync.Pool to share allocations between unrelated iterations). Those cases that do need to retain a slice can easily enough call slices.Clone or append, as in the sketch below.
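For example (a sketch only: AllSubmatch is the proposed method, not an existing one, and the reuse rule is the suggestion above, not current behavior):

    package main

    import (
        "fmt"
        "regexp"
        "slices"
    )

    func main() {
        re := regexp.MustCompile(`(\w+)=(\w+)`)

        var values [][]byte
        for sub := range re.AllSubmatch([]byte("a=1 b=2")) {
            // Under the suggested reuse rule, sub and its elements may be
            // overwritten after this iteration step; clone anything retained.
            values = append(values, slices.Clone(sub[2]))
        }
        fmt.Printf("%q\n", values) // ["1" "2"]
    }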

go vet could include a check to ensure these slices do not escape in most cases.

leaxoy commented 1 year ago

Emm, that is, we will have nearly 60 methods to do matching? 😂

How about adding a new type RegexpIter, and providing methods to convert between the two?
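Roughly, the shape might be something like this (a hypothetical sketch only: none of these names exist, and a real implementation would match incrementally instead of wrapping FindAllString):

    package regexpiter

    import (
        "iter"
        "regexp"
    )

    // RegexpIter wraps a compiled *regexp.Regexp and exposes only
    // iterator-returning methods.
    type RegexpIter struct {
        re *regexp.Regexp
    }

    // Iter converts a *regexp.Regexp into the iterator-flavored type.
    func Iter(re *regexp.Regexp) *RegexpIter { return &RegexpIter{re: re} }

    // Regexp converts back to the underlying *regexp.Regexp.
    func (it *RegexpIter) Regexp() *regexp.Regexp { return it.re }

    // AllString yields the successive matches of the expression in s.
    func (it *RegexpIter) AllString(s string) iter.Seq[string] {
        return func(yield func(string) bool) {
            for _, m := range it.re.FindAllString(s, -1) {
                if !yield(m) {
                    return
                }
            }
        }
    }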

extemporalgenome commented 1 year ago

I agree with @leaxoy.

Irrespective of my above comments, I believe we'd be better served by an approach closer to bufio.Scanner, in which the iterator yields something like a *Match value with methods for fetching index offset ranges, individual matches as a string or byte slice, etc.
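Purely as a sketch of the shape I mean (hypothetical names, not part of the proposal):

    package regexpmatch

    // Match is what a Scanner-style iterator could yield: one value per
    // match, with accessors instead of many per-type methods on Regexp.
    type Match interface {
        Index(group int) (start, end int) // byte offsets of a capture group
        Bytes(group int) []byte           // a capture group as a byte slice
        String(group int) string          // a capture group as a string
    }

    // A single iterator method could then cover most of the Find* variants:
    //
    //     func (re *Regexp) Matches(b []byte) iter.Seq[Match]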

Merovius commented 1 year ago

I don't believe a separate type is a good idea. The regexp.Regexp type is a representation of a compiled regular expression, not of how to extract its values. We don't have regexp.BytesRegexp and regexp.StringRegexp, for example. And the API surface is larger with separate types, rather than smaller. And 99% of the logic will be shared (in fact, all the other extraction methods will probably be implemented in terms of the iterator methods).

It's unfortunate that the *regexp.Regexp type has that many methods. But I don't think deviating from that design decision at this point will be an improvement. It just means the result will be even more confusing, because now it is also incongruent and asymmetric.

mrg0lden commented 1 year ago

Regarding the index variants, wouldn't it make sense to use a Seq2[int, string|[]byte] instead of having two variants? That would be similar to the slices package's convention.
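A sketch of what that combined form could look like (a hypothetical free function, not part of the proposal; it is written on top of the existing FindAllStringIndex only to show the shape, and the end offset of each match is start+len(match)):

    package regexpseq2

    import (
        "iter"
        "regexp"
    )

    // AllString yields the start offset of each match together with the
    // matched text, replacing the separate Index variant.
    func AllString(re *regexp.Regexp, s string) iter.Seq2[int, string] {
        return func(yield func(int, string) bool) {
            for _, loc := range re.FindAllStringIndex(s, -1) {
                if !yield(loc[0], s[loc[0]:loc[1]]) {
                    return
                }
            }
        }
    }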

extemporalgenome commented 1 year ago

With separate methods, particularly the Index variants, I am concerned that we'd be adding the methods for completeness without gaining much value.

FindReaderIndex is of questionable utility, given that when you have a reader that you know nothing about (or know that you can't or don't want to reread, such as io.Stdin), it's rare that you can make meaningful use of the indices alone.

There are other cases where having a separate Match type is simply more efficient: the consumer may want a string representation of some submatches while having a byte slice representation of others. With a Match type, the consumer does not need to care what the underlying input was. If a string for a particular submatch is requested, it'll slice or copy a portion of the underlying input data depending on whether or not it was a string, but will be no worse in allocation efficiency than what the caller needs to do today.

extemporalgenome commented 1 year ago

@Merovius regarding deviating from the current convention being confusing, I think that's entirely manageable if all of the iter methods are internally self-consistent. Per your point, if we implement the methods as Russ initially proposed and then introduce a separate type, that certainly will be confusing. So this is really the only good time we'll get to make a clean decision.

Merovius commented 1 year ago

@extemporalgenome We seem to be talking past each other. Putting the method on a new type is the deviation. Obviously that can't be addressed by making them "internally self-consistent". And it's also not about timing - it's a deviation if we do it from the beginning just as much.

If we never had the *regexp.Regexp type with its current slew of variations, then maybe we could talk about making the API more minimal. But that ship sailed over a decade ago. At this point, we have one *regexp.Regexp type and it has methods based on a regular schema, grouped by the type they return. Adding iter.Seq as a return type on equal footing with []byte and string conforms to that - again, established a decade ago - pattern.

doggedOwl commented 1 year ago

Is there really a need for these? Do the regexp find methods return enough elements to justify returning an iterator, or is this just to avoid that one slice allocation?

leaxoy commented 1 year ago

> It's unfortunate that the *regexp.Regexp type has that many methods. But I don't think deviating from that design decision at this point will be an improvement. It just means the result will be even more confusing, because now it is also incongruent and asymmetric.

@Merovius A crazy idea: how about introducing a regexp v2, like math/rand/v2, to simplify the API?

Merovius commented 1 year ago

@leaxoy It doesn't seem that wild to me. I think there is an argument to be made that a) we might want to wait a release or so to see how the math/rand/v2 migration works and b) if we intend to make a v2, we might as well add these in the meantime, as any harm would be time-limited.

But yeah, it's not really up to me. Personally, I think this proposal is fine as it is, but maybe the Go team can be persuaded to do a v2 for regexp as well (I'm not sure how a simplified API would look, though).

@doggedOwl I thought the same thing, TBH. Especially as the matching groups can be sliced out of the input, so just the actual result slice has to be allocated. There are still two ways in which an iterator form arguably might improve performance: 1. it might enable you to prematurely stop some matching work - though this seems to be possible in a corner case at best. And 2. it might enable you to do subgroup matching entirely without allocations.
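Concretely, the early-stop case (1.) would look something like this (a sketch assuming the proposed AllString; whether the matching engine can actually skip the remaining work is the open question):

    package main

    import (
        "fmt"
        "regexp"
        "strings"
    )

    func main() {
        re := regexp.MustCompile(`\d+`)
        hugeInput := strings.Repeat("x123 ", 1_000_000)

        // With the proposed iterator form, break ends the iteration early,
        // and no slice of all matches is ever built.
        count := 0
        for m := range re.AllString(hugeInput) {
            fmt.Println(m)
            if count++; count == 10 {
                break
            }
        }
    }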

But I'm not totally sold on needing an iterator form of these either. It would be possible to feed the iterator into other iterator transformation functions, but then again, that would be just as possible by using slices.All on the result.

ulikunitz commented 1 year ago

Unfortunately the proposal doesn't say what the benefit of the proposal is or what problem it is trying to address. It looks more like a demo for iterator functions.

The proposal will further inflate the number of methods on the Regexp type. Even now I have to consult the documentation every time I use the package.

I would welcome a regexp2 package that simplifies the interface. Maybe by using byte slices and string as type arguments and supporting only the iterator methods, since the first match functions wouldn't be required anymore. Using a match type could also reduce the provided variants.

rsc commented 1 year ago

> Unfortunately the proposal doesn't say what the benefit of the proposal is or what problem it is trying to address.

It says

> Regexp has a lot of methods that return slices of all matches (the “FindAll*” methods). Each should have an iterator equivalent that doesn’t build the slice.

The benefit is that it doesn't build the slice. If you are searching large texts, you almost always want to consider the matches one at a time, and building the slice of all results is wasted memory, potentially larger than the text.

rsc commented 1 year ago

Finishing this proposal discussion is blocked on https://github.com/golang/go/issues/61405.

deefdragon commented 6 months ago

This should be unblocked now, right? Or is it waiting on the iterators addition to roll out?

aclements commented 1 month ago

This is a lot of new methods, but it's also very regular and consistent with the existing API. Is there anything still blocking this proposal?