golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License
122.84k stars 17.51k forks source link

proposal: Go 2: function values as iterators #43557

Closed mknyszek closed 1 year ago

mknyszek commented 3 years ago

Proposal: Function values as iterators

Motivation

A number of proposals have come up over the years of supporting some first-class notion of iterator such that loops that use the range keyword may be used to iterate over some custom data structure. The Go community in general has also wondered about the "right" way to describe an iterator, evidenced by this blog post from 2013 that describes many of the ways Go programmers abstract away iteration, and much of this is true today as well.

Overall, however, iteration over a non-builtin data structure may be somewhat error-prone and clunky. To refresh on the various ways that iteration may be abstracted away in Go, let's work through some examples.

Firstly, we may have some tree type with some dedicated iterator type:

type IntTree struct {
    ...
}

func (t *IntTree) Iter() IntTreeIterator {
    ...
}

type IntTreeIterator struct {
    ...
}

func (t IntTreeIterator) Next() (int, bool) {
    ...
}

One way such an "iterator" could be used is,

iter := tree.Iter()
for {
    i, ok := iter.Next()
    if !ok {
        break
    }
    // Do stuff with i.
}

or even,

iter := tree.Iter()
for i, ok := iter.Next(); ok; i, ok = iter.Next() {
    // Do stuff with i.
}

The former usage works fine, but suffers from readability issues. The meaning of the code is obscured somewhat as the reader first sees an apparently infinite for-loop and must look for the "break" statement to understand what's going. Luckily this condition is usually present at the top of the loop, but it requires a more careful look. Furthermore, the iteration condition needs to be written explicitly. Writing it once may not be a problem, but writing it 100 times might be.

The latter usage also works fine and the intent is more clear, but it has a similar problem with the iteration condition. There's also an element of repetition which on the surface is fine, but it does harm the readability of the loop. Especially with variable names like "i" it becomes easy to get lost in punctuation.

Another way to abstract iteration away is to pass a function value to a function that iterates on behalf of the caller. For example:

type IntTree struct {
    ...
}

func (t *IntTree) ForEach(func(i int)) {
    ...
}

tree.ForEach(func(i int) {
    // Do stuff with i.
})

This method works well in many scenarios, but is decidedly less flexible as it separates the loop body from the surrounding code. Capturing local variables in that function value helps, but potentially at the cost of some efficiency, depending on the complexity of the iteration. One advantage of this method, though, is that defer may be used to perform clean up on each loop iteration, without allocating a defer (thanks to open-coded defers).

Prior art

A previous proposal (#40605) suggested allowing types with a Next method to have that method repeatedly called when used in a range loop:

iter := tree.Iter()
for i := range iter {
    // Do stuff with i.
}

This works fine, but from my perspective, doesn't feel very Go-like. Having a language construct be aware of a type's methods for the sake of syntactic sugar is not a pattern found anywhere else in the language (yet). In the parlance of the generic design, the existing range keyword matches on the underlying type used, not the type's methods.

Furthermore, it usually requires defining a new type which is a bit more work for the writer of the code as well as the reader. Overall a bit clunky, but not bad. It lines up well with how other languages work. Rust, for instance, uses a trait (analogous to an interface) to determine whether a type is an iterator.

Another previous proposal (#21855) suggested supporting a two-clause for-loop to make iterating less error-prone, such as:

iter := tree.Iter()
for i, ok := iter.Next(); ok {
    // Do stuff with i.
}

Unfortunately, a two-clause loop form is itself error-prone, as the placement of a single semicolon has a significant effect on semantics (specifically, the second semicolon which distinguishes between the 2-clause and 3-clause forms). This proposal was rejected because that semicolon was considered too dangerous.

Other proposals (#24282) have been made to fundamentally change how for loops work, indicating at least some degree of friction.

Proposal

Rolling with the idea of closures, and with the observation that the range form matches on types largely structurally, I would like to propose that range loops repeatedly apply function values of a particular form.

More specifically, I propose allowing the for-range statement to accept values of type func() (T, bool) or func() (T, S, bool) (where T and S are placeholder types; they may be any substituted for any other type) and will repeatedly call these values until their bool result is false.

Iterators may then be written in the following way:

type IntTree struct {
    ...
}

func (t *IntTree) Iter() func() (int, bool) {
    ...
    curr := t.root
    return func() (int, bool) {
        // Use curr to keep track of your position, repeatedly find the successor of curr
        // until there is none.
    }
}

for i := range tree.Iter() {
    // Do stuff with i.
}

More precisely, the last for statement "de-sugars" into the following code, where tmp is a temporary variable not visible to the program:

tmp := tree.Iter()
for i, ok := tmp(); ok; i, ok = tmp()  {
    // Do stuff with i.
}

The limitation to variables of type func() (T, bool) or func() (T, S, bool) (instead of allowing an arbitrary number of return values) is to keep range loops looking familiar and to avoid a misuse of the syntax.

Discussion and observations

Pros:

Cons:

This idea was inspired by the way Python generators are used as iterators. I recently had the realization that Python generators can approximated by repeated application of closures. I thought that this would have been proposed already, but I couldn't find anything like it. Interestingly, this proposal also allows for iterating over infinite generators and such. It can also be used to write Python-like loops (for better or for worse):

// Linspace returns an iterator that iterates from start (inclusive)
// to stop (exclusive) by stride. start must be <= stop.
func Linspace(start, stop, stride int) func() (int, bool) {
    i := start
    return func() (int, bool) {
        i += stride
        return i, i < stop
    }
}

for i := range Linspace(0, 10, 2) {
    // i = 0, 2, 4, 6, 8
}

It might also be worth allowing the iterator's final return value to be an error instead of a bool, though I'm not totally sure how to surface that to the user.

I'm not even totally convinced myself that this is worth doing, since the benefit seems minor at best, but I figured I should at least put the idea out there.

Deleplace commented 3 years ago

Iiuc we can achieve something similar by returning a chan. I mean, same simplicity at the range clause, and similar moderate complexity of the iterator implementation. Writing to an unbuffered chan is like yielding in a continuation.

ianlancetaylor commented 3 years ago

@Deleplace Using a channel can work, but it has some difficulties.

  1. It means that there has to be a separate goroutine that writes to the channel. If the loop breaks or returns early, some mechanism is required to tell that goroutine to exit.
  2. Because each iteration requires context switching between the two goroutines, it's significantly slower. So much so that in practice people don't use this approach.
zikaeroh commented 3 years ago

One thing I'd consider to be missing would be a way to inform the iterator that the caller is "finished" with the iterator, i.e. if the range loop were exited early for some reason. I can see some iterators wanting some element of cleanup; for example, building up a type safe iterator for rows.Next() and needing to call Close() at some point. Right now the only cleanup appears to be the eventual GC of whatever the closure referred to. Otherwise, you basically have to build an object that satisfies some interface and then use it in two places, deal with calling it the right way, etc. If this isn't going to be some multi-method interface implementation (and just the function), maybe return two functions, or have the one function have a exiting bool parameter, but I don't see much benefit over having a well-known interface.

I'd also wonder if this would need to wait for generics, such that the definition of this pattern could be well-defined with a generic interface somewhere rather than being "well-known" (and in regards to the above cleanup, potentially have some extra method an iterator could provide to be called on loop exit).

smasher164 commented 3 years ago
for i := range tree.Iter() {
   // Do stuff with i.
}

This code example is the same for #40605 and the current proposal, the only difference being matching on an interface type as opposed to a function type.

You mention that matching on an interface's methods as opposed to a concrete type is not Go-like, but in the same vein, the error handling proposals thus far have treated error specially. Why couldn't the same approach be taken for a hypothetical OneClauseIterator and TwoClauseIterator?

ivila commented 3 years ago

I see many of the repositories provide iterator in other way, for example: func ForEachCell in "github.com/tealeg/xlsx"

// ForEachCell will call the provided CellVisitorFunc for each
// currently defined cell in the Row.  Optionally you may pass one or
// more CellVisitorOption to affect how ForEachCell operates.  For
// example you may wish to pass SkipEmptyCells to only visit cells
// which are populated.
func (r *Row) ForEachCell(cvf CellVisitorFunc, option ...CellVisitorOption) error {
    return r.cellStoreRow.ForEachCell(cvf, option...)
}

and call it by like

err = r.ForEachCell(func(c *Cell) error) {
        // do something here ...
}

and can also define a common error like ErrEarlyBreak to know what happen while iter through all data, just like ErrNil for telling people key not exist in redis

err = r.ForEachCell(func(c *Cell) error) {
        // do something here ...
        if someCondition {
                return ErrEarlyBreak
        }
}
mknyszek commented 3 years ago

@Deleplace What Ian said. I also don't foresee the compiler gaining the ability to optimize that down at any point. It gets quite hard, and the optimization becomes fairly fragile.

@zikaeroh It may be true that some iterators want an element of cleanup, but I'd argue most don't. To me this is a signal that whatever the code is going iterate over is fairly complex, and maybe there's a better abstraction for that. For instance, the ForEach example above could be good for this because the cleanup is more succinctly captured by ForEach. I'm also wary about hiding more calls to code. If there was some cleanup function, its call ends up somewhat hidden (moreso than the original repeated call already is, since at least the value is visible).

RE: not seeing a benefit over having an interface, I think I get that. You also make a good point about this pattern being surprising, especially if it's not backed by some formal definition. It's true that most other languages do this differently and that's perhaps what folks have come to expect.

@smasher164 In my defense, none of those proposals actually made it yet. :) But, that's a good point. You certainly could go that route instead. FWIW, the "Go-like" in the proposal is absolutely subjective. Interfaces are "Go-like" too.

To me, range is like an operator that acts on types of a certain structure. That structure is visible in the type via brackets ([]), the chan keyword, and the map keyword. I'm proposing adding the func keyword (in some limited way) to the mix, basically.

mknyszek commented 3 years ago

@ivila I mention that (or a simpler version of that) in the proposal. I think that pattern has its uses and isn't going to go away (we already have e.g. filepath.Walk that works similarly). I'm not sure yet that there's a nice way to pass errors up with what I've defined.

And maybe its limited use is a good reason not to support it. I do think there are still cases where the iterator I've proposed can be useful. It works nicely for iterating over common data structures (or some combination thereof), optimizes nicely in the common case, and it composes nicely with map and filter functions (though it gets harder for the compiler to optimize that, of course). That last bit about composition gets easier with generics, but you can still do it without.

bboreham commented 3 years ago

Could you say what is meant in the T, S case? I scanned the text several times looking for S; apologies if I missed it.

My guess is it’s “key, value” like when range-ing over a map.

mvdan commented 3 years ago

I don't have a strong feeling about this proposal, but I also have to say it's probably the nicest approach to native iterators I've seen.

Hidden behavior: similar to operator overloading, there's now one more thing that range could mean, and each call could be arbitrarily expensive (e.g. a series of HTTP requests).

As a counter-point, ranges can already block for a long time (e.g. channel receive) or consume a surprising amount of CPU (e.g. if copying huge arrays). This proposal could make these cases more common, but I don't think it would be something new. The syntax containing a function call also helps hint at running arbitrary code for each iteration.

Outsized impact: a whole language change for something that is arguably only a minor issue.

Out of all the downsides you list, I think this is the only worrying one. At the same time, it's a problem with any proposal to add iterators to the language, so I don't think it's a problem with this approach in particular.

mknyszek commented 3 years ago

Could you say what is meant in the T, S case? I scanned the text several times looking for S; apologies if I missed it.

My guess is it’s “key, value” like when range-ing over a map.

Yeah, sorry about that, the S kind of comes out of nowhere. It's just supposed to be another placeholder for a type (I thought K, V might be more confusing). I added a line around where I introduce the func() (T, bool) form to explain that these are placeholder types that may be substituted with any other type. (@bboreham)

DeedleFake commented 3 years ago

As proposed, this seems to allow the use of method values, meaning that it is actually essentially compatible with the original Next() method iterator proposal:

iter := tree.Iter()
for i := range iter.Next {
    // Do stuff with i.
}
tv42 commented 3 years ago

The word "closure" here is a red herring and an implementation detail of the example code. Closures (in Go terminology) are function literals, but there's no reason to enforce that, or to really even know. The word wanted here is "function value" (and method values are function values).

I would suggest retitling to "proposal: Go 2: function values as iterators"

tv42 commented 3 years ago

"More specifically, a for-range statement may accept variables" should say "More specifically, a for-range statement may accept values"

mknyszek commented 3 years ago

I don't have a strong feeling about this proposal, but I also have to say it's probably the nicest approach to native iterators I've seen.

Thanks! 😀

As a counter-point, ranges can already block for a long time (e.g. channel receive) or consume a surprising amount of CPU (e.g. if copying huge arrays). This proposal could make these cases more common, but I don't think it would be something new. The syntax containing a function call also helps hint at running arbitrary code for each iteration.

I thought about writing down that counter-point, but what stopped me was the thought that it could do an arbitrary amount of CPU work. But you raise a good point that that's possible today, via copies. I may go back and add that counter-point, then.

Out of all the downsides you list, I think this is the only worrying one. At the same time, it's a problem with any proposal to add iterators to the language, so I don't think it's a problem with this approach in particular.

Agreed.

prattmic commented 3 years ago

As a counter-point, ranges can already block for a long time (e.g. channel receive) or consume a surprising amount of CPU (e.g. if copying huge arrays). This proposal could make these cases more common, but I don't think it would be something new. The syntax containing a function call also helps hint at running arbitrary code for each iteration.

I totally agree here, though I will note that, as I understand it, the proposed syntax uses a function value as the argument to range, not a call (like the go statement), so it is not necessarily completely obvious that there are calls. So this may be a direct function reference (like @DeedleFake's example), or a function value defined in the function (for v := range fn {).

mknyszek commented 3 years ago

The word "closure" here is a red herring and an implementation detail of the example code. Closures (in Go terminology) are function literals, but there's no reason to enforce that, or to really even know. The word wanted here is "function value" (and method values are function values).

I would suggest retitling to "proposal: Go 2: function values as iterators"

Right. I'm aware of the distinction, though I figured it was important to make it clear that you can't really do much with an iterator that is a function value but not a closure. You can, for instance, return some value infinitely, and I suppose you can also iterate over some value in global variable as well.

Overall, what you suggest is indeed more precise. I'll update the proposal.

"More specifically, a for-range statement may accept variables" should say "More specifically, a for-range statement may accept values"

~You're probably right. I had some trouble with this because "range" isn't an operator. The most precise phrasing might be "More specifically, a for-range statement may accept expressions that evaluate to a function value of type ..." That way you get both a semantic and a syntactic description all in one.~

EDIT: Re-reading that passage, I think you're 100% right. I modified it to fit your phrasing. :)

Thank you for your suggestions!

tv42 commented 3 years ago

to make it clear that you can't really do much with an iterator that is a function value but not a closure.

You can replace any closure by defining a type that captures the free variables, returning a method value on that type. That often leads to more debuggable & readable code, too.

bcmills commented 3 years ago

A method value is, in a very real sense, a closure over the receiver.

(function literal∶closure∷square∶rectangle)

tv42 commented 3 years ago

Sure, but in that case I posit the term "closure" is not defined in the spec, and shouldn't be used for that reason alone.

Function literals are closures: they may refer to variables defined in a surrounding function.

That reads as an example, not an exhaustive normative definition, but it is the only place the word appears.

One of the strengths of Go is that the spec is very clean and easy to refer to.

ianlancetaylor commented 3 years ago

Just to clarify, it seems that with this proposal

for i := range f { ... }

is equivalent to

for i, next := f(); next; i, next = f() { ... }
josharian commented 3 years ago

And maybe:

for := range f { ... }

is equivalent to:

for next := f(); next; next = f() { ... }

And maybe for higher numbers of yielded values than 2?

Except presumably the iteration vars have the same capturing semantics that we all love so well.

Further into esoterica, range loops over channels zero out the iteration vars at the end of each iteration, to avoid pinning large objects in memory while waiting an arbitrary length of time for a channels receive. It might not be a bad idea to do that same here, were this accepted.

josharian commented 3 years ago

What about:

for i := range f { ... }

when f returns two values plus a bool?

If allowed, I presume it is equivalent to:

for i, _, next := f(); next; i, _, next = f() { ... }
mknyszek commented 3 years ago

@ianlancetaylor Yup, that's pretty much all it is, though also extended to the 3-return-value form (i.e. T, S, bool) as @josharian points out.

@josharian I think clearing the variable at the end of the loop iteration is a good idea, if this goes anywhere. RE: generating a new variable on each iteration would be nice, but I suspect it's still probably better to keep things consistent. Maybe that one experiment to look at all of GitHub will pan out. :)

griesemer commented 3 years ago

1) Is there any reason for restricting the function signature to a fixed number (2 or 3) results? It seems to me that one could allow any signature (without incoming arguments) where the last result is of type bool. For instance:

for a1, a2, ... an, ok := range f { ... }

is translated as explained above (for just two results). This could even make sense if there's only a single (boolean) result. I think @josharian alluded to this as well, above.

2) I'm not convinced that the suggested syntactic sugar (that's what it is, after all) buys that much. That said, if we chose this translation:

for {
   a1, a2, ... an, ok := f()
   if !ok { break }
   ...
}

we would get new iteration variables for each iteration which would address the variable capture problem without the need for a new range syntax.

3) One issue that this proposal doesn't address is the problem of (efficiently) supporting iteration over recursive data structures such as a tree. Languages (e.g. Python) that support generators typically have some form of a yield operation which allows the iterator/generator to be written in a natural fashion. In Go we'd have to use a channel for cases like these. (Of course, the reason why we decided to co-opt channels for this is that running a generator function with a yield "in parallel" with an iteration using the yielded values is exactly because we have two coroutines in that case. It's just that we would prefer a more lightweight switchover.)

DmitriyMV commented 3 years ago

FWIW I don't think that 3rd point is that much of an issue. It can be solved using capturing variables in returning closure, and modifying them as we descend deeper into the data structure. For double linked types or single linked types that do not require backward iteration it will result in zero allocations. In in a case of single linked types that do require backward iteration (like tree) the cost can be amortized by preallocating "stack" and\or using sync.Pool.

I do think that this proposal fits quite well with generics proposal, since №1 usage of generics would be alternative implementation of lists/tree/map/queues and so on. That will result in the fact that 3rd party implementation usage would't feel "alien" as compared to slices/maps/ranges.

DmitriyMV commented 3 years ago

One thing I'd consider to be missing would be a way to inform the iterator that the caller is "finished" with the iterator, i.e. if the range loop were exited early for some reason.

@zikaeroh You can change

func (t *IntTree) Iter() func() (int, bool) {

to

func (t *IntTree) Iter() (it func() (int, bool), cancel func()) {

and then:


iter, cancel := tree.Iter()
for i := range iter {
    // Do stuff with i.
    if something {
        cancel()
    }
}
kokes commented 3 years ago

It might also be worth allowing the iterator's final return value to be an error instead of a bool, though I'm not totally sure how to surface that to the user.

I think this hints at another common practice today, most notably used by bufio, where the Scanner doesn't return the iterated value, it only returns a boolean and keeps the state within its struct. This allows for exposing any errors during said iteration.

for sc.Scan() {
  // do stuff with sc.Bytes() or sc.Text()
}
if sc.Err() != nil {
  // this is essential, but not obvious
}

And I think error handling is something that needs to be accounted for here as well. I don't think it's easy to come up with an API to incorporate error handling in iteration (especially with the proposed syntax leading to err falling out of scope beyond the for loop), but in many iteration cases, this is quite crucial - especially when doing I/O.

An obvious solution is to define said error to outlive the loop - but then you'd have to define T as well, because we can't use `:=`. So that would lead to yucky solutions like ```go var ( foo T err error ) for foo, err = f(); err != nil; foo, err = f() { ... } if err != nil { ... } ```

E.g. I can imagine looping through an HTTP API pagination and this may stop either when I get to the end or when there's an error in the response - I'd like to know which is which. With func() (T, bool), it's hard to communicate this.

That being said, I do like the proposal!

mknyszek commented 3 years ago
  1. Is there any reason for restricting the function signature to a fixed number (2 or 3) results? It seems to me that one could allow any signature (without incoming arguments) where the last result is of type bool. For instance:
for a1, a2, ... an, ok := range f { ... }

is translated as explained above (for just two results). This could even make sense if there's only a single (boolean) result. I think @josharian alluded to this as well, above.

I'm not opposed. I limited it to a fixed number to keep the loop looking familiar, but I don't feel strongly about it.

  1. I'm not convinced that the suggested syntactic sugar (that's what it is, after all) buys that much. That said, if we chose this translation:
for {
   a1, a2, ... an, ok := f()
   if !ok { break }
   ...
}

we would get new iteration variables for each iteration which would address the variable capture problem without the need for a new range syntax.

RE: not buying much, that's totally fair. Changing its translation would indeed address the variable capture problem, but I worry about range having inconsistent semantics. I dislike the variable capture problem that I'm open to it, though. :)

  1. One issue that this proposal doesn't address is the problem of (efficiently) supporting iteration over recursive data structures such as a tree. Languages (e.g. Python) that support generators typically have some form of a yield operation which allows the iterator/generator to be written in a natural fashion. In Go we'd have to use a channel for cases like these. (Of course, the reason why we decided to co-opt channels for this is that running a generator function with a yield "in parallel" with an iteration using the yielded values is exactly because we have two coroutines in that case. It's just that we would prefer a more lightweight switchover.)

I think you can get by without a goroutine + channel here for a tree. As @DmitriyMV points out, you can manage a stack explicitly. Though, I do recognize it's not quite as "natural" as recursion and potentially has its own performance costs since that stack will almost certainly need to be heap-allocated.

ianlancetaylor commented 3 years ago

With a complex recursive data structure, the natural traversal code is recursion. So it's natural to look for some way to express that in Go. In Python it's done with yield. This proposal is interesting but it doesn't give us yield.

mknyszek commented 3 years ago

@ianlancetaylor I agree. It does not.

Here's where I'm at on this point: natural iteration over a complex recursive data structures is generally hard (maybe impossible?) to break down into a loop unless you reify a stack somehow, be that with a generator, coroutine, continuation, etc. Given that we already have the concept of a goroutine, adding a new notion of a ""captured"" (loosely speaking) call stack feels non-orthogonal. I think generators, given their more limited scope, are the best candidate among the choices out there. And it's worth noting that a goroutine + channel won't be able to fulfill the same role. The observation in this proposal is that a stateful closure may act as a substitute, but in every recursive case it's not going to be natural, since "reifying the stack" with a stateful closure means just maintaining a stack explicitly.

It's not perfect, and recursive iteration is still better served by something like func (a *Tree) Iterate(fn func(n *TreeNode)) after this proposal, but I do think the proposal would still capture useful cases like, say, reading from a stream of unknown length, without having the synchronization overhead of a channel. Or iterating over two (or more) parallel slices, for example.

Broadly speaking, I have other reservations about this proposal (this isn't my biggest one, even) and I'm not sure I'm the best champion of it at this point. But, I do think that there's some value in the observation made here (and the idea just being out there), so maybe it will serve as a useful basis for some future proposal. :)

I think there's also a good argument for pushing the following pattern instead of this proposal, which is not listed in the proposal itself (my bad!) but fairly widely used (including in the standard library):

x := MyCoolDataStructure{}
// ... populate x
iter := x.Iter()
for x.Next() {
    a, b, c := x.Values()
}
if err := x.Err(); err != nil {
    return err
}

One big upside to this is that you don't have the variable capture problem! 🎉

One downside here is that you need to handle that error explicitly, and it's not unreasonable to think that the error handling could be missed, especially if MyCoolDataStructure presents itself as, say, a tree, but that tree lives on a hard disk or something. Maybe the right step forward here is figuring out how to make that error more prominent? Or to indicate to go vet that "Err() needs to be checked before the value is thrown away!" I don't know. That seems difficult to check in general, but could still be useful for this specific pattern, locally applied.

Another downside is it doesn't really help with recursive data structures. It has the same problem. 🤷

deanveloper commented 3 years ago

One issue that this proposal doesn't address is the problem of (efficiently) supporting iteration over recursive data structures such as a tree

Recursive structures under this proposal honestly don't look that bad because we can use recursion in our iterators. Sure, it doesn't look as clear as yield leftIter(); yield value; yield rightIter(), but it still looks fairly nice:

https://play.golang.org/p/mYZHNSNSq1S

type Tree struct {
    Val int
    Left *Tree
    Right *Tree
}

func (t *Tree) InOrderIter() func() (int, bool) {
    if t == nil {
        return func() (int, bool) { return 0, false }
    }
    leftIter := t.Left.InOrderIter()
    middleDone := false
    rightIter := t.Right.InOrderIter()
    return func() (int, bool) {
        i, ok := leftIter()
        if ok {
            return i, true
        }
        if !middleDone {
            middleDone = true
            return t.Val, true
        }
        i, ok = rightIter()
        if ok {
            return i, true
        }
        return 0, false
    }
}
jba commented 3 years ago

I spent a lot of time thinking about iterators for the Cloud Client libraries for Go. I also spoke to some Go team members. The conclusion I came to was that no single pattern would work for all cases. It was better to establish a set of guidelines, and follow the ones that made sense for each particular iterator.

I think any language change will either be simple like this proposal and exclude many real-world cases, or too complicated to be worth the complexity.

carnott-snap commented 3 years ago

Would generics help? After #43651 is implemented, you could implement an adaptor similar to IntTree.Iter:

though this would be far from zero cost, without some help from the compiler


package iter

type Able[T] interface{ Iter() func() T }

func Ator[T any](iter Able[T]) []T { var t T var s []T next := iter.Iter() for e := next(); e != t; e = next() { s = append(s, e) } return s }

```go
type tree[T any] struct{ /*...*/ }

func (tree[T]) Iter() func() T { /*...*/ }

func do(i Tree[int]) {
        for _, e := range iter.Ator(i) {
                fmt.Println(e)
        }
}
deanveloper commented 3 years ago

@carnott-snap I am personally opposed to that design... There is no difference between that and simple ToSlice() []MyType function.

There are many more advantages to the abstract idea of iterators than just iterating over slices, namely not needing to load the entire iterable data in memory at once (ie iterating over a large file) which also grants the ability to have infinite iterators (ie iterating on the fibonacci sequence).

carnott-snap commented 3 years ago

I completely agree, and like your name better, since there are other uses for slices. My, arguably cheeky, example was more a suggestion for a middle term solution, since traction for this seems weak.

That being said, you could implement iter.Ator within the compiler, think the unsafe or runtime, such that for i := range iter.Ator[uint](math.Fibonacci) {} never actually halts. Not sure if this is more or less compelling, but it keeps the language simpler?

Where math.Fibonacci is:


package math

var Fibonacci fibonacci

type fibonacci struct{}

func (f *fibonacci) Iter() func() uint { var u uint return func() uint { u++ return uint(math.Pow(math.Phi, float64(u))/math.Sqrt(5) + 0.5) } }

jba commented 3 years ago

This proposal violates an important property of Go: syntax never runs user code. That is the main reason why Go code is so readable. It can be annoying at times, like when you have to construct some sort of comparable value to use as a map key, but I think it's important to preserve.

The problem is not that the construct can block or consume arbitrary CPU; it's that the range syntax can run code that wasn't written into the compiler or runtime.

mvdan commented 3 years ago

@jba I get what you mean intuitively, but I still argue that such a hard rule does not exist in practice. A return statement can indirectly run arbitrary code via defers or runtime finalizers, and imports run init functions, to name some examples where you're not calling or running code in a way that's syntactically obvious.

I also don't think talking about arbitrary code helps that much. Take my example of ranges making huge copies earlier; I can almost guarantee you that a for range taking up seconds of CPU time doing copies would confuse most Go users, yet it would not be caused by your definition of arbitrary code.

deanveloper commented 3 years ago

@carnott-snap at that point, why not just allow math.Fibonacci() to be iterable? Also, what would iter.Ator(math.Fibonacci()) do outside the context of range?

carnott-snap commented 3 years ago

@deanveloper: you totally could, I just wrote it to the spec i designed above. Plus, there may be other methods you want to expose, and what if two callers want different fibonacci iterators.

By my implementation iter.Ator would be used exclusively for range statements, thought iter.Able could be called directly, or integrated elsewhere.

DeedleFake commented 3 years ago

It was pointed out in the discussion for the proposed container/set package that most iterator proposals don't allow iteration over maps in an efficient way without hooking into the runtime directly. For reference, see https://github.com/golang/go/discussions/47331#discussioncomment-1036487. yield can do it, as mentioned above, but I don't think that a language change is actually necessary. Kotlin manages to fake yield via higher-level functions alone, and I think that Go could do much the same, though the repeated typing of the closures would probably get kind of annoying, a la #21498.

As an example, what if the function signature of this proposal was changed from func() (T, bool) to just simply func(emit func(T))?

Disclaimer: I wrote the following code at 3 in the morning and didn't try to compile it. Edit: Here's a playground link. It times out sometimes, but if you try several times it seems to start working eventually, usually.

// Could call it yield instead of emit. I'm not particularly partial to either. Kotlin calls it
// emit, so I went with that for this example.
type Iter[T any] func(emit func(T))

// New returns its own argument to allow for type inference.
func New[T any](iter Iter[T]) Iter[T] {
  return iter
}

func Of[T any](vals ...T) Iter[T] {
  return New(func(emit func(T)) {
    for _, v := range vals {
      emit(v)
    }
  })
}

func OfMap[K comparable, V any](m map[K]V) Iter[Pair[K, V]] {
  return New(func(emit func(Pair[K, V])) {
    for k, v := range m {
      emit(NewPair(k, v))
    }
  })
}

// I still don't like top-level functions for things like map and filter, but there
// doesn't really seem to be an alternative, unfortunately.
func Map[T any, R any](iter Iter[T], m func(T) R) Iter[R] {
  return New(func(emit func(R)) {
    iter(func(v T) {
      emit(m(v))
    })
  })
}

// Tree wouldn't really be in the same package in an actual implementation, obviously.
type Tree[T any] struct {
  Left, Right *Tree[T]
  Value T
}

func (t *Tree[T]) InPlaceIter() Iter[T] {
  if t == nil {
    return Of[T]()
  }

  return New(func(emit func(T)) {
    t.Left.InPlaceIter()(emit) // Not thrilled about the double parentheses.
    emit(t.Value)
    t.Right.InPlaceIter()(emit)
  })
}

Error handling isn't really covered by this, so more complicated stuff like iterating over the results of scanning with a bufio.Scanner might be a bit tricky, but that could probably be handled by some kind of Result[T any] type that can be either a value or an error. Emitting those would make things like mapping with errors easier than if they were just simply returned, too.

Also, code similar to the above causes the current tip of the dev.typeparams branch to choke, last I checked. It seems to have trouble parsing the definition of Map() in particular. It worked with go2go, though.

deanveloper commented 3 years ago

Alright, I've come to like emit/yield a lot more than the proposed solution. I recently had to write a recursive permutation (with repeats) algorithm, and I decided I would try to write it under how iterators work with this proposal. However I did not come to like the solution very much.

The recursive permutation algorithm looks like this when using slices, it's quite intuitive:

func Permutation(n, r int) [][]int {
    // base case r==0, return a single nil slice
    if r == 0 {
        return [][]int{nil}
    }

    var permutations [][]int
    for _, subPerm := range Permutation(n, r-1) {
        for i := 0; i < n; i++ {
            newPerm := make([]int, r)
            newPerm[0] = i
            copy(newPerm[1:], subPerm)

            permutations = append(permutations, newPerm)
        }
    })
    return permutations
}

The pattern that @DeedleFake proposes, we see that it looks nearly identical to our solution using slices!

func PermutationYield(n, r int) func(yield func([]int)) {
    return func(yield func([]int)) {

        // base case r==0, return a single nil slice
        if r == 0 {
            yield(nil)
            return
        }

        subPermIter := PermutationYield(n, r-1)
        subPermIter(func (subPerm []int) {
            for i := 0; i < n; i++ {
                newPerm := make([]int, r)
                newPerm[0] = i
                copy(newPerm[1:], subPerm)
                yield(newPerm)
            }
        })
    }
}

However under the proposed "function values as iterators" pattern, we need to do a lot of workarounds in order to keep track of where we are in the iterator. This leads to a headache to both write, and to read:

func PermutationFunc(n, r int) func() ([]int, bool) {

    // base case r==0, return a single nil slice
    if r == 0 {
        var singleReturned bool
        return func() ([]int, bool) {
            if singleReturned {
                return nil, false
            } else {
                singleReturned = true
                return nil, true
            }
        }
    }

    var currentPrefix int
    var currentSubPerm []int

    subPermIter := PermutationFunc(n, r-1)

    return func() ([]int, bool) {
        if currentPrefix == 0 {
            newSubPerm, ok := subPermIter()
            if !ok {
                return nil, false
            }
            currentSubPerm = newSubPerm
        }
        newPerm := make([]int, r)
        newPerm[0] = currentPrefix
        copy(newPerm[1:], currentSubPerm)

        currentPrefix = (currentPrefix + 1) % n
        return newPerm, true
    }
}

I personally think that the emit/yield pattern is much easier to both write AND to read. I think @DeedleFake may have gotten a bit lost in abstraction with their example, which may have led to some people not understanding the simplicity of the emit/yield solution.

edit - Forgot to respond with the playground link, which shows that the two solutions have identical output: https://play.golang.org/p/MIX5qrxCgK2

edit 2 - Replaced Javascript solution with existing solution in Go, returning a slice instead of a form of iterator.

bcmills commented 3 years ago

(On a meta level: I think it would probably be useful to open a GitHub Discussion for iterators. The design space is wide.)

bcmills commented 3 years ago

@deanveloper, I have two concerns with the emit / yield approach:

1. It does not provide a clear way to shut down the iterator.

Making yield panic to shut down would be extremely unidiomatic.

Making it return a bool could perhaps work, but then it wouldn't be obvious to me whether a false return means “I consumed the value, don't produce another one” or “I didn't actually consume the value after all, put it back”. In the former case, there is no way to avoid consuming the first value from the iterator; in the latter case, iterator implementations would be forced to provide the capability to reject an already-produced value, which would be easy enough for iterators over static containers but much more complex for, say, ordered queues.

2. While it makes iterators over static data much simpler to write, it makes algorithms that split or combine multiple iterators much more difficult. Consider, for example:

// Interleave returns the elements from each iter in iters in round-robin order.
func Interleave[E any](iters ...Iterator[E]) Iterator[E]

How would you implement that using yield? (I believe I would be possible, but quite complex — and may involve multiple goroutines, in which case you'd also have to worry about managing goroutine lifetimes.)

deanveloper commented 3 years ago

It does not provide a clear way to shut down the iterator.

This is true, I believe that your solution with returning a bool may be the best solution. However I will be talking about a different solution in a moment.

Making it return a bool could perhaps work, but then it wouldn't be obvious to me whether a false return means “I consumed the value, don't produce another one” or “I didn't actually consume the value after all, put it back”. In the former case, there is no way to avoid consuming the first value from the iterator; in the latter case, iterator implementations would be forced to provide the capability to reject an already-produced value, which would be easy enough for iterators over static containers but much more complex for, say, ordered queues.

This is the case for any iterator implementation, is it not?


I did a bit of research (it's been a long time since I've used Kotlin) and the documentation for yield can be found here. Essentially their implementation of sequence does what our yield does, except they translate it into a traditional Java iterator. However looking into it, it looks like it uses a busy wait to implement it. Although looks like there's some coroutine stuff in there as well, so maybe it isn't an actual busy wait.

Come to think of it, the yield solution is basically the "channel" solution for iterators, although doesn't spin up a new goroutine. If the proposed solution is allowed as an iterator, there could be a standard library function to convert a channel-based iterator into a function-based iterator:

package iterators

func FromChanFunc[T any](func (chan<- T)) func(T, bool) {
    ch := make(chan T)
    go func() {
        func(ch)
        close(ch)
    }()
    return func(T, bool) {
        val, ok := <-ch
        return val, ok
    }
}

Not sure how much I like this though, especially in the stdlib. It just doesn't feel right for a pretty primitive stdlib function to spin up a goroutine.

Really the question comes down to "can we guarantee to consume only a finite number of values with the yield solution?` where the answer seems to be no, unless we add a boolean return. However it may be pretty easy for someone writing an iterator to forget about the boolean, and forcing a return value makes the function a bit harder to use.

However with the proposed solution, we can't iterate over a map, and iterating over recursive structures is quite difficult. Not sure quite what the best solution is, but seems that pretty much any other modern language (at least javascript, rust, kotlin, python) is to have an Iterator (similar to this proposal) as well as a "generator" which uses coroutines. Not sure if we want to go with that solution, although generators do make writing iterables a lot easier. Granted we may not really need a generator since we have channels, and generators would need to rely on goroutines anyway. So in a way, we already have them!

bcmills commented 3 years ago

In the former case, there is no way to avoid consuming the first value from the iterator; in the latter case, iterator implementations would be forced to provide the capability to reject an already-produced value …

This is the case for any iterator implementation, is it not?

No. With the naive “function as iterator” approach there is no need to put back values. (To avoid consuming the first value from the iterator, just don't invoke the function.)

The yield or channel-based approach unrolls the pipeline by one call, so that the first value is produced before it is first needed (instead of “on demand” at that point).

bcmills commented 3 years ago

However with the proposed solution, we can't iterate over a map,

Sure we can! We just can't iterate over a map without using reflect. 😅 (https://go2goplay.golang.org/p/soktqtU8HF5)

and iterating over recursive structures is quite difficult.

Maybe? I guess you'd need to maintain an explicit stack of nodes, but I also wonder if we could provide a helper to simplify that.

deanveloper commented 3 years ago

Ah - I see what you mean now. In my recent project I've refactored away from the yield solution, and decided to just go with channel iterators, which as I stated seem to function the exact same as generators in other languages (albeit possibly running on a separate thread).

Maybe? I guess you'd need to maintain an explicit stack of nodes, but I also wonder if we could provide a helper to simplify that.

Come to think of it, with some iterator utilities, it might be pretty simple. Here is my Permutation example from earlier, which is essentially a DFS with a variable number of children:

func Permutation(n, r int) func() ([]int, bool) {

    // base case r==0, return a single nil slice
    if r == 0 {
        return iterators.Of([]int{})
    }

    // now that's a bit of a doozy for an anonymous function signature...
    return iterators.FlatMap(Permutation(n, r-1), func(subPerm []int) func() (func() ([]int, bool), bool)  {
        return iterators.Map(iterators.Range(0, n), func(i int) func([]int, bool) {
            newPerm := make([]int, r)
            newPerm[0] = i
            copy(newPerm[1:], subPerm)
            return iterators.Of(newPerm)
        })
    })
}

Or for an in-order traversal of a binary tree:

type Tree struct {
    Val int
    Left *Tree
    Right *Tree
}

func (t *Tree) InOrderIter() func() (int, bool) {
    if t == nil {
        return iterators.Of[int]()
    }

    return iterators.OfIterators(
        t.Left.InOrderIter(),
        iterator.Of(t.Val),
        t.Right.InOrderIter(),
    )
}

I unfortunately can't think of a way to get a yield-like solution that doesn't involve spawning a goroutine, especially since the iterator doesn't know when it'll be thrown away.

deanveloper commented 3 years ago

Note that if #19702 is reconsidered, then channels would be able to be used as iterators without any language changes.

earthboundkid commented 3 years ago

we would get new iteration variables for each iteration which would address the variable capture problem without the need for a new range syntax.

I don’t like the idea that you might get the good variable capture behavior based on the argument to range. It should be obvious at the call site what behavior it has with respect to capture without thinking about if the argument is a slice or an iterator or what. The result of such nuanced behavior is that users would start wrapping slices in iterators just to get the new style capturing. Yuck.

A whole new keyword or new syntax are really the only ways to solve that wart responsibly.

deanveloper commented 3 years ago

As @mvdan said, it may be a good idea to open up a GitHub Discussion for iterators. I think it would be really useful and a lot more organized, especially since it seems like there are quite a few ideas about how to solve the problem. Iterators are also very important to have since they essentially block the container/set proposal, as well as any other iterable, non-built-in data structure.