mknyszek commented 3 years ago

Proposal: Function values as iterators

Motivation

A number of proposals have come up over the years of supporting some first-class notion of iterator such that loops that use the range keyword may be used to iterate over some custom data structure. The Go community in general has also wondered about the "right" way to describe an iterator, evidenced by this blog post from 2013 that describes many of the ways Go programmers abstract away iteration, and much of this is true today as well.

Overall, however, iteration over a non-builtin data structure may be somewhat error-prone and clunky. To refresh on the various ways that iteration may be abstracted away in Go, let's work through some examples.

Firstly, we may have some tree type with some dedicated iterator type:

type IntTree struct {
    ...
}

func (t *IntTree) Iter() IntTreeIterator {
    ...
}

type IntTreeIterator struct {
    ...
}

func (t IntTreeIterator) Next() (int, bool) {
    ...
}

One way such an "iterator" could be used is,

iter := tree.Iter()
for {
    i, ok := iter.Next()
    if !ok {
        break
    }
    // Do stuff with i.
}

or even,

iter := tree.Iter()
for i, ok := iter.Next(); ok; i, ok = iter.Next() {
    // Do stuff with i.
}

The former usage works fine, but suffers from readability issues. The meaning of the code is obscured somewhat as the reader first sees an apparently infinite for-loop and must look for the "break" statement to understand what's going. Luckily this condition is usually present at the top of the loop, but it requires a more careful look. Furthermore, the iteration condition needs to be written explicitly. Writing it once may not be a problem, but writing it 100 times might be.

The latter usage also works fine and the intent is more clear, but it has a similar problem with the iteration condition. There's also an element of repetition which on the surface is fine, but it does harm the readability of the loop. Especially with variable names like "i" it becomes easy to get lost in punctuation.

Another way to abstract iteration away is to pass a function value to a function that iterates on behalf of the caller. For example:

type IntTree struct {
    ...
}

func (t *IntTree) ForEach(func(i int)) {
    ...
}

tree.ForEach(func(i int) {
    // Do stuff with i.
})

This method works well in many scenarios, but is decidedly less flexible as it separates the loop body from the surrounding code. Capturing local variables in that function value helps, but potentially at the cost of some efficiency, depending on the complexity of the iteration. One advantage of this method, though, is that defer may be used to perform clean up on each loop iteration, without allocating a defer (thanks to open-coded defers).

Prior art

A previous proposal (#40605) suggested allowing types with a Next method to have that method repeatedly called when used in a range loop:

iter := tree.Iter()
for i := range iter {
    // Do stuff with i.
}

This works fine, but from my perspective, doesn't feel very Go-like. Having a language construct be aware of a type's methods for the sake of syntactic sugar is not a pattern found anywhere else in the language (yet). In the parlance of the generic design, the existing range keyword matches on the underlying type used, not the type's methods.

Furthermore, it usually requires defining a new type which is a bit more work for the writer of the code as well as the reader. Overall a bit clunky, but not bad. It lines up well with how other languages work. Rust, for instance, uses a trait (analogous to an interface) to determine whether a type is an iterator.

Another previous proposal (#21855) suggested supporting a two-clause for-loop to make iterating less error-prone, such as:

iter := tree.Iter()
for i, ok := iter.Next(); ok {
    // Do stuff with i.
}

Unfortunately, a two-clause loop form is itself error-prone, as the placement of a single semicolon has a significant effect on semantics (specifically, the second semicolon which distinguishes between the 2-clause and 3-clause forms). This proposal was rejected because that semicolon was considered too dangerous.

Other proposals (#24282) have been made to fundamentally change how for loops work, indicating at least some degree of friction.

Proposal

Rolling with the idea of closures, and with the observation that the range form matches on types largely structurally, I would like to propose that range loops repeatedly apply function values of a particular form.

More specifically, I propose allowing the for-range statement to accept values of type func() (T, bool) or func() (T, S, bool) (where T and S are placeholder types; they may be any substituted for any other type) and will repeatedly call these values until their bool result is false.

Iterators may then be written in the following way:

type IntTree struct {
    ...
}

func (t *IntTree) Iter() func() (int, bool) {
    ...
    curr := t.root
    return func() (int, bool) {
        // Use curr to keep track of your position, repeatedly find the successor of curr
        // until there is none.
    }
}

for i := range tree.Iter() {
    // Do stuff with i.
}

More precisely, the last for statement "de-sugars" into the following code, where tmp is a temporary variable not visible to the program:

tmp := tree.Iter()
for i, ok := tmp(); ok; i, ok = tmp()  {
    // Do stuff with i.
}

The limitation to variables of type func() (T, bool) or func() (T, S, bool) (instead of allowing an arbitrary number of return values) is to keep range loops looking familiar and to avoid a misuse of the syntax.

Discussion and observations

Pros:

Efficient: with appropriate inlining, the Go compiler can stack-allocate any values captured by the function value returned by Iter and perform a direct function call for each iteration (theoretically that could be inlined further, but it depends).
Shifts API complexity to the writer: users get all the benefits of the range form, but writers have a little more work to do to generate the function value.
Lightweight: allows the writer to avoid defining a new type in many cases.
Sets the standard: by picking this structure, it sanctions a particular way of writing iterators at the language level.
Composes well: with generics, iterators defined this way can be easily transformed (map, filter, zip, etc.).

Cons:

Hidden behavior: similar to operator overloading, there's now one more thing that range could mean, and each call could be arbitrarily expensive (e.g. a series of HTTP requests).
- Counter-point: (credit to @mvdan) channels may cause a for-range loop to block for an arbitrarily long time, and copies created during iteration could also cause surprising CPU usage.
Outsized impact: a whole language change for something that is arguably only a minor issue.
Unfamiliar: users coming from other popular statically-typed languages might find this behavior surprising.
Ambiguity: the range keyword doesn't always make sense to the reader (it does for custom data structures and in some other cases, though).
Ambiguity: the type func() (T, bool) may not always mean "iterator." A Next method is more explicit.
- Counter-point: the fact that it doesn't accept any arguments is a pretty good signal, though.

This idea was inspired by the way Python generators are used as iterators. I recently had the realization that Python generators can approximated by repeated application of closures. I thought that this would have been proposed already, but I couldn't find anything like it. Interestingly, this proposal also allows for iterating over infinite generators and such. It can also be used to write Python-like loops (for better or for worse):

// Linspace returns an iterator that iterates from start (inclusive)
// to stop (exclusive) by stride. start must be <= stop.
func Linspace(start, stop, stride int) func() (int, bool) {
    i := start
    return func() (int, bool) {
        i += stride
        return i, i < stop
    }
}

for i := range Linspace(0, 10, 2) {
    // i = 0, 2, 4, 6, 8
}

It might also be worth allowing the iterator's final return value to be an error instead of a bool, though I'm not totally sure how to surface that to the user.

I'm not even totally convinced myself that this is worth doing, since the benefit seems minor at best, but I figured I should at least put the idea out there.

scott-cotton commented 3 years ago

@bcmills question about one of your recent comments:

Making it return a bool could perhaps work, but then it wouldn't be obvious to me whether a false return means “I consumed the value, don't produce another one” or “I didn't actually consume the value after all, put it back”. In the former case, there is no way to avoid consuming the first value from the iterator; in the latter case, iterator implementations would be forced to provide the capability to reject an already-produced value, which would be easy enough for iterators over static containers but much more complex for, say, ordered queues.

I thought "false" would mean: "ignore the returned value because I'm done". In this case, the first call on an empty iteration will return false, and the we do not have to put anything back.

I also am not particularly enthusiastic about the idea of user defined ranges, but I am not opposed. So far, having range accept a function of signature func() (T, bool) or func(I, T, bool) seems the most simple to me.

I would like to encourage proposals for user defined Rangers to discuss guidelines for concurrency.

deanveloper commented 3 years ago

I thought "false" would mean: "ignore the returned value because I'm done". In this case, the first call on an empty iteration will return false, and the we do not have to put anything back.

How would I do this for a queue-like data structure where I can’t easily put the element back? Or an iterator over network calls?

scott-cotton commented 3 years ago

I thought "false" would mean: "ignore the returned value because I'm done". In this case, the first call on an empty iteration will return false, and the we do not have to put anything back.

How would I do this for a queue-like data structure where I can’t easily put the element back? Or an iterator over network calls?

If you are thinking about concurrency, I think any queue has a race condition for iterating unless the channel idea of closing is introduced.

But for a common queue, why not

type Queue[T any] ...

func (q Queue[T]) Ranger() func() (result T, done bool) {
    return func() (T, bool) {
        if q.hd == q.tl {
            return
        }
        done = true
        idx := q.hd
        q.hd++
        result = q.data[idx]
        return
   }
}

deanveloper commented 3 years ago

@wsc0 I was not discussing concurrency - that comment was aimed at the yield proposal (where a bool return decides whether to continue or not). The issue with the boolean return is that in order to say “do not iterate”, the function argument for the iterator would need to be called once, meaning that the iterator would have already consumed a value.

For something like an iterator of sequential network calls, this would mean that in order to “take 0 values from the iterator”, you would need to take 1 value of the iterator in order to say “please don’t keep iterating”

scott-cotton commented 3 years ago

@deanveloper Ah I see now. I always found yield hard to wrap my head around.

deanveloper commented 3 years ago

@wsc0 Yield in general (the generator pattern) is actually a very nice pattern for iteration. It allows you to make iterators in the exact same solution as if you were appending to a slice, as seen in https://github.com/golang/go/issues/43557#issuecomment-895211452. It was just that this particular implementation of it was poor, so there were quite a few problems.

DeedleFake commented 3 years ago

@wsc0

What I always found confusing about yield in Python was that the usage of yield somewhere in the function automatically caused the function itself to behave completely differently from any other function, but there was no difference of any kind in the function signature. I think that a lot of that confusion can be avoided if it's more obvious that a yielding function isn't a normal function, assuming language-level support for it. If the support for it isn't language-level than I think that that'll solve itself. Good documentation helps, too.

deanveloper commented 3 years ago

I think I may open a proposal which is a reconsideration of https://github.com/golang/go/issues/19702, perhaps with some more details and a more persuasive case. It seems to have been closed for a few reasons:

There should be communication between the reader and writer of a channel when the channel is closed. (While I largely agree with this, it doesn't make as much sense for iterators).
Deadlock stack traces would not be useful, as the goroutines would be removed. (Perhaps under that propsal this is true. However, under my prospective proposal, channels which still have references at both ends would not be garbage collected. This means that in actual deadlocked situations, where references to both ends of a channel are still held, we will still see deadlocked goroutines).

We already have generator-based iterators in the form of channels, but we cannot use them because we would need to exhaust the channel in order to stop the goroutine. In my eyes, the only blocker that I'm not really sure about is defer and whether or not defers/finalizers should be ran when the goroutine is GC'd. I'm thinking not, but I'll think more about this either when I write the proposal or a discussion is opened.

earthboundkid commented 3 years ago

19702 is locked, so can't comment there, but what about something like `runtime.Deadlocked() <-chan struct{}` used like this?

func numbers() <-chan int {
    ch := make(chan int)
    go func() {
        for i := 0; ; i++ {
            select {
            case ch <- i:
            case <-runtime.Deadlocked():
                return
            }
        }
    }()
    return ch
}

ISTM the trouble with the issue as opened before was A) it's expensive to find the dead channels B) there was ambiguity about how to handle the exit (runtime.Goexit vs. defers, etc). Having an explicit select statement resolves those, no?

earthboundkid commented 3 years ago

Even better, with generics, you could write:

func yielder[T any](ch chan T) func(t T) {
    return func(t T) {
            select {
            case ch <- t:
            case <-runtime.Deadlocked():
                return
            }
    }
}

func numbers() <-chan int {
    ch := make(chan int)
    yield := yielder(ch)
    go func() {
        for i := 0; ; i++ {
            yield(i)
        }
    }()
    return ch
}

😆

Okay, that's a sort of pointless way to get out of writing a select statement, but it's interesting in principle as part of thinking about iterator functions. ISTM, we already have a really good general purpose iteration mechanism (channels), but we don't use it because A) it's too slow/expensive and B) there's no good way to handle closing. Can we solve problems A and B and thus get out of adding a whole new mechanism for iteration?

deanveloper commented 3 years ago

Here is a pretty good wrapper function that could be in the standard library, provided that runtime.Deadlocked would exist sometime in the future:

package chans

func Generator[T any](generator func(yield func(T) bool)) <-chan T {
    ch := make(chan T)

    go func() {
        generator(func(incoming T) bool {
            select {
            case ch <- incoming:
                return true
            case <-runtime.Deadlocked():
                close(ch)
                return false
            }
        })
        close(ch)
    }()
    return ch
}

which would be used like:

package main

func Fibonacci() <-chan int {
    return chans.Generator(func(yield func(int) bool) {
        var a, b = 0, 1
        for {
            ok := yield(a)
            if !ok {
                return
            }
            c := a + b
            a = b
            b = c
        }
    })
}

The ok return mainly exists for infinite generators to prevent infinite looping, but in the common case, if it is ignored it shouldn't break programs (unlike the original yield/emit solution). Alternatively we could do some sort of panic, but I think that may cause some issues.

go2go link seems to time out sometimes, but it works other times: https://play.golang.org/p/28dKug56JdQ

However this seems to be an issue with go2go because the non-generic solution works fine in the regular playground: https://play.golang.org/p/28dKug56JdQ

earthboundkid commented 3 years ago

That can be simplified to

func Fibonacci() <-chan int {
    return chans.Generator(func(yield func(int)) {
        var a, b = 0, 1
        for yield(a) {
            c := a + b
            a = b
            b = c
        }
    })
}

Having an explicit deadlock test also makes it easier to deal with the "how to put the queue item back" problem, since there's an explicit way to check for the loop being exhausted and then run arbitrary code.

DeedleFake commented 3 years ago

While I'm not sure if runtime.Deadlocked() is exactly the right approach, I do think that it's a good idea. Cooperative cancellation, as with context.Context, feels a lot more Go-like. Maybe, at least to start, it might make sense to make it runtime.deadlocked() instead and then only create a iter.Generator() wrapper around it with an unsafe hook into the unexported runtime API. If you export it, it's going to encourage people to rely on other people checking for that deadlock and you're right back to the same situation as the ignored boolean, but now for potentially every single channel usage that crosses an API boundary.

In whichever case, though, I also worry a little about potential performance issues stemming from having to multi-thread this. I'm not entirely sure that just directly using a channel is the correct approach for a generic iterator system, either. I think that this should just be one possible way to create an iterator, be it iter.Iter[T] or whatever else. Otherwise middleware functionality like map and filter is going to be kind of weird.

@deanveloper

go2go link seems to time out sometimes, but it works other times

The go2go playground's been having that problem a lot recently. I had a ton of trouble with it the other day.

Also, several of the examples above forgot the bool return for yield in the function signature.

deanveloper commented 3 years ago

Also, several of the examples above forgot the bool return for yield in the function signature.

My bad - I had added the bool in later and forgot to re-copy the examples into my comment. I've fixed that now.

deanveloper commented 3 years ago

In whichever case, though, I also worry a little about potential performance issues stemming from having to multi-thread this.

Correct me if I am wrong, but pretty much every Generator uses some form of coroutine in order to implement the pattern. I've checked Kotlin and Rust at the very least. However I also believe that they control the coroutines a bit better so that it still operates single-threaded, and the value-passing also works differently from using a channel. The goroutine/channel overhead is quite the problem at large scale, in this case I have a binary tree of 10,000 elements and show how slow iteration is on each one: https://play.golang.org/p/1asKVzsdLNn. You cannot get results in the go playground, as time is fixed. Feel free to try on your own machine.

Results on my machine are ~1ms for purely-function-call based iteration, and ~480ms for the channel-based iteration. I expected the performance to be worse, but not that bad. Perhaps some optimizations could be done to chans.Generator specifically, but I'm not very hopeful that we can get close enough.

EDIT - As a test, I implemented it in Javascript, and it runs for ~9ms. There definitely has to be a more optimal way to do this without goroutine overhead. https://gist.github.com/deanveloper/32d6c0c9b915f464cb917d5f76c34dd9

EDIT 2 - Did some more testing: https://play.golang.org/p/87HxaEmecno (again, execute locally for correct timings). Even with only a single goroutine created, the channel overhead is quite high. To iterate over a 10k element slice, it takes 20ms with chans, and 0.2ms with functions. That's 2 orders of magnitude... I'm not sure if we can get optimizations to help that much.

DeedleFake commented 3 years ago

I think without language-level support, there's going to be some pretty bad performance issues that aren't really easily resolvable. It might be possible to optimize the channel for this specific usage and cut out some of the overhead, as I'm guessing that quite a bit of that comes from locking that isn't useful in this particular situation.

earthboundkid commented 3 years ago

Setting GOMAXPROCS to 1 cuts the tree time by 20% and the slice time by 50%, but clearly there are still orders of magnitude of overhead to work out. I do think it is theoretically doable though, since many other languages have a coroutine mechanism. There "just" (ha!) needs to be some way to detect that the routines will always yield to each other and set them to a simpler scheduler once that's detected.

jba commented 3 years ago

I think the best way to handle functions that need to have some sort of cleanup or close is to return a second func() value:

gen,  close := chans.Generator(func(yield func(int) bool) {...})
defer close()

No runtime support is necessary, and the pattern is familiar to Go programmers from context.Cancel.

deanveloper commented 3 years ago

Unfortunately that remove the ability to have "inline" generators, like my earlier examples (ie for i := range Fibonacci()). That is, unless we add a language change, such that range on a function call that returns (<-chan T, func()) will call the 2nd return value when the for loop exits. That seems like a strange language feature outside of iterators and it would only work on function calls.

It may be quite bothersome to manually call an extra close() when iterating, and it makes passing around iterators a lot more difficult as well. In IO, pipelining is a lot easier, since it is all handled in a single variable (which may implement io.Closer). But needing to pass around 2 variables for a single iterator is a bit cumbersome.

jba commented 3 years ago

@deanveloper, I don't think those are big problems.

Unfortunately that remove the ability to have "inline" generators, like my earlier examples (ie for i := range Fibonacci()).
gen, close := Fibonacci()
defer close()
for i := range gen ...
But needing to pass around 2 variables for a single iterator is a bit cumbersome.

You normally wouldn't do that. Think about how you use *os.File: the code that opens the file is responsible for closing it. For iterators it would be:

gen, close := Fibonacci()
defer close()
process(gen)

I'm sure there are situations where the close needs to happen somewhere else, but I'm guessing they're rare.

As for this all being cumbersome, that's true, but I don't think the language or runtime features you're hoping for are likely to happen. You could also do something like

type Iter[T any] interface {
    Next() (T, bool)
    Close()
}

and then it would be more like a file, but it turns out that forgetting to call Close is a real problem, so I think the clumsiness is actually helpful.

jba commented 3 years ago

I've actually been working for a bit on speeding up this conversion from the "yield" style iterator to a one-at-time iterator. I think the first step is to remove the explicit channel:

type Iter func() (int, bool)

You get the next element by calling the function; a second return value of false tells you you're done. (@bcmills has designed a nice iterator package with generics using a similar definition.)

Now that the channel is hidden, you can play tricks like using a chan []T to buffer the collected elements. The fastest I've been able to come up with so far takes a user-supplied buffer to avoid allocation, splits it in two, and swaps the halves between the producer and consumer in a kind of double-buffering:

func ToIter(b []int, ranger func(f func(int) bool)) (Iter, func()) {
    c := make(chan []int)
    done := make(chan struct{})
    n := len(b) / 2
    if n == 0 {
        panic("buffer too small")
    }
    go func() {
        b1 := b[:0]
        b2 := b[n:n]

        // producer
        ranger(func(v int) bool {
            b1 = append(b1, v)
            if len(b1) >= n {
                select {
                case c <- b1:
                case <-done:
                    return false
                }
                // Since channel is unbuffered, at this point the consumer                                                                                                                                                                    
                // has taken b1, meaning it no longer needs b2.                                                                                                                                                                               
                b1, b2 = b2[:0], b1
            }
            return true
        })

        // consumer
        if len(b1) > 0 {
            select {
            case c <- b1:
            case <-done:
            }
        }
        close(c)
    }()

    return iterFromSliceChannel(c), func() { close(done) }
}

func iterFromSliceChannel(c chan []int) Iter {
    var (
        s  []int
        i  int
        ok bool
    )
    return func() (int, bool) {
        if i >= len(s) {
            s, ok = <-c
            if !ok {
                return 0, false
            }
            i = 0
        }
        i++
        return s[i-1], true
    }
}

For enumerating the values of a simple binary tree, this code is about 40% slower than calling the yield iterator directly, for large enough trees (with a 16K buffer, around a million nodes). It's much worse for smaller trees, because of the initial overhead.

For small trees it's probably better to just use the yield iterator to populate a slice, and iterate over the slice.

deanveloper commented 3 years ago

Ahaha, and we're back at where we started. Earlier, I was thinking about if we could possibly get yield-style iterators to work in the more classical iterator fashion, which would be the most ideal way to do it. My idea was similar to yours, before realizing that we can simply use channels as iterators if we got the needed performance improvements.

To organize some of my thoughts, here is what I think of the main proposed solutions, in descending order of preference:

Using channels as iterators with performance improvements and runtime.deadlocked() <-chan struct{}
- This allows us to have efficient iterators without any language changes. Iterators would be able to be GC'd properly.
Using channels as iterators with performance improvements, and a returned cancel function
- Again, this allows us to have efficient iterators without any language changes, as well as runtime changes.
- Iterators are not something that need to be "closed" in other languages, I would expect that they get garbage collected like anything else.
The original proposed iterator, with utility functions to convert yield-style iterators to "real" iterators
- This is nice because we wouldn't need to make optimizations to goroutines and channels. I'm not sure how hard they are to optimize, but synchronization in general is a pretty hard problem.
- This does require making language changes if we want it to be range-able, but likely for the better.
- Without the utility functions, these iterators may be extremely hard to write.
The generator-pattern/yield-style/"Do" iterators
- Very easy to write iterators.
- Poorly written iterators which ignore the return value will have drastic consequences.
- Will likely never be range-able

~~Also I had a new idea, would it be possible to use a runtime finalizer to free the deadlocked goroutine? I'm unsure, but it could possibly work. I'll start testing something out.~~ (from a few tries at using finalizers, this doesn't seem possible. i'm not bothered by it since it's a pretty hacky solution anyway.)

jba commented 3 years ago

from a few tries at using finalizers, this doesn't seem possible

@ianlancetaylor used a finalizer to clean up a deadlocked goroutine in the generics proposal.

DeedleFake commented 3 years ago

Relevant code:

// Ranger provides a convenient way to exit a goroutine sending values
// when the receiver stops reading them.
//
// Ranger returns a Sender and a Receiver. The Receiver provides a
// Next method to retrieve values. The Sender provides a Send method
// to send values and a Close method to stop sending values. The Next
// method indicates when the Sender has been closed, and the Send
// method indicates when the Receiver has been freed.
func Ranger[T any]() (*Sender[T], *Receiver[T]) {
    c := make(chan T)
    d := make(chan bool)
    s := &Sender[T]{values: c, done: d}
    r := &Receiver[T]{values: c, done: d}
    // The finalizer on the receiver will tell the sender
    // if the receiver stops listening.
    runtime.SetFinalizer(r, r.finalize)
    return s, r
}

// A Sender is used to send values to a Receiver.
type Sender[T any] struct {
    values chan<- T
    done   <-chan bool
}

// Send sends a value to the receiver. It reports whether any more
// values may be sent; if it returns false the value was not sent.
func (s *Sender[T]) Send(v T) bool {
    select {
    case s.values <- v:
        return true
    case <-s.done:
        // The receiver has stopped listening.
        return false
    }
}

// Close tells the receiver that no more values will arrive.
// After Close is called, the Sender may no longer be used.
func (s *Sender[T]) Close() {
    close(s.values)
}

// A Receiver receives values from a Sender.
type Receiver[T any] struct {
    values <-chan T
    done  chan<- bool
}

// Next returns the next value from the channel. The bool result
// reports whether the value is valid. If the value is not valid, the
// Sender has been closed and no more values will be received.
func (r *Receiver[T]) Next() (T, bool) {
    v, ok := <-r.values
    return v, ok
}

// finalize is a finalizer for the receiver.
// It tells the sender that the receiver has stopped listening.
func (r *Receiver[T]) finalize() {
    close(r.done)
}

Something to note: This method of doing it makes it unsafe to give the raw channel to the user, which in turn means that the user can't directly use it with a select or with range.

Edit: Also, if the documentation for runtime.SetFinalizer() is correct, the code above has a bug, as the function passed to it above is a func(), but it expects a func(*Receiver). Hmmm... Makes me wonder if it could be changed to SetFinalizer[T constraints.Pointer[E], E any](obj T, finalizer func(T)). That would break the allowance for any number of ignored return values for finalizer, though.

deanveloper commented 3 years ago

Something that's neat about that Sender/Receiver pattern is that it satisfies func() (T, bool), so it works with the original proposed iterator. We could move Generator from package chans to package iterator, and redefine it like so:

package iterator

func Generator[T any](generator func(yield func(T) bool)) func() (T, bool) {
    send, recv := chans.Ranger()

    go func() {
        generator(send.Send)
        send.Close()
    }()

    return recv.Next
}

We could also add in the buffering that @jba mentioned to get a significant speedup, assuming that we don't get any goroutine/channel performance improvements that make it necessary. Even with the performance improvements, it will likely be beneficial to iterate over slices instead of using a buffered channel.

DeedleFake commented 3 years ago

For a proper performance improvement, it could also be reimplemented at some point later the way that a bunch of other languages do it, thus getting rid of the channel and background thread, and the API could be left as is.

I'm still not thrilled about the need for yield to return a value to indicate that there's nothing receiving anymore, though. It would be much preferable for the generator function to somehow just return on its own, running deferred functions on the way. Would it make sense for the yield implementation that gets passed to just call runtime.Goexit() instead of returning something?

deanveloper commented 3 years ago

Would it make sense for the yield implementation that gets passed to just call runtime.Goexit() instead of returning something?

Unfortunately I can't seem to get it to work. I'm not sure what's preventing recv from being GC'd, feel free to take a crack at it yourself. It may be that the original implementation of Ranger never called the finalizer either, I never tested it. https://go2goplay.golang.org/p/pPFUHTQyqI9

(Or, the Go 1 version which I was testing on my local machine: https://play.golang.org/p/6SaBW8TmNsy)

DeedleFake commented 3 years ago

@deanveloper

I fixed it. It seems like the problem was that the use of a method value for the finalizer function resulted in an extra reference being kept around, so it never got garbage collected. I changed it to a method expression and it worked.

adonovan commented 2 years ago

[gri, Jan 20] 1. I'm not convinced that the suggested syntactic sugar (that's what it is, after all) buys that much. That said, if we chose this translation:
for {
   a1, a2, ... an, ok := f()
   if !ok { break }
   ...
}
we would get new iteration variables for each iteration which would address the variable capture problem without the need for a new range syntax.

The proposal should not attempt to solve the variable capture problem (in which "for x := range seq" creates a single variable x for the whole loop, which is sometimes undesirable). It would be surprising and inconsistent for the compiler to create one variable per loop if seq is a slice, but one variable per iteration if seq is a function. Also, if the variable escapes, it would turn a single allocation into a linear number of allocations.

I'm not a fan of this proposal. Although the range syntax is neat, it is confusing at first sight, unlike most Go constructs; and as @jba points out, it creates for the first time a function call where there is no call syntax, and doesn't address the great variety of iteration constructs that come up in practice.

rogpeppe commented 2 years ago

@tv42 wrote:

You can replace any closure by defining a type that captures the free variables, returning a method value on that type. That often leads to more debuggable & readable code, too.

I tend to disagree with the second sentence there. In my view a closure is significantly less debuggable than an interface value because there's no way to tell what's going on inside it: we can't find out what type of iterator it is, and there's no way to define other methods on it that could be used to help debugging.

I also think that having an explicit type is clearer. When I see a func() (T, bool) it's not immediately clear that to me that it's an iterator.

tv42 commented 2 years ago

@rogreppe I think you misunderstood what I was trying to say. I also think closures are harder to debug than explicit types.

sammy-hughes commented 2 years ago

In case it meaningfully develops the discussion, I'm actually a vote in favor of closures.

My introduction to Go was in context of an API controller, backed by SQL. The standard database/sql Rows type had an iterator that carried a "Next()bool" method and "Scan(...interface{}) error" method. It was convenient, straight forward, and if you pre-allocated appropriate to the return size, was blisteringly fast.

I adopted it as an iterator in a two-closure fashion such as below. It's entered generically, but typically I would have redefined it per declared type. I get that closures aren't considered ergonomic, but I dissent. Excepting generics, the following is a pattern that is in use, in production code.

type IterRead[T any] func()*T
type IterWrite[T any] func(*T) bool
type IterNext[T any] func() bool
type Data[T any] []T

func iterSeek[T any](s *[]T, i int) (int, bool) {
    switch {
    case s == nil: return i, false
    case len(*s) > i+1: return i+1, true
    default: return i, false
    }
}
func iterScan[T any](s *[]T, v *T, i int) (int, bool) {
    i, ok := iterSeek(s, i)
    if ok { *v = (*s)[i] } // could just pointer+SizeOf(T)...I guess I haven't played with unsafe+generics since the preview on 1.17
    return i, ok
}

func Map(dP *Data[T]), f func(*T) T, z func() T) (IterRead[T], IterNext[T]) {
    var v T
    i, ok := -1, false
    return (
        func() *T { return &f(v) },
        func() bool {
            if ok { i, ok = iterScan(dP, &v, i) }
            if !ok { v = z() }
            return ok
        }
    )
}

It might feel heavy, but it's fairly easy to bundle stuff up out of the way. When used with concrete types, performance is quite good, and a read/write version is just func(p []T, z func()T) (r func()*T, w func(*T) bool, n func()bool). Of course, that all extends to a struct type with methods, but the pattern with closures just feels like less work to manage.

for instance, supposing the above, to transform a slice of any inferable type:

f, n := Map(data, mapFunc, zeroFunc)
for n() {
  fmt.Printf("%T: %v", f(), f())
}

And considering that there's not actually a dependency on generics here, as I use the pattern with concretes, it's an example of something that absolutely works. It could as easily be used with generics as with interfaces, meaningful or empty interfaces, as everything is defined by the caller, and type-assertions can be performed as appropriate by the caller.

I don't know what coroutines would look like as implemented in Go, but this pattern serves me as a poor-mans thread-local coroutine. To performance, this gets better if a pattern following this model could be guaranteed-inlined. In terms of ergonomics, I am stoked for #21498 to be finalized. Even in "today Go", I think this pattern has potential. If you then want to associate those independent closures with a struct, sure. Instantiate an interface-literal, and attach them? Sure.

johan-bolmsjo commented 2 years ago

One thing I'd consider to be missing would be a way to inform the iterator that the caller is "finished" with the iterator, i.e. if the range loop were exited early for some reason.

That could be solved with something like this:

type RangeIterator[T any] func() (value T, ok bool)

type ResourceIterator[T any] interface {
        Begin() RangeIterator[T]
        Close()
}

iter := tree.Iter()
for i := range iter.Begin() {
        // Do stuff with i
}
iter.Close()

Note that the range aspect of the proposal would be unaltered. The cleanup is just built into the outer type according to its particular needs.

johan-bolmsjo commented 2 years ago

I'm in favor of this proposal. Iterating over user collections or generators is something that I've always felt is a bit clunky in Go. I'm a huge fan, but nothing is perfect.

Today I wrote yet another generator that did not feel right and was thinking about writing a proposal myself. Researching the open issues revealed that multiple proposals already exist. I'll leave my comments here instead in the hope that they may influencing the decision to do something about the situation.

I landed in two solutions myself of which one is identical to the "range over closure" proposal presented here.

The second solution that I've always had in mind is to just introduce "proper while loops" as they exist in C. This suggestion was shut down in https://github.com/golang/go/issues/21855 but I think the risk of errors would be eliminated by using a "while" keyword instead of overloading another behavior with "for". Such a while would accept the same forms as "if".

Example: Procedural approach

iter := tree.Iter()
while i, ok := iter.Next(); ok {
        // Do stuff with i
}

Not to side track this proposal with another one. I just wanted to mention this alternative to get it off my chest.

As already listed, I'm aware of thee common ways to iterate over things.

For loop with manual exit (most verbose).
Thee clause for loop, repeating the iterator stepping call.
The scanner idiom

I don't think channel based approaches are reasonable for performance reasons.

Iterating over collections is something that is done fairly often. Having support for expressing this naturally would make it easier to read code (and IMHO make it slightly more elegant).

bboreham commented 2 years ago

Folks following this proposal will likely be interested in discussion: standard iterator interface

ianlancetaylor commented 1 year ago

Please see the discussion at #56413, which extends this idea.

ianlancetaylor commented 1 year ago

Closing this proposal in favor of the discussion at #56413, which includes much of what is discussed here, and more.

golang / go

proposal: Go 2: function values as iterators #43557

Proposal: Function values as iterators

Motivation

Prior art

Proposal

Discussion and observations

19702 is locked, so can't comment there, but what about something like `runtime.Deadlocked() <-chan struct{}` used like this?

golang / go

proposal: Go 2: function values as iterators #43557

Proposal: Function values as iterators

Motivation

Prior art

Proposal

Discussion and observations

19702 is locked, so can't comment there, but what about something like runtime.Deadlocked() <-chan struct{} used like this?

19702 is locked, so can't comment there, but what about something like `runtime.Deadlocked() <-chan struct{}` used like this?