proposal: runtime: garbage collect goroutines blocked forever

faiface commented 7 years ago

Hi everyone!

As of today, Go's runtime does not garbage collect blocked goroutines. The most cited reason being, that goroutines blocked forever are usually a bug and that collecting them would hide this bug. I would like to show a few examples, where garbage collected goroutines would really be appreciated and lead to a much safer and less buggy code.

What does it mean?

package main

import "fmt"

func numbers() <-chan int {
    ch := make(chan int)
    go func() {
        for i := 0; ; i++ {
            ch <- i
        }
    }()
    return ch
}

func main() {
    for i := 0; i < 1000; i++ {
        for num := range numbers() {
            if num >= 1000 {
                break
            }
            fmt.Println(num)
        }
    }
}

This code has a memory leak in current Go. The function numbers returns a channel that generates an infinite sequence of natural numbers. We get this sequence 1000 times in the main function, print the first 1000 numbers of each sequence and quit. The numbers function spawns a goroutine that feeds the channel with numbers. However, once we print the first 1000 numbers produced by the goroutine, the goroutine stays in the memory, blocked forever. If the first for-loop iterated forever, we would run out of memory very quickly.

How to collect goroutines?

I'd suggest the following algorithm:

All non-blocked goroutines are marked active.
All channels not reachable by active goroutines are marked inactive.
All gouroutines blocked on an inactive channel are marked dead and garbage collected.

Edit: Based on the discussion, I add one detail to the implementation. A goroutine marked dead would be collected in a way that's identical to calling runtime.Goexit inside that goroutine (https://golang.org/pkg/runtime/#Goexit).

Edit 2: Based on the further discussion, runtime.Goexit behavior is debatable and maybe not right.

What are the benefits?

Once such goroutine garbage collection is enabled, we can solve large variety of new problems.

Infinite generator functions. Generator is a function that returns a channel that sends a stream of values. Heavily used in functional languages, they are currently hardly possible in Go (we'd have to send them a stopping signal).
Finite generator functions. We can do them in Go, however, we have to drain the channel if we want to avoid a memory leak.
Manager goroutines. A nice way to construct a concurrent-safe object in Go is to, instead of guarding it by a mutex, create a manager goroutine, that ranges over a channel that sends commands to this object. Manager goroutine then executes this commands as they come in. This is hard to do in Go, because if the object goes out of scope, manager goroutine stays in the memory forever.

All of the described problems are solvable. Manual memory management is also solvable. And this is just like that. Inconvenient, error-prone and preventing us from doing some advanced stuff, that would be really useful.

I'd like to point out, that the whole talk "Advanced Go Concurrency Patterns" would be pointless, if Go collected blocked goroutines.

What do you guys think?

bradfitz commented 7 years ago

This makes the garbage collector more expensive.

Also, how do you "mark dead" a goroutine? A goroutine can't be killed by another goroutine. Do you run deferred statements? Does the channel operation panic?

This would require a lot of design & implementation work for marginal benefit.

faiface commented 7 years ago

I'm not sure about the details, but it seems logical that the deferred statements would run and then the goroutine would be silently killed. It would not be killed by another goroutine, it would be killed by the garbage collector.

I personally don't think the benefits are marginal. It would make a lot of the obvious code work correctly.

If you watch Rob Pike's "Go Concurrency Patterns" talk, you see that his code examples would not be correct, if the goroutines would not be garbage collected like this. Also, in one point in the talk, being asked by some guy "What happens to that goroutine", he responds something like, "Don't worry, it'll be collected".

bradfitz commented 7 years ago

I'm not sure about the details,

The proposal process is really about the details, though.

but it seems logical that the deferred statements would run and then the goroutine would be silently killed.

What about locks it's currently holding? Would deferred statements be run?

What about the cases where it actually is a bug to have the leak and the goroutine shouldn't be silently swept under the rug?

Maybe an alternate proposal would be that the sender goroutine should have to explicitly select on the receiver going away:

func numbers() <-chan int {
    ch := make(chan int)
    go func() {
        for i := 0; ; i++ {
            select {
                        case ch <- i:
                        unreachable:  // language change, though.
                                // choose: panic, return silently, etc
                        }
        }
    }()
    return ch
}

But really, now that we have context in the standard library, the answer should be to start the generator goroutine with a context.

faiface commented 7 years ago

I think that deferred calls should be run. If one of the deferred calls is a mutex unlock, than it'll be run. If the blocked goroutine holds a mutex, but no deferred call unlocks that mutex, then it's a bug and although the goroutine will be collected, the mutex won't be unlocked which would definitely show the bug, just as good as a goroutine memory leak.

On the Context: of course, it's possible to use here, however, you have to manually cancel it. It's just like manual memory management.

bradfitz commented 7 years ago

Okay, so your proposal is that a channel send or receive operation that can be dynamically proven to never complete turns into a call to runtime.Goexit (https://golang.org/pkg/runtime/#Goexit)?

faiface commented 7 years ago

Exactly, yes.

randall77 commented 7 years ago

I don't agree with calling Goexit. If a goroutine is blocked forever, we can just garbage collect it without changing any semantics (and maybe sample it or something, for the bug case). It would never have run again, so there isn't an observable difference except for memory footprint. Why would it call Goexit, run defers, etc.? That seems like an unexpected behavior while waiting on a channel. And it feels like finalizers - we can only provide nebulous guarantees about whether we can detect it and how quickly we do so.

Holding a lock while doing a channel operation should be heavily frowned upon :(

faiface commented 7 years ago

@randall77 You've got a valid point. This would need to be discussed.

fraenkel commented 7 years ago

How would goroutines waiting on Tickers or Timers work? What about channels in channels?

faiface commented 7 years ago

@fraenkel Tickers and Timers theoretically always have a running goroutine behind them, counting the time, which eventually sends a signal. So, they're always active. Blocking on them doesn't garbage collect anything.

Channels in channels don't make any difference, if I'm not missing anything. If a channel is not reachable by any active goroutine, then no values will ever be sent or received on it.

nullchinchilla commented 7 years ago

If this is implemented (and I think it indeed would be useful), deferred functions should not run. Essentially this should not change the observable behavior of any program except for memory usage.

I do feel like this feature would be very helpful, as it allows goroutines essentially to be used as continuations (from languages like Scheme) that contain some state that could be resumed or thrown away implicitly. A goroutine in the background managing some variables local to that goroutine could replace structs in cases where you really do not want to expose any internal structure, or you are doing some complex synchronization.

nullchinchilla commented 7 years ago

In fact, it might be a good idea to panic if the garbage collector wants to collect an unreachable goroutine but finds that it has pending defers (or more conservatively, is holding a lock), since that's always a bug.

randall77 commented 7 years ago

I think this could work, but it would require folding goroutines into the marking phase of the GC. Instead of treating all goroutines as roots, start with only active goroutines. (Active here means not blocked on a channel. Goroutines in syscalls are considered active for this.) Any time we mark a channel, mark all the goroutines waiting on that channel as active, and schedule their stacks for scanning. At the end of mark, any goroutines not marked active can be collected. As a bonus, we never need to scan their stack (and we can't, because then we would find a reference to the channel they are waiting on).

There are lots of tricky details:

Channels have direction - if an active goroutine has a send-only reference to a channel, it could never wake up a goroutine receive-waiting on that channel. The GC has no way to tell different channel references apart (send/receive/bidirectional is only a compile-time thing).
The GC needs to know that channels (or possibly something the channel references, like the goroutine struct) are special objects. It needs that information to know that when marking the channel, it needs to do extra work it doesn't need to do for normal objects. We'd need to allocate channels on segregated pages or something. or maybe marks similar to how we do finalizers?
Goroutines could be selecting on several channels at once. Any one of those could cause it to be active.

On a more meta level, this proposal would encourage using forever-blocking goroutines during the normal course of business. Right now they are considered a bug. The proposal suggests this is just a defense against memory leaks. But this is a slippery slope people are surely going to drive a mac truck down. Lazy-evaluating Haskell interpreter, anyone? Go isn't really a good fit for that, as a goroutine is a really heavy future implementation. But people will try anyway, and get frustrated with the performance.

fraenkel commented 7 years ago

If one happens to write incorrect code which creates deadlocks or goroutines which are "dead" for some reason, one cannot easily determine what went wrong since the evidence is now magically removed.

faiface commented 7 years ago

@fraenkel It seems to me that deadlocks could easily be handled as a special case.

nullchinchilla commented 7 years ago

@randall77 I already use goroutines heavily as an annoying future implementation that requires manual memory management. The whole point of goroutines being extremely lightweight and scalable is that doing this is not "really heavy". If goroutines are just used where you would use threads in Java, etc, then it has lost some of its utility.

nullchinchilla commented 7 years ago

For example, I often use goroutines to express a sequence of numbers, such as an allocator for object IDs, to avoid intertwining the logic for incrementing counters, etc with the main loop where the business logic happens.

Far more frequently, I use goroutines to manage an object that requires complex synchronization. Essentially, in the function that creates the object ("NewWhateverStruct(...)" etc), a goroutine will be spun up in the background that communicates with the methods through channels and does all the actual work. This can include objects that do not manage external resources; a large in-memory thread-safe database for example. Currently, users of such an object must call a "Close()" method or something to kill the goroutines running in the background, which is annoying and easy to mess up, especially when the object may be referenced many times throughout many goroutines.

bradfitz commented 7 years ago

@randall77, you could imagine doing exactly what you propose in the GC, but instead of doing anything else to kill the goroutine, just ~sync.Once a warning about a leaked goroutine to stderr. The runtime doesn't complain to stderr often, but this seems justified enough.

docmerlin commented 7 years ago

This would also allow some of the more common erlang patterns to be useful in go.

egonelbre commented 7 years ago

It's possible to detect when a goroutine is blocked for more than some time by examining the stack (proof-of-concept https://github.com/egonelbre/antifreeze/blob/master/monitor.go). As such, one of the approaches could be that you can specify a time limit for how long a goroutine can be blocked.

With regards to the example, this is faster, shorter and doesn't require an additional goroutine:

type Numbers struct{ next int64 }
func (n *Numbers) Next() int { return int(atomic.AddInt64(&n.next, 1)) }

func main() {
    for i := 0; i < 10; i++ {
        var numbers Numbers
        for {
            num := numbers.Next()
            if num >= 1000 {
                break
            }
            fmt.Println(num)
        }
    }
}

nullchinchilla commented 7 years ago

@egonelbre Using a timeout would be horrible, as it will either be too long or too short for some purposes. Especially accidentally timing out a goroutine which is still reachable would cause weird unpredictable bugs.

Integrating goroutine-collecting with the GC would make sure that all the collecting is completely "unobservable" without dumping stack traces. And as @DocMerlin mentions, it does significantly increase the expressivity of Go compared to languages like Erlang with somewhat similar concepts of communicating lightweight processes.

Merovius commented 7 years ago

I was missing this for a long time too and still don't quite understand, why memory is considered a resource that a programmer should obviously not need to care about, while goroutines are (cf. "What about the cases where it actually is a bug to have the leak and the goroutine shouldn't be silently swept under the rug?" - I could just as easily ask "what about cases where a leaked pointer is actually a bug and the GC is sweeping that under the rug?").

That being said, I believe a) this proposal so far to be too hand-wavy. I think there are a lot of questions to be answered about the details of what goroutines can and can't be collected. And as this feature is only useful, once it's codified in the spec (otherwise a program can't rely on the collection, thus any actual use of it would still be incorrect), these questions would need good, complete but also concise answers. And b) that indeed, context seems like a good enough solution to at least all cases I care about. In 99% of code, I need to plumb through a context anyhow and rely on it being cancelled upstack.

So, it would be cool, if this could happen, but the edge-cases should definitely be thought hard about.

faiface commented 7 years ago

@Merovius Fully agree. This proposal is not meant to be something complete, I was rather going for to see whether other people were missing this too.

I think it would be useful if some more people from the Go team stated their opinion on this (such as @griesemer, @robpike, @adg, etc. not mentioning @bradfitz, who already contributed).

If the proposal turns out to be reasonable and acceptable, then we might start thinking about the details in depth.

bradfitz commented 7 years ago

The @golang/proposal-review group meets on Mondays, but generally doesn't comment on proposals until they've been open with community input for a week.

faiface commented 7 years ago

@bradfitz Aaa, thanks, I didn't know that. Does that mean that this proposal will be discussed on the Monday in two weeks?

bradfitz commented 7 years ago

It depends on backlog.

nullchinchilla commented 7 years ago

@Merovius context is similar to manually managing memory using free in my mind, especially in code that does not use the context package pervasively, or even for which a workflow involving context makes no sense (say, an in-memory data structure package)

nullchinchilla commented 7 years ago

@Merovius It seems like there's a very simple way of defining "unreachable":

A goroutine is unreachable if it is waiting on a resource that is unreachable (in the GC sense) from any other goroutine stack.

randall77 commented 7 years ago

@bunsim: not quite.

A goroutine is unreachable if it is waiting on a resource that is unreachable (in the GC sense) from any other goroutine stack.

... from any other reachable goroutine's stack.

Which of course is a recursive definition. GC marking solves exactly this problem for objects.

nullchinchilla commented 7 years ago

Ah yes, you could have a situation where a bunch of unreachable goroutines wait on each other.

docmerlin commented 7 years ago

Which is why doing this from the GC makes sense On Sat, Mar 25, 2017 at 15:14 Keith Randall notifications@github.com wrote:

@bunsim https://github.com/bunsim: not quite.

A goroutine is unreachable if it is waiting on a resource that is unreachable (in the GC sense) from any other goroutine stack.

... from any other reachable goroutine's stack.

Which of course is a recursive definition. GC marking solves exactly this problem for objects.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/golang/go/issues/19702#issuecomment-289236550, or mute the thread https://github.com/notifications/unsubscribe-auth/ABLfWwFj-0ZeGnWWmHRbgAllZwlgCdY3ks5rpXWJgaJpZM4MoqnK .

Merovius commented 7 years ago

@bunsim I believe comparing cancelling contexts with using free is a false equivalency. The closest equivalency to memory-management, I believe, would be having a pool-allocator, passing that down-stack and free it as a whole once the request is finished, precisely eliminating the need to call free. In general, your http/rpc/whatever server should call cancel for you at some point, eliminating the need for you to do it. Now, that doesn't help with code that isn't request oriented, but that's why I emphasized the "I" in my post; personally, I just can't think of any code I would write that a) would care about a goroutine-leak because it's long-running and b) isn't some sort of server with context-scoped work.

In response to your definition of unreachable, waving your hands more vigorously isn't really progress. There are selects to think about, read-only channels, write-only channels, sending channels over channels, changing the values of channel variables, interfaces… I'm not saying, that there isn't a good precise definition to be given that covers all these cases. But I also wouldn't be surprised if solving this problem precisely would turn out to be equivalent to solving the halting problem. It's just very hard to tell, as long as the level of discussion is this high-level.

It is still worthwhile to think about even partial heuristics that can be employed (it would probably be the first step anyway) without codifying it in the spec, just as a clutch to contain the effects of bugs, but as I said, any such heuristic couldn't be relied upon. So as for correctness you can't assume them anyway, making them much less interesting as a feature for the average go-user, I think. At least for the next couple of releases, I, at least, am going to contain my excitement here.

nullchinchilla commented 7 years ago

@Merovius There are other languages that support this feature, and "collecting infinitely blocking threads" is something that is already an established concept. For example, Racket does exactly this, with a well-defined specification:

A thread that has not terminated can be garbage collected (see Garbage Collection)
 if it is unreachable and suspended or if it is unreachable and blocked on only
 unreachable events through functions such as semaphore-wait, 
semaphore-wait/enable-break, channel-put, channel-get, sync, sync/enable-break,
 or thread-wait. Beware, however, of a limitation on place-channel blocking; see the
 caveat in Places.

Racket has an much more complex system of channel-like constructs ("events") than Go's channel select --- everything from events to file descriptors to channels to user-defined composite events could be put in a select-equivalent --- but yet that paragraph reasonably fully describes how the GC collects unreachable threads.

It is certainly not equivalent to solving the halting problem, and is similar in difficulty to memory GC.

This feature would be pretty useless and indeed "bug-hiding" if it were just used to fix up broken programs that leak goroutines. It is certainly possible to implement this as a strict guarantee with strict definitions of unreachability, and in fact that would be the only way I would support this proposal.

zigo101 commented 7 years ago

Will this proposal keep Go 1 compatibility? For example:

func numbers(p *T) <-chan int {
    ch := make(chan int)
    go func() {
        for i := 0; ; i++ {
            ch <- i
        }

        runtime.KeepAlive(p)
    }()
    return ch
}

or a simpler one

func SillyKeepAlive(p *T) {
    select{}

    runtime.KeepAlive(p)
}

if the generator goroutine is collected, then p will be too, which may be not the expected.

mattn commented 7 years ago

We don't know whether the chan will be read in later. code only know that the chan should be closed or not. So exit groutine explicitly(using context/cancel) or using time-out is better, I think.

select {
case ch <- i:
case <-time.After(time.Second):
    close(ch)
    return // exit goroutine
}

zigo101 commented 7 years ago

Another reason to reject this proposal: often, I want to block a goroutine for ever deliberately, for example, to avoid the main goroutine exiting. This proposal will make this impossible.

faiface commented 7 years ago

@golang101 This argument doesn't seem very reasonable because:

A blocked unreachable goroutine will not stop the main goroutine from exiting.
A blocked unreachable goroutine does nothing and will never do anything. Collecting it makes no difference except for memory consumption.

zigo101 commented 7 years ago

So blocked-for-ever main goroutine is not a blocked unreachable goroutine? Not unreasonable.

There should be at least one more condition to support this proposal: there is no other heap value references after the blocking point. The example in https://github.com/golang/go/issues/19702#issuecomment-289259906 obviously doesn't satisfy this condition.

nullchinchilla commented 7 years ago

@golang101 I also cannot easily see a situation where your code with runtime.KeepAlive would result in different output with goroutine-collecting on, and goroutine-collecting off.

Can you construct a simple program that uses runtime.KeepAlive in a way so that the program outputs one thing when goroutines are collected, and another thing when goroutines are not collected? runtime.KeepAlive is weird since essentially the only time where it ever affects observable behavior is in a small handful of cases related to syscalls and finalizers.

randall77 commented 7 years ago

This proposal does cause finalizers to run that would never have otherwise.

nullchinchilla commented 7 years ago

True, but when do finalizers run is already something that's poorly defined and implementation specific. I wouldn't imagine any "real code" breaking if goroutines are collected; indeed right now infinitely blocking goroutines are a memory leak so they are unlikely to occur in non-buggy code.

RLH commented 7 years ago

Unfortunately those finalizers will have access to otherwise blocked channels.

ch := make(chan int) runtime.SetFinalizer(obj, func () { doSomethingWithChannel(ch) })

On Sun, Mar 26, 2017 at 12:09 PM, Keith Randall notifications@github.com wrote:

This proposal does cause finalizers to run that would never have otherwise.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/golang/go/issues/19702#issuecomment-289293924, or mute the thread https://github.com/notifications/unsubscribe-auth/AA7Wn5Uh-SAjbW7x6pM2uwZZfY8nZ3-7ks5rpo23gaJpZM4MoqnK .

faiface commented 7 years ago

@randall77 I'd actually say that if you have an object, that is kept from being finalized by a goroutine that's blocked forever, I'd say that that's always a bug and a memory leak and an unintended behaviour. Correct me if I'm wrong.

randall77 commented 7 years ago

@RLH : objects with finalizers are GC roots, so if a finalizer can get to a channel, anyone waiting on that channel would not get GCd. So a goroutine could still block forever without getting GCd if it holds a reference to an object whose finalizer can reference the channel it is forever blocked on.

@faiface: probably. Just another data point to consider. We're skirting the edge of the Go1 compatibility promise.

mattn commented 7 years ago

I just pointed an example of going through an infinite loop in the groutine. My opinion is explicitly exit groutine (it self). Because the chan may be read in later.

func main() {
    for i := 0; i < 1000; i++ {
        n := numbers() {
        for num := range n {
            if num >= 1000 {
                break
            }
            fmt.Println(num)
        }
        go doSomething(n)
    }
}

finalizer seems good for me but it may be leak.

faiface commented 7 years ago

@mattn I didn't quite get your point here... If the channel can be used later, than it's not unreachable/inactive, so the goroutine sending values on it wouldn't be collected. A channel is unreachable/inactive only in the case, when it's not reachable by following references/pointers from an active goroutine.

mattn commented 7 years ago

@faiface

If the channel can be used later, than it's not unreachable/inactive

I just mean it's harder to detect whether the chan will be unreachable/inactive or not. (, without finalizer)

nullchinchilla commented 7 years ago

Indeed, if goroutines holding references to finalizers will not be collected even if they are blocked indefinitely, it would be very surprising behavior for users, since finalizers might even be set by code in other packages. Goroutine collection would not be a good idea if programmers have to guess whether or not the code they wrote leaks goroutines.

Finalizers are also often used to call free on C-managed memory, in which case running them does make sense when collecting goroutines. I propose that we simply say that the existence of finalizers do not matter, and when goroutines are collected they may be run even if they wouldn't be run otherwise. This slightly breaks compatibility, but literally every change to the GC slightly breaks compatibility w.r.t. when do finalizers run. The introduction of runtime.KeepAlive already broke compatibility to a greater degree, IMO.

Merovius commented 7 years ago

@bunsim As the current guarantee re finalizers basically are "they might not run, even if an object is unreachable", I disagree with your characterization. The guarantee allows running fewer or more finalizers of objects that are unreachable. It does not, however, allow running finalizers for objects that are (logically) reachable.

Anyway, I also don't think it's a particularly important breakage. Finalizers are broken anyway; and it seems we were fine adding just such a breakage recently (and introducing runtime.KeepAlive as a poor patch for that).

nullchinchilla commented 7 years ago

Exactly. runtime.KeepAlive already breaks the connection between "logical" reachability and finalizers running, so adding goroutine collection will almost certainly not break anybody's correctly written program.

golang / go