golang / go


x/sync/errgroup: Group: document that Go may not be concurrent with Wait unless semaphore > 0 #70284

Open haaawk opened 2 days ago

haaawk commented 2 days ago

(Edit: skip down to https://github.com/golang/go/issues/70284#issuecomment-2470418828; this is now a doc change request. --@adonovan)

Go version

go version go1.23.1 darwin/amd64

Output of go env in your module/workspace:

GO111MODULE=''
GOARCH='amd64'
GOBIN=''
GOCACHE='/Users/haaawk/Library/Caches/go-build'
GOENV='/Users/haaawk/Library/Application Support/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='amd64'
GOHOSTOS='darwin'
GOINSECURE=''
GOMODCACHE='/Users/haaawk/go/pkg/mod'
GONOPROXY=''
GONOSUMDB=''
GOOS='darwin'
GOPATH='/Users/haaawk/go'
GOPRIVATE=''
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/usr/local/go'
GOSUMDB='sum.golang.org'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/usr/local/go/pkg/tool/darwin_amd64'
GOVCS=''
GOVERSION='go1.23.1'
GODEBUG=''
GOTELEMETRY='local'
GOTELEMETRYDIR='/Users/haaawk/Library/Application Support/go/telemetry'
GCCGO='gccgo'
GOAMD64='v1'
AR='ar'
CC='clang'
CXX='clang++'
CGO_ENABLED='1'
GOMOD='/dev/null'
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -arch x86_64 -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -ffile-prefix-map=/var/folders/_y/nj5hcsh93l12tmsszxgx7ntr0000gn/T/go-build2396141927=/tmp/go-build -gno-record-gcc-switches -fno-common'

What did you do?

When the following code is run:

package main

import (
    "runtime"

    "golang.org/x/sync/errgroup"
)

func main() {
    runtime.GOMAXPROCS(1)
    g := &errgroup.Group{}
    g.SetLimit(1)
    ch := make(chan struct{})
    wait := make(chan struct{}, 2)
    g.Go(func() error {
        <-ch
        wait <- struct{}{}
        return nil
    })
    go g.Go(func() error {
        println("I'm not blocked")
        wait <- struct{}{}
        return nil
    })
    println("Ok let's play")
    close(ch)
    g.Wait()
    println("It's over already?")
    <-wait
    <-wait
}

https://go.dev/play/p/xTIsT1iouTd

What did you see happen?

The program printed:

Ok let's play
It's over already?
I'm not blocked

What did you expect to see?

The program printing:

Ok let's play
I'm not blocked
It's over already?

haaawk commented 2 days ago

I sent a patch here: https://go-review.googlesource.com/c/sync/+/627075

gopherbot commented 2 days ago

Change https://go.dev/cl/627075 mentions this issue: x/sync/errgroup: ensure all goroutines finish before Wait returns

jrick commented 1 day ago

go g.Go(

I don't think this is what you meant to write.

haaawk commented 1 day ago

go g.Go(

I don't think this is what you meant to write.

It is exactly what I meant to write. Otherwise the g.Go would block.

jrick commented 1 day ago

Your program contains a data race, because it's not valid to increment a waitgroup with 0 count while waiting on it. Your submitted patch doesn't change this.

haaawk commented 1 day ago

Your program contains a data race, because it's not valid to increment a waitgroup with 0 count while waiting on it. Your submitted patch doesn't change this.

The waitgroup has a count of 1 when go g.Go is called.

jrick commented 1 day ago

No. You would need additional synchronization to guarantee this.

package main

import (
        "runtime"

        "golang.org/x/sync/errgroup"
)

func main() {
        runtime.GOMAXPROCS(1)
        g := &errgroup.Group{}
        g.SetLimit(1)
        ch := make(chan struct{})
        wait := make(chan struct{}, 2)
        insideGo := make(chan struct{})
        g.Go(func() error {
                <-ch
                wait <- struct{}{}
                return nil
        })
        go g.Go(func() error {
                close(insideGo)
                println("I'm not blocked")
                wait <- struct{}{}
                return nil
        })
        println("Ok let's play")
        close(ch)
        <-insideGo
        g.Wait()
        println("It's over already?")
        <-wait
        <-wait
}

cherrymui commented 1 day ago

@jrick is right. go g.Go(...) is not what you want. g.Go is intentionally blocking to limit the number of active goroutines. You've called g.SetLimit(1), so that limits it to 1 active goroutine at a time. If you don't want that limit, you can remove that line, or set a higher limit.

In your case, there is no synchronization between g.Wait and the g.Go call in the new goroutine created by go g.Go(...). It is possible that g.Wait has finished while the new goroutine created by go g.Go(...) hasn't run. So this works as intended. Your CL doesn't seem to change it either: g.Wait can still finish even before the new goroutine reaches g.Go.

In general, if you have questions about how to use the API, please see https://go.dev/wiki/Questions . Thanks.
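
For reference, a minimal sketch of the intended pattern under SetLimit (the loop and task bodies are only illustrative): every Go call is issued from the goroutine that will later call Wait, and with a limit of 1 each Go simply blocks until the previous task finishes, so every task is registered before Wait runs.

package main

import (
	"fmt"

	"golang.org/x/sync/errgroup"
)

func main() {
	g := &errgroup.Group{}
	g.SetLimit(1)

	// All Go calls happen on this goroutine, before Wait. With a
	// limit of 1, the second Go blocks until the first task is done,
	// so the later Wait cannot miss any task.
	for i := 0; i < 2; i++ {
		task := i // capture the per-iteration value (automatic as of Go 1.22)
		g.Go(func() error {
			fmt.Println("running task", task)
			return nil
		})
	}

	if err := g.Wait(); err != nil {
		fmt.Println("error:", err)
	}
	fmt.Println("all tasks finished")
}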

adonovan commented 1 day ago

Perhaps I misunderstand, but I think the Group is working as intended. SetLimit may prevent the Group from accepting new work, and the client must deal with that. The client definitely cannot call Go asynchronously, as there would be no guarantee that the call happened before the later call to Wait.

Perhaps the documentation could be improved, but I don't think there's a bug.

adonovan commented 1 day ago

Ah, @cherrymui beat me to it!

haaawk commented 1 day ago

This was just my failed attempt at a simple reproducer for an issue that's still there. In the code below the order is always right and the program still prints the same result. The problem is that g.Go can wake up from the sem in a group that was already waited on. Let me think of a less artificial example and come back. It will need to be more complex, unfortunately.

package main

import (
    "runtime"
    "time"

    "golang.org/x/sync/errgroup"
)

func main() {
    runtime.GOMAXPROCS(1)
    g := &errgroup.Group{}
    g.SetLimit(1)
    ch := make(chan struct{})
    wait := make(chan struct{}, 2)
    g.Go(func() error {
        <-ch
        wait <- struct{}{}
        return nil
    })
    go g.Go(func() error {
        println("I'm not blocked")
        wait <- struct{}{}
        return nil
    })
    println("Ok let's play")
    time.Sleep(1 * time.Second)
    close(ch)
    g.Wait()
    println("It's over already?")
    <-wait
    <-wait
}

adonovan commented 1 day ago

In the code below the order is always right and the program still prints the same result.

This program has a race: the second call to Go races with Wait. Consider: the goroutine created by the first Go function could complete (along with println, Sleep, close) before the goroutine created by the go statement even begins to execute the second call to g.Go.

haaawk commented 1 day ago

In the code below the order is always right and the program still prints the same result.

This program has a race: the second call to Go races with Wait. Consider: the goroutine created by the first Go function could complete (along with println, Sleep, close) before the goroutine created by the go statement even begins to execute the second call to g.Go.

I know it's not properly synchronized and I never said it is. I said that the order is right. I wouldn't expect the runtime to stay idle during the sleep and not execute the runnable goroutine.

I guess a race-free reproducer could be:

package main

import (
    "runtime"
    "runtime/pprof"
    "strings"
    "time"

    "golang.org/x/sync/errgroup"
)

func main() {
    runtime.GOMAXPROCS(1)
    g := &errgroup.Group{}
    g.SetLimit(1)
    ch := make(chan struct{})
    wait := make(chan struct{}, 2)
    go func() {
        println("Starting task scheduler")
        println("Scheduling first task")
        g.Go(func() error {
            <-ch
            println("First task ran")
            wait <- struct{}{}
            return nil
        })
        println("Scheduling second task")
        g.Go(func() error {
            println("What about me?")
            wait <- struct{}{}
            return nil
        })
    }()
    time.Sleep(1 * time.Second)
    var b strings.Builder
    pprof.Lookup("goroutine").WriteTo(&b, 1)
    close(ch)
    if strings.Contains(b.String(), "errgroup.go:71") {
        g.Wait()
        println("Game over")
    } else {
        println("Second task not blocked yet")
    }
    <-wait
    <-wait
}

https://go.dev/play/p/Zg5GpZJz2bK

but I have to admit, there's probably no way to reproduce the problem in a completely race-free way. If g.Go and g.Wait are in two different goroutines, the runtime can always preempt g.Wait in the middle and then schedule g.Go to run and call g.wg.Add(1) while g.Wait is running with the counter equal to 0.

haaawk commented 1 day ago

I guess it would be cleaner if the docs were saying explicitly that g.Wait has to be called only after all calls to g.Go finish. This is not obvious without looking into the implementation. One could imagine an implementation that does not have a data race when g.Wait and g.Go are called concurrently.

adonovan commented 1 day ago

I guess it would be cleaner if the docs were saying explicitly that g.Wait has to be called only after all calls to g.Go finish. This is not obvious without looking into the implementation. One could imagine an implementation that does not have a data race when g.Wait and g.Go are called concurrently.

It's fine to call Go concurrently with Wait, and indeed useful, if one item of work might add others to the queue. But you can't use SetLimit in this scenario or else the Group will stop accepting work. We could certainly document that, but we should not try to disallow concurrent calls to Go.
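
For illustration, a minimal sketch of that pattern (the recursive visit function is hypothetical): each work item schedules its children on the same group from inside its own function, so the group's counter is still held when the nested Go calls run, and no SetLimit is used, so Go never blocks.

package main

import (
	"fmt"

	"golang.org/x/sync/errgroup"
)

// visit is a hypothetical recursive task: each item adds its children
// to the same group from inside its own work function, while the
// group's count for this item is still held.
func visit(g *errgroup.Group, depth int) error {
	fmt.Println("visiting depth", depth)
	if depth < 2 {
		for i := 0; i < 2; i++ {
			child := depth + 1
			g.Go(func() error { return visit(g, child) })
		}
	}
	return nil
}

func main() {
	g := &errgroup.Group{} // no SetLimit here, so Go never blocks
	g.Go(func() error { return visit(g, 0) })
	if err := g.Wait(); err != nil {
		fmt.Println("error:", err)
	}
	fmt.Println("done")
}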

haaawk commented 1 day ago

I guess it would be cleaner if the docs were saying explicitly that g.Wait has to be called only after all calls to g.Go finish. This is not obvious without looking into the implementation. One could imagine an implementation that does not have a data race when g.Wait and g.Go are called concurrently.

It's fine to call Go concurrently with Wait, and indeed useful, if one item of work might add others to the queue. But you can't use SetLimit in this scenario or else the Group will stop accepting work. We could certainly document that, but we should not try to disallow concurrent calls to Go.

It is a very unintuitive requirement which is really driven by implementation details, not the best API design, but you're right.

adonovan commented 1 day ago

It is a very unintuitive requirement which is really driven by implementation details not the best API design

I think your criticism of the design is unfair. The doc comment for SetLimit says that it "limits the number of active goroutines", and that "any subsequent call to the Go method will block ...". The alternative that you propose would remove the blocking behavior, which would be an incompatible change (since it removes happens-before edges), and it would require that Group maintain an arbitrarily large queue of functions provided to Go before a goroutine is available to run them, which could cause an application to use unbounded memory when dealing with very long streams of tasks.

haaawk commented 1 day ago

It is a very unintuitive requirement which is really driven by implementation details not the best API design

I think your criticism of the design is unfair. The doc comment for SetLimit says that it "limits the number of active goroutines", and that "any subsequent call to the Go method will block ...". The alternative that you propose would remove the blocking behavior, which would be an incompatible change (since it removes happens-before edges), and it would require that Group maintain an arbitrarily large queue of functions provided to Go before a goroutine is available to run them, which could cause an application to use unbounded memory when dealing with very long streams of tasks.

So I'm not advocating for my change any more at all. My criticism is about the fact that calling Wait and Go concurrently with no synchronization is a data race and the docs don't say a word about it. Calling Go from a goroutine running inside a group is a form of synchronization. The requirement is really specific - "You can call Go concurrently with Wait only if you have guaranteed that at least 1 other goroutine that is already running inside the group won't finish until after either Go or Wait finishes first - or both."

adonovan commented 1 day ago

My criticism is about the fact that calling Wait and Go concurrently with no synchronization is a data race and the docs don't say a word about it. Calling Go from a goroutine running inside a group is a form of synchronization. The requirement is really specific - "You can call Go concurrently with Wait only if you have guaranteed that at least 1 other goroutine that is already running inside the group won't finish until after either Go or Wait finishes first - or both.

True, but the exact same principle applies to plain old WaitGroups: to add an item to the group you call Add(1) and later Done. In the simple case you make all the Add calls in sequence and all the Dones happen asynchronously. In more complex cases a task begets more tasks, so it calls Add (for the child) before calling Done for itself, preserving the invariant. But you get into trouble if you make the Add asynchronous to the Wait without otherwise ensuring that the semaphore is nonzero.

Perhaps we could add some clarifying documentation.
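
A minimal WaitGroup sketch of that invariant (the work function and depth limit are made up for illustration): the initial Add happens before Wait, and a task that spawns a child calls Add for the child before its own deferred Done runs.

package main

import (
	"fmt"
	"sync"
)

// work is a hypothetical task that may spawn a child task. It calls
// wg.Add(1) for the child before its own deferred wg.Done runs, so
// the counter never drops to zero while work is still outstanding.
func work(wg *sync.WaitGroup, depth int) {
	defer wg.Done()
	fmt.Println("working at depth", depth)
	if depth < 3 {
		wg.Add(1) // register the child while our own count is still held
		go work(wg, depth+1)
	}
}

func main() {
	var wg sync.WaitGroup
	wg.Add(1) // the initial Add happens before Wait, on the same goroutine
	go work(&wg, 0)
	wg.Wait()
	fmt.Println("all work done")
}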

haaawk commented 1 day ago

My criticism is about the fact that calling Wait and Go concurrently with no synchronization is a data race and the docs don't say a word about it. Calling Go from a goroutine running inside a group is a form of synchronization. The requirement is really specific - "You can call Go concurrently with Wait only if you have guaranteed that at least 1 other goroutine that is already running inside the group won't finish until after either Go or Wait finishes first - or both.

True, but the exact same principle applies to plain old WaitGroups: to add an item to the group you call Add(1) and later Done. In the simple case you make all the Add calls in sequence and all the Dones happen asynchronously. In more complex cases a task begets more tasks, so it calls Add (for the child) before calling Done for itself, preserving the invariant. But you get into trouble if you make the Add asynchronous to the Wait without otherwise ensuring that the semaphore is nonzero.

Perhaps we could add some clarifying documentation.

Yes but WaitGroup has the following in the docs:

Note that calls with a positive delta that occur when the counter is zero must happen before a Wait. Calls with a negative delta, or calls with a positive delta that start when the counter is greater than zero, may happen at any time. Typically this means the calls to Add should execute before the statement creating the goroutine or other event to be waited for. If a WaitGroup is reused to wait for several independent sets of events, new Add calls must happen after all previous Wait calls have returned.

which gives a user a chance to use the API correctly without looking into its implementation.

haaawk commented 1 day ago

BTW I came up with this issue because I found the following code, outside of the group's goroutines, in the codebase I'm working on:

    if !w.g.TryGo(f) {
        go w.g.Go(f)
    }

and it seemed fishy from a concurrency point of view, so I kept digging. Apparently someone was misled by the errgroup.Group docs/API.
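
For what it's worth, a race-free version of that intent (w, g, and f are just placeholders from the snippet above) has to keep the Go call synchronous with the code that will eventually call Wait, for example:

    // This blocks until the limit allows another task; by the time it
    // returns the task is registered, so a later Wait cannot miss it.
    w.g.Go(f)

If blocking the caller is not acceptable, the task has to be dropped or queued explicitly (and the queue drained before Wait) rather than handed to a detached go w.g.Go(f).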

adonovan commented 14 hours ago

Reopening as a doc change request.