riverqueue / river

Fast and reliable background jobs in Go
https://riverqueue.com
Mozilla Public License 2.0

panic: send on closed channel #400

Closed · stefansaasen closed this 3 weeks ago

stefansaasen commented 3 weeks ago

Since upgrading from 0.6.0 to 0.7.0 I'm seeing the following error when shutting down the river client:

panic: send on closed channel

goroutine 63 [running]:
github.com/riverqueue/river/internal/jobcompleter.(*BatchCompleter).handleBatch(0x140001423c0, {0x101078580, 0x140006973c0})
        /Users/foo/go/pkg/mod/github.com/riverqueue/river@v0.7.0/internal/jobcompleter/job_completer.go:420 +0x2ec
github.com/riverqueue/river/internal/jobcompleter.(*BatchCompleter).Start.func1()
        /Users/foo/go/pkg/mod/github.com/riverqueue/river@v0.7.0/internal/jobcompleter/job_completer.go:314 +0x37c
created by github.com/riverqueue/river/internal/jobcompleter.(*BatchCompleter).Start in goroutine 1
        /Users/foo/go/pkg/mod/github.com/riverqueue/river@v0.7.0/internal/jobcompleter/job_completer.go:270 +0x120
make: *** [run-worker] Error 2

or

panic: send on closed channel

goroutine 324 [running]:
github.com/riverqueue/river/internal/jobcompleter.(*AsyncCompleter).JobSetStateIfRunning.func1()
        /Users/foo/go/pkg/mod/github.com/riverqueue/river@v0.7.0/internal/jobcompleter/job_completer.go:170 +0x110
golang.org/x/sync/errgroup.(*Group).Go.func1()
        /Users/foo/go/pkg/mod/golang.org/x/sync@v0.7.0/errgroup/errgroup.go:78 +0x58
created by golang.org/x/sync/errgroup.(*Group).Go in goroutine 273
        /Users/foo/go/pkg/mod/golang.org/x/sync@v0.7.0/errgroup/errgroup.go:75 +0x98
make: *** [run-worker] Error 2

I'm using the following context.Context when starting the client:

    ctx, stop := signal.NotifyContext(ctx, syscall.SIGINT, syscall.SIGTERM, syscall.SIGQUIT)
    defer stop()

...

    if err := riverClient.Start(ctx); err != nil {
        return fmt.Errorf("failed to start river client: %w", err)
    }

and the following code to shut the river client down (maybe I'm holding it wrong?):

    var wg errgroup.Group
    wg.Go(func() error {
        <-ctx.Done()
        logger.Info("shutting down river client...")
        // Stop fetching new work and wait for active jobs to finish.
        ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
        defer cancel()
        if err := riverClient.Stop(ctx); !errors.Is(err, context.Canceled) {
            return err
        }
        return nil
    })

So when the original context is canceled, the client gets 30 seconds to stop the jobs. This might be incorrect but worked with 0.6.0 and now I'm seeing the panic.

I've run git bisect and it seems that commit 0e573385 might have introduced this. I'm not familiar enough with the code to offer a suggestion. Happy to help though if you need anything.


Note: I haven't been able to reproduce this with a simple test case, so it's probably something to do with my specific usage.

Happy to explore this further. I'd appreciate if you could give me some pointers so that I know what to look for.

bgentry commented 3 weeks ago

Thanks for the detailed report and sorry for the issue. It seems that you still have jobs running when your client is shut down. River obviously shouldn't panic in such situations, but in the meantime I would suggest some tweaks to your shutdown logic to decrease the likelihood of this happening.

Stop() is a very soft form of shutdown which stops fetching new work but does not interrupt running jobs in any way. If your jobs are consistently fast, you should see the call exit cleanly. But if you have jobs that run for more than 30 seconds, things may still be active after you've waited 30s. This is why there's a second, more aggressive shutdown option, StopAndCancel, which also cancels the work context of any running jobs.

If you call that after you’ve given up waiting, any properly written jobs (those that respect context cancellation) should immediately return and the client should complete shutdown quickly.
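
Roughly, the combination looks something like this. This is only a sketch: the helper name is made up, and the 30-second budget mirrors your snippet rather than anything River requires:

    import (
        "context"
        "time"

        "github.com/jackc/pgx/v5"
        "github.com/riverqueue/river"
    )

    // stopRiverClient is an illustrative helper: try a graceful Stop first,
    // then escalate to StopAndCancel if the grace period runs out.
    func stopRiverClient(riverClient *river.Client[pgx.Tx]) error {
        // Phase 1: stop fetching new jobs and give running jobs up to 30s
        // to finish on their own.
        stopCtx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
        defer cancel()
        if err := riverClient.Stop(stopCtx); err == nil {
            return nil
        }

        // Phase 2: the grace period ran out, so cancel the work context of
        // any jobs that are still running and wait for them to return.
        return riverClient.StopAndCancel(context.Background())
    }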

Workaround aside, we’ll dive into this bug ASAP!

stefansaasen commented 3 weeks ago

Thanks for your swift response, much appreciated!

The documentation suggests that the executed jobs inherit from the ctx that was used when the client was started:

// Run the client inline. All executed jobs will inherit from ctx:
if err := riverClient.Start(ctx); err != nil {
    // handle error
}

In my case that is the context that gets cancelled on syscall.SIGINT, for example, so the jobs all get cancelled already as they inherit from that context.

You are suggesting that StopAndCancel is required to do this, so I feel like I'm using it wrong.

I assumed that everything gets shut down based on context cancellation anyway:

ctx, stop := signal.NotifyContext(ctx, syscall.SIGINT, syscall.SIGTERM, syscall.SIGQUIT)
defer stop()

// Set up the pool etc.
if err := riverClient.Start(ctx); err != nil {
    // ...
}

But I guess the correct pattern would be more akin to:


// Set this up so that we can gracefully shutdown the client on e.g. `CTRL+C`.
ctx, stop := signal.NotifyContext(ctx, syscall.SIGINT, syscall.SIGTERM, syscall.SIGQUIT)
defer stop()

// Start the river client but use a _different_ context

go func() {
    // Use a different context _here_
    if err := riverClient.Start(context.Background()); err != nil {
        // ...
    }
}()

// Wait on the first context and _then_ call `Stop` and then `StopAndCancel`
var wg errgroup.Group
wg.Go(func() error {
    <-ctx.Done()
    // Now call `Stop` (with a timeout) and then `StopAndCancel` if it doesn't
    // finish in time.
    return nil
})
The main difference being that the context used to start the river client is not cancelled; instead, Stop and then StopAndCancel are called.

Is that the expected usage? Setting the client up this way does seem to mitigate the panic problem. On the other hand, if the initial context used to start the client gets cancelled on receiving one of the signals, should the client respect that context cancellation as well?

bgentry commented 3 weeks ago

@stefansaasen jobs do inherit their work context from the one provided at Start(), so if you're already cancelling that then your jobs should be getting signaled the same way even without a StopAndCancel() call. Sorry for the confusion on that bit!

The reason for the two different shutdown modes is to allow for a slightly more graceful first shutdown phase where no new jobs are fetched, but existing ones are allowed to continue. But if you cancel the context (or call StopAndCancel), the work context is cancelled and your jobs should return ~immediately, with little opportunity to finish what they were doing.

In that case, it seems there are two issues:

  1. Your jobs must be blocking on some other operation and not halting their work when the context is cancelled (see the sketch after this list).
  2. The underlying issue with this situation leading to a panic, which is for sure a bug in River.
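
For illustration, a job that halts promptly when its work context is cancelled looks roughly like this (SleepArgs and SleepWorker are made-up names; only the Worker interface itself is River's):

    import (
        "context"
        "time"

        "github.com/riverqueue/river"
    )

    type SleepArgs struct{}

    func (SleepArgs) Kind() string { return "sleep" }

    type SleepWorker struct {
        river.WorkerDefaults[SleepArgs]
    }

    func (w *SleepWorker) Work(ctx context.Context, job *river.Job[SleepArgs]) error {
        select {
        case <-time.After(time.Minute):
            // Stand-in for long-running work.
            return nil
        case <-ctx.Done():
            // The work context is cancelled when the context passed to Start()
            // is cancelled or when StopAndCancel() is called; return promptly.
            return ctx.Err()
        }
    }
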
stefansaasen commented 3 weeks ago

Thanks @bgentry for the explanation.

I think the graceful approach of calling Stop first and then StopAndCancel later seems like a useful improvement on my end.

The jobs do actually halt their work when the context is cancelled, so apart from the panic (and from being less graceful than Stop, where jobs can finish), it works as expected.

Anyway, as I said, I'm definitely going to look into moving to Stop followed by StopAndCancel as that seems like the better approach for my use case.

Thanks for your work on River!

brandur commented 3 weeks ago

Hey @stefansaasen, this should be fixed up in v0.8.0. Thanks for reporting.

stefansaasen commented 3 weeks ago

Thanks a lot for the quick turnaround!