Closed stefansaasen closed 3 weeks ago
Thanks for the detailed report and sorry for the issue. It seems to be that you still have jobs running when your client is closed. River obviously shouldn’t panic in such situations, but meanwhile I would suggest some tweaks to your shutdown logic to decrease the likelihood of this happening.
Stop()
is a very soft form of shutdown which will stop fetching new work, but will not interrupt running jobs in any way. If your jobs are consistently fast, you should see it exit cleanly from this call. But if you have jobs that run more than 30 seconds, things may still be active after you’ve waited 30s. This is why there’s a 2nd more aggressive shutdown option StopAndCancel
which also cancels the work context of any running jobs.
If you call that after you’ve given up waiting, any properly written jobs (those that respect context cancellation) should immediately return and the client should complete shutdown quickly.
Workaround aside, we’ll dive into this bug ASAP!
Thanks for your swift response, much appreciated!
The documentation suggests that the executed jobs inherit from the ctx
that was used when the client was started:
// Run the client inline. All executed jobs will inherit from ctx:
if err := riverClient.Start(ctx); err != nil {
// handle error
}
In my case that is the context that gets cancelled on syscall.SIGINT
for example. So the jobs all get cancelled already as they inherit from that context.
You are suggesting that StopAndCancel
is required for doing this, so I feel I'm using this wrong.
I assumed that everything gets shutdown based on context cancellation anyway:
ctx, stop := signal.NotifyContext(ctx, syscall.SIGINT, syscall.SIGTERM, syscall.SIGQUIT)
defer stop()
// Set up the pool etc.
if err := riverClient.Start(ctx); err != nil {
// ...
}
But I guess the correct pattern would be more akin to:
// Set this up so that we can gracefully shutdown the client on e.g. `CTRL+C`.
ctx, stop := signal.NotifyContext(ctx, syscall.SIGINT, syscall.SIGTERM, syscall.SIGQUIT)
defer stop()
// Start the river client but use a _different_ context
go func() {
// Use a different context _here_
if err := riverClient.Start(context.Background()); err != nil {
// ...
}
}()
// Wait on the first context and _then_ call `Stop` and then `StopAndCancel`
var wg errgroup.Group
wg.Go(func() error {
<-ctx.Done()
// Now call `Stop` and then `StopAndCancel`
With the main difference being that the context used to start the river client is not cancelled but instead Stop
and StopAndCancel
is called.
Is that the expected usage? Setting the client up this way, does seem to mitigate the panic
problem. On the other hand, if the initial context used to start the client gets cancelled on receiving one of the signals, should the client respect context cancellation as well?
@stefansaasen jobs do inherit their work context from the one provided at Start()
, so if you're already cancelling that then your jobs should be getting signaled the same way even without a StopAndCancel()
call. Sorry for the confusion on that bit!
The reason for the two different shutdown modes is to allow the potential for a slightly more graceful first shutdown mode where no new jobs are fetched, but existing ones are allowed to continue. But if you cancel the context (or call StopAndCancel
) the context is closed and your jobs should return ~immediately, with little opportunity to finish what they were doing.
In that case, it seems there are two issues:
Thanks @bgentry for the explanation.
I think the graceful approach of calling Stop
first and then StopAndCancel
later seems like a useful improvement on my end.
The jobs actually halt their work when the context is cancelled, so apart from not being a very graceful approach compared to Stop
(where jobs can finish), I guess apart from the panic
it works as expected.
Anyway, as I said, I'm definitely going to look into moving to Stop
followed by StopAndCancel
as that seems like the better approach for my use case.
Thanks for your work on River!
Hey @stefansaasen, this should be fixed up in v0.8.0. Thanks for reporting.
Thanks a lot for the quick turnaround!
Since upgrading from
0.6.0
to0.7.0
I'm seeing the following error when shutting down the river client:or
I'm using the following
context.Context
when starting the client:and the following code to shut the river client down (maybe I'm holding it wrong?):
So when the original context is canceled, the client gets 30 seconds to stop the jobs. This might be incorrect but worked with
0.6.0
and now I'm seeing the panic.I've run
git bisect
and it seems that commit 0e573385 might have introduced this. I'm not familiar enough with the code to offer a suggestion. Happy to help though if you need anything.Note: I haven't been able to reproduce this with a simple test case so it's probably something to do with my specific usage:
Queue
s configuredHappy to explore this further. I'd appreciate if you could give me some pointers so that I know what to look for.