bgentry closed this pull request 3 weeks ago
could this have been causing stuck jobs on graceful shutdown?
@elee1766 I believe so; we should add that to the changelog. The core issue is that the completer (which marks completed jobs in the db) wasn't properly waiting for jobs to finish executing due to a refactor, so some jobs may not have been marked as finished even though they had in fact executed and returned.
Back in #258 / 702d5b2, the batch completer was added to improve throughput. As part of that refactor, it was turned into a startstop service that takes a context on start. We took care to ensure that the context provided to the completer was not the `fetchCtx` (cancelled on `Stop()`) but instead the raw user-provided `ctx`, specifically so that the completer could finish its work even after fetches were stopped.

This worked well as long as the whole shutdown process was driven by `Stop` / `StopAndCancel`, but it did not work if the user-provided context was itself cancelled outside of River. In that scenario, the completer would immediately begin shutting down upon cancellation, without waiting for producers to finish sending it any final jobs that needed to be recorded. This went unnoticed until #379 / 0e57338 turned this scenario into a panic instead of a silent misbehavior, which is what was encountered in #400.

To fix this, we use Go 1.21's new `context.WithoutCancel` API to fork the user-provided context: we keep whatever else is stored in it (i.e. anything used by slog is still available), but the completer's context is never cancelled. The completer manages its own shutdown when its `Stop()` is called alongside all of the other client services being stopped in parallel.

Fixes #400.