bgentry closed this pull request 3 weeks ago
could this have been causing stuck jobs on graceful shutdown?
@elee1766 I believe so; we should add that to the changelog. The core issue is that the completer (which marks completed jobs in the db) wasn't properly waiting for jobs to finish executing due to a refactor, so some jobs may not have been marked as finished even though they had in fact executed and returned.
Back in #258 / 702d5b2, the batch completer was added to improve throughput. As part of that refactor, it was turned into a startstop service that takes a context on start. We took care to ensure that the context provided to the completer was not the `fetchCtx` (cancelled on `Stop()`) but instead the raw user-provided `ctx`, specifically so that the completer could finish its work even after fetches were stopped.

This worked well as long as the whole shutdown process was driven by `Stop` / `StopAndCancel`, but it did not work if the user-provided context was itself cancelled outside of River. In that scenario, the completer would immediately begin shutting down upon cancellation, without waiting for producers to finish sending it any final jobs that needed to be recorded. This went unnoticed until #379 / 0e57338 turned this scenario into a panic instead of a silent misbehavior, which is what was encountered in #400.

To fix this, we use Go 1.21's new `context.WithoutCancel` API to fork the user-provided context: we keep whatever else is stored in it (i.e. anything used by slog is still available), but the completer's context is never cancelled. The completer manages its own shutdown when its `Stop()` is called alongside all of the other client services being stopped in parallel.

Fixes #400.