sorentwo / oban

💎 Robust job processing in Elixir, backed by modern PostgreSQL and SQLite3
https://getoban.pro
Apache License 2.0
3.17k stars 297 forks

Log when jobs don't finish executing within `:shutdown_grace_period` #1076

Closed · axelson closed this 2 months ago

axelson commented 2 months ago

If jobs take longer than :shutdown_grace_period (by default 15 seconds), they are brutally killed. This PR logs a message when that occurs. The message looks like:

Oban's :alpha queue was unable to cleanly shutdown in allotted time (:shutdown_grace_period was 10). Remaining job ids: (job ids: [1941])
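For context, the grace period is configurable on the Oban supervisor. A minimal sketch of that configuration (the repo, queue, and 30-second value below are placeholders, not the library default):

```elixir
# config/config.exs
import Config

config :my_app, Oban,
  repo: MyApp.Repo,
  queues: [alpha: 10],
  # Give running jobs up to 30 seconds to finish before they are killed.
  shutdown_grace_period: :timer.seconds(30)
```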

This is important because, most of the time in a running application, you want your jobs to finish within the :shutdown_grace_period. With this new log message it is now easier to detect when that doesn't happen and which jobs were affected, which allows developers to improve those jobs so that they finish in time (e.g. by splitting a job into multiple units of work).
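As a rough sketch of what that splitting can look like (MyApp.BatchWorker and MyApp.ItemWorker are hypothetical names, not part of this PR): a parent job fans a large batch out into per-item jobs so each one finishes comfortably within the grace period.

```elixir
defmodule MyApp.BatchWorker do
  use Oban.Worker, queue: :alpha

  @impl Oban.Worker
  def perform(%Oban.Job{args: %{"item_ids" => item_ids}}) do
    # Fan out: enqueue one small job per item instead of processing
    # everything inside this single, potentially long-running job.
    item_ids
    |> Enum.map(&MyApp.ItemWorker.new(%{"item_id" => &1}))
    |> Oban.insert_all()

    :ok
  end
end

defmodule MyApp.ItemWorker do
  use Oban.Worker, queue: :alpha

  @impl Oban.Worker
  def perform(%Oban.Job{args: %{"item_id" => _item_id}}) do
    # Do the per-item work here; each job stays short-lived.
    :ok
  end
end
```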

I pass conf.name through to Oban.Queue.Watchman so that it is able to check the appropriate queue status (which is especially important in the tests). I'm not in love with conf_name, but I went with it because name was already taken. As an aside, I think the typespec for name is incorrect: it is currently module(), but it actually receives a via tuple like {:via, Registry, {Oban.Registry, {Oban, {:producer, "alpha"}}}}
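For illustration only, a broader spec along the lines of what I mean (an assumption about how it could be typed, not Oban's actual code):

```elixir
defmodule SpecSketch do
  # Hypothetical: accept registered module names and via tuples such as
  # {:via, Registry, {Oban.Registry, {Oban, {:producer, "alpha"}}}}.
  @type name :: module() | {:global, term()} | {:via, module(), term()}
end
```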

I've added minimal test updates (enough to verify the new behavior); if this PR looks good, I can make additional updates as needed.

axelson commented 2 months ago

The test failure doesn't look related to the PR ("test cron jobs are scheduled using the configured timezone (Oban.Plugins.CronTest)")

sorentwo commented 2 months ago

@axelson Thanks for opening this PR and kicking off this work. The linked commit was inspired by your efforts, and gives some attribution to this PR.

The approach in e7f91f4 differs in a few substantial ways:

  1. Emit a telemetry event rather than logging from the watchman. Other instrumentation, including alerting systems, can attach to that event (see the sketch after this list).
  2. Add a handler for that event in the default logger, like it does for notifier, peer, and staging changes.
  3. Lightly extend the watchman's shutdown period, just long enough to emit the telemetry event rather than adding a quarter of a second to the total shutdown time for each queue.
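
For anyone who wants to hook into that event from their own instrumentation, here is a hedged sketch of attaching a handler. The event name ([:oban, :queue, :shutdown]) and metadata keys (:queue, :orphaned) are assumptions based on the description above; check Oban.Telemetry for the actual contract in your version.

```elixir
defmodule MyApp.ObanShutdownAlert do
  require Logger

  # Attach once at application start, e.g. from Application.start/2.
  def attach do
    :telemetry.attach(
      "my-app-oban-shutdown-alert",
      # Assumed event name; verify against Oban.Telemetry docs.
      [:oban, :queue, :shutdown],
      &__MODULE__.handle_event/4,
      nil
    )
  end

  def handle_event(_event, _measurements, %{queue: queue, orphaned: orphaned}, _config)
      when orphaned != [] do
    # Forward to an alerting system instead of (or in addition to) logging.
    Logger.warning("Oban queue #{queue} shut down with unfinished jobs: #{inspect(orphaned)}")
  end

  def handle_event(_event, _measurements, _metadata, _config), do: :ok
end
```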
axelson commented 1 month ago

Ah, that makes sense. I didn't think about telemetry.

> Lightly extend the watchman's shutdown period, just long enough to emit the telemetry event rather than adding a quarter of a second to the total shutdown time for each queue.

Yeah I wasn't sure how long would be needed to reliably emit the event.

Thanks for adding support for this!