oban-bg / oban

💎 Robust job processing in Elixir, backed by modern PostgreSQL and SQLite3
https://oban.pro
Apache License 2.0
3.37k stars 313 forks source link

Single DynamicCron with missing worker will crash GenServer and hang all DynamicCrons #1095

Closed gusbicalho closed 5 months ago

gusbicalho commented 5 months ago

Environment

Current Behavior

With Oban Pro 1.3.5: We had a DynamicCron using a worker module. At some point the worker module was deleted as we did not need it anymore. Nobody thought of deleting the DynamicCron job that used that worker, however everything kept working fine.

When we bumped Oban Pro to 1.4.7, all our DynamicCrons stopped running. Our logs show the following error, once a minute:

GenServer {Oban.Registry, {Oban, {:plugin, Oban.Pro.Plugins.DynamicCron}}} terminating
** (MatchError) no match of right hand side value: {:error, %RuntimeError{message: "unknown worker: MyDeletedWorker"}}
    (oban_pro 1.4.8) lib/oban/pro/plugins/dynamic_cron.ex:643: anonymous fn/1 in Oban.Pro.Plugins.DynamicCron.reload_entries/1
    (elixir 1.16.0) lib/enum.ex:1700: Enum."-map/2-lists^map/1-1-"/2
    (elixir 1.16.0) lib/enum.ex:1700: Enum."-map/2-lists^map/1-1-"/2
    (oban_pro 1.4.8) lib/oban/pro/plugins/dynamic_cron.ex:624: anonymous fn/2 in Oban.Pro.Plugins.DynamicCron.reload_and_insert/1

It looks like the Supervisor kept trying to restart the plugin, but every time it crashes on startup due to the missing worker module.

Deleting the Cron from the database solved the issue.

Expected Behavior

A single bad job should not break the entire cron system. The old behaviour from 1.3.5 was better; even better would be to log a warning and display in the Oban dashboard something about that broken Cron.

It would also be useful to have a guide on how to deal with changes to workers, maybe even a migration helper to remove them from the cron table.

sorentwo commented 5 months ago

It would also be useful to have a guide on how to deal with changes to workers, maybe even a migration helper to remove them from the cron table.

A guide is a gread idea, because it can be tricky to handle cleanly.

A migration to remove them from the cron table probably wouldn't work though. Not all nodes have a worker available, and migrations are typically ran before the rest of the application has booted, so there's no reliable way to tell what's missing.

As of v1.4 you can use sync_mode: :automatic to get automatic crontab syncing.