The jobs started executing with Oban.drain_queue/1:
iex(tp@100.100.20.1)2> Oban.drain_queue(queue: :docx_jobs_queue)
19:17:45.595 [info] {"args":{"fields_to_extract":[],"source_file_key":"T8BcswX63nLlNXDFafSbig.pdf"},"id":886610,"meta":{},"system_time":1722442665595434716,"max_attempts":1,"queue":"docx_jobs_queue","worker":"TpData.DocX.Workers.Worker","source":"oban","event":"job:start","attempt":1,"tags":[]}
Do you have other queues actively running jobs? Is this the only one in a "stuck" state?
The jobs started executing with Oban.drain_queue/1
That's good. Draining runs the jobs in the calling process and doesn't rely on staging or producers.
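As a side note, drain_queue can also pick up scheduled jobs - a minimal sketch using the queue from this thread and drain_queue's with_scheduled option:

Oban.drain_queue(queue: :docx_jobs_queue, with_scheduled: true)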
At that point in time we had other queues, but none of them had jobs that were supposed to run - nothing in available for those other queues.
+1 Seeing the same problem, where jobs are just sitting there available and don't get processed unless I drain_queue :(
How many nodes are you running on? Do you have a heterogeneous setup with separate web and worker nodes? Please share your Oban config.
We have:

- A separate web node and a job-processing node (run-time configs below).
- A read-only Ecto.Repo. All API queries are specifically configured to use the read-only Ecto.Repo.
- oban_peers

Shared compile-time config:
config :tp, Oban,
engine: Oban.Engines.Basic,
queues: [
apa_sources_queue: [limit: 4],
reference_sources_queue: [limit: 3],
bond_pipeline_queue: [limit: 1],
sitemap_builder_queue: [limit: 1],
docx_jobs_queue: [limit: 10]
],
repo: TpData.Repo
Run-time config for the job-processing node:
# Modules listing minimised to not leak business information
overnight_apa_modules = [
TpData.TraxApa
]
# Modules listing minimised to not leak business information
overnight_reference_modules = [
TpData.Gleif.Level1,
TpData.Gleif.Level2,
TpData.Gleif.LeiToIsin
]
# Modules listing minimised to not leak business information
overnight_document_modules = [
TpData.DocCollect.Esma,
TpData.DocCollect.Edgar
]
apa_source_cron = "30 2 * * *"
reference_source_cron = "00 2 * * *"
bond_pipeline_cron = "00 4 * * *"
sitemap_builder_cron = "00 */4 * * *"
config :tp, Oban,
plugins: [
{Oban.Plugins.Cron,
crontab:
Enum.concat([
Enum.map(overnight_apa_modules, fn mod ->
{apa_source_cron, TpData.Workers.OvernightApaCollection, args: %{module: mod}}
end),
Enum.map(overnight_reference_modules, fn mod ->
{reference_source_cron, TpData.Workers.OvernightReferenceCollection,
args: %{module: mod}}
end),
Enum.map(overnight_document_modules, fn mod ->
{reference_source_cron, TpData.Workers.OvernightDocumentsCollection,
args: %{module: mod}}
end),
[{bond_pipeline_cron, TpData.Workers.BondPipeline}]
])},
{Oban.Plugins.Pruner, max_age: 60 * 60 * 24 * 90},
{Oban.Plugins.Lifeline, rescue_after: :timer.minutes(60)}
]
Run-time config for the web node (should be irrelevant):
config :tp, Oban,
plugins: [
{Oban.Plugins.Cron,
crontab:
Enum.concat([
[{sitemap_builder_cron, TpWeb.Workers.SitemapBuilder}]
])},
{Oban.Plugins.Pruner, max_age: 60 * 60 * 24 * 90},
{Oban.Plugins.Lifeline, rescue_after: :timer.minutes(30)}
]
The typical reasons that jobs won't process are notifier or staging problems. Do you use the default logger? If so, do you see any stager:switch or notifier:switch events logged?
Do you use the default logger?
Here's the relevant line from my Application.start/2:
:ok = Oban.Telemetry.attach_default_logger()
If so, do you see any stager:switch or notifier:switch events logged?
Oh, I found the following in my journalctl:
{"message":"notifier only receiving messages from its own node, functionality may be degraded","status":"solitary","source":"oban","event":"notifier:switch"}
If the notifier is switching without the stager switching, then that could mean the leader node (the one issuing notices to queues that they have jobs) can't reach other nodes. Since you're using the default postgres notifier, that means it can't send messages even to itself.
Are you using a pooler like pg_bouncer by chance? Are you able to switch to the PG notifier?
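For anyone else who needs to make the same switch, it's a one-line change to the shared Oban config - a minimal sketch against the config :tp, Oban block above:

config :tp, Oban,
  notifier: Oban.Notifiers.PG

Oban.Notifiers.PG relays notifications over distributed Erlang process groups rather than Postgres LISTEN/NOTIFY, so it requires the nodes to be clustered.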
I've been receiving reports from users of Pleroma that their jobs sometimes stop executing since the update to Oban v2.17. I have not experienced it yet, so I have not been able to diagnose it.
edit: Single node, default notifier. The v12 migration was run to disable the triggers.
@feld Any particular version? Oban v2.17.0 came out in December 2023. Could it be related to Elixir/Erlang changes?
@sorentwo The first reports came after the update from Oban 2.13.6 to 2.17.10 (within the last month or so) as we were increasing our oldest supported Elixir version from 1.12 to 1.13. I'm observing no issues on Erlang 26.2.5 and Elixir 1.17.1, but the versions deployed by the users will vary. I'll start collecting them for you.
Great, thanks!
@sorentwo

If the notifier is switching without the stager switching, then that could mean the leader node (the one issuing notices to queues that they have jobs) can't reach other nodes. Since you're using the default postgres notifier, that means it can't send messages even to itself.
👀
Are you using a pooler like pg_bouncer by chance?
We are not using a pooler like pg_bouncer, only Ecto pools.
Are you able to switch to the PG notifier?
I did, but it doesn't seem like anything changed. Here's an example dev config (PostgreSQL on localhost, without middleware like PGBouncer) and the message captured in a single IEx session:
12:13:28.542 pid=<0.893.0> [info] {"message":"job staging switched to local mode. local mode polls for jobs for every queue; restore global mode with a functional notifier","mode":"local","source":"oban","event":"stager:switch"}
iex(tp@127.0.0.1)1> Oban.config
%Oban.Config{
dispatch_cooldown: 5,
engine: Oban.Engines.Basic,
get_dynamic_repo: nil,
insert_trigger: true,
log: false,
name: Oban,
node: "tp@127.0.0.1",
notifier: {Oban.Notifiers.PG, []},
peer: {Oban.Peers.Postgres, []},
plugins: [],
prefix: "public",
queues: [
apa_sources_queue: [limit: 4],
reference_sources_queue: [limit: 3],
bond_pipeline_queue: [limit: 1],
docx_jobs_queue: [limit: 20]
],
repo: TpData.Repo,
shutdown_grace_period: 15000,
stage_interval: 1000,
testing: :disabled
}
I did, but it doesn't seem like anything changed.
If you don't have a functional distributed Erlang cluster, then the PG notifier won't actually work. The fact that it switched to polling mode implies it definitely can't exchange messages now. However, while polling mode isn't as efficient, it should get your jobs processing. Have your jobs been processing since this issue was opened?
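A quick way to verify that from an attached IEx session: Node.list/0 returns the other connected nodes, so an empty list means the node is solitary and PG notifications cannot reach any peers:

iex(tp@127.0.0.1)1> Node.list()
[]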
There doesn't appear to be an Oban bug behind all of this; it looks more like an environmental or configuration issue, so I'd like to move continued support to the Forum.
I think yesterday I managed to understand what was going on. We had an Oban queue with a max capacity of 10. Only 7 jobs were running at the time, but those 7 jobs were enough to cause 100% CPU utilisation (we are doing some CPU-intensive OCR work). A colleague of mine thought there was some problem because only 7 jobs were running, so he ran delete from oban_jobs where state = 'available' followed by delete from oban_jobs where state = 'executing' and reinserted all the jobs from a Livebook. These deletes most likely caused the problem. Since I explained to him that it does not matter to us that the number of running jobs is lower than the max queue size - we simply want the hardware to be maximally utilised - he hasn't run these deletes, and we haven't had any problems with jobs not being run.
🤦 Deleting jobs straight out of the database will certainly do it. Let your colleague know they can use Oban.scale_queue/2 at runtime to reduce concurrency if needed.
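A minimal sketch of that call, using the queue from this thread (the limit of 5 is just an illustrative value):

Oban.scale_queue(queue: :docx_jobs_queue, limit: 5)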
Environment

- Oban: 2.17.12
- PostgreSQL: 16
- Elixir & Erlang/OTP (elixir --version): Erlang/OTP 26, Elixir 1.15.7
Current Behavior

~120k rows in the DB with state available.

More context - it started with a colleague of mine noticing that only 3 jobs were running; after an application restart, none of the jobs were running. I have since restarted both Postgres and the application, but that has not changed anything.
Expected Behavior

Expected behaviour is jobs in the executing state.