oban-bg / oban

💎 Robust job processing in Elixir, backed by modern PostgreSQL and SQLite3
https://getoban.pro
Apache License 2.0

Oban OSS available jobs are not being executed #1129

Closed jozuas closed 1 month ago

jozuas commented 1 month ago

Environment

Current Behavior

iex(tp@100.100.20.1)1> Oban.check_queue(queue: :docx_jobs_queue)
%{
  name: Oban,
  node: "tp@100.100.20.1",
  running: [],
  queue: "docx_jobs_queue",
  started_at: ~U[2024-07-31 15:42:29.405865Z],
  limit: 10,
  uuid: "d4144b19-65ce-4fac-a811-bc8262024359",
  updated_at: ~U[2024-07-31 15:48:08.671328Z],
  paused: false,
  refresh_interval: 30000
}

~120k rows in the DB with state available, of the form:

id  state   queue   worker  args    errors  attempt max_attempts    inserted_at scheduled_at    attempted_at    completed_at    attempted_by    discarded_at    priority    tags    meta    cancelled_at
1014124 available   docx_jobs_queue TpData.DocX.Workers.Worker  "{""source_file_key"": ""Nbm3Tpfv7Xq3kEJCLu8CxA.pdf"", ""fields_to_extract"": []}"  {}  0   1   2024-07-31 09:49:42.354352  2024-07-31 09:49:42.354352                  0   {}  {}  
1014123 available   docx_jobs_queue TpData.DocX.Workers.Worker  "{""source_file_key"": ""dZqioxW4p0XATvq8MJtVyQ.pdf"", ""fields_to_extract"": []}"  {}  0   1   2024-07-31 09:49:42.354352  2024-07-31 09:49:42.354352                  0   {}  {}  
1014122 available   docx_jobs_queue TpData.DocX.Workers.Worker  "{""source_file_key"": ""MpF3r_GyNkUxDykLW0drfQ.pdf"", ""fields_to_extract"": []}"  {}  0   1   2024-07-31 09:49:42.354352  2024-07-31 09:49:42.354352                  0   {}  {}  
1014121 available   docx_jobs_queue TpData.DocX.Workers.Worker  "{""source_file_key"": ""xu2pYncWGHq7jMyCgmHtoQ.pdf"", ""fields_to_extract"": []}"  {}  0   1   2024-07-31 09:49:42.354352  2024-07-31 09:49:42.354352                  0   {}  {}  
1014120 available   docx_jobs_queue TpData.DocX.Workers.Worker  "{""source_file_key"": ""-5LLpSXadCJLE8KBhBqJCA.pdf"", ""fields_to_extract"": []}"  {}  0   1   2024-07-31 09:49:42.354352  2024-07-31 09:49:42.354352                  0   {}  {}  
1014119 available   docx_jobs_queue TpData.DocX.Workers.Worker  "{""source_file_key"": ""3EPHsSAw3NKH0EMuvknijw.pdf"", ""fields_to_extract"": []}"  {}  0   1   2024-07-31 09:49:42.354352  2024-07-31 09:49:42.354352                  0   {}  {}  
1014118 available   docx_jobs_queue TpData.DocX.Workers.Worker  "{""source_file_key"": ""R41XlXgicV2qPn0-zB8glw.pdf"", ""fields_to_extract"": []}"  {}  0   1   2024-07-31 09:49:42.354352  2024-07-31 09:49:42.354352                  0   {}  {}  
1014117 available   docx_jobs_queue TpData.DocX.Workers.Worker  "{""source_file_key"": ""nlLLgaLLhL3R0_uS1OlFgg.pdf"", ""fields_to_extract"": []}"  {}  0   1   2024-07-31 09:49:42.354352  2024-07-31 09:49:42.354352                  0   {}  {}  
1014116 available   docx_jobs_queue TpData.DocX.Workers.Worker  "{""source_file_key"": ""OXyoMU08RGS2MsemhNcn3Q.pdf"", ""fields_to_extract"": []}"  {}  0   1   2024-07-31 09:49:42.354352  2024-07-31 09:49:42.354352                  0   {}  {}  
1014115 available   docx_jobs_queue TpData.DocX.Workers.Worker  "{""source_file_key"": ""NlDu7cWqwMs5Nlbx49CEIg.pdf"", ""fields_to_extract"": []}"  {}  0   1   2024-07-31 09:49:42.354352  2024-07-31 09:49:42.354352                  0   {}  {}  

More context: it started with a colleague of mine noticing that only 3 jobs were running; after an application restart, none of the jobs were running. I have since restarted both Postgres and the application, but that has not changed anything.

Expected Behavior

The expected behaviour is for these jobs to be in the executing state.

jozuas commented 1 month ago

The jobs started executing with Oban.drain_queue/1

iex(tp@100.100.20.1)2> Oban.drain_queue(queue: :docx_jobs_queue)

19:17:45.595 [info] {"args":{"fields_to_extract":[],"source_file_key":"T8BcswX63nLlNXDFafSbig.pdf"},"id":886610,"meta":{},"system_time":1722442665595434716,"max_attempts":1,"queue":"docx_jobs_queue","worker":"TpData.DocX.Workers.Worker","source":"oban","event":"job:start","attempt":1,"tags":[]}

sorentwo commented 1 month ago

Do you have other queues actively running jobs? Is this the only one in a "stuck" state?

The jobs started executing with Oban.drain_queue/1

That's good. Draining runs the jobs in the calling process and doesn't rely on staging or producers.
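
For completeness, a minimal sketch of draining with its optional flags (option names as documented for recent Oban versions; the queue name is taken from this thread):

# Runs available (and optionally scheduled) jobs inline in the calling
# process; with_safety: false lets crashes bubble up, which helps debugging.
Oban.drain_queue(
  queue: :docx_jobs_queue,
  with_scheduled: true,
  with_safety: false
)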

jozuas commented 1 month ago

At that point in time we had other queues, but none of them had jobs that were supposed to run - nothing in the available state for those other queues.

noozo commented 1 month ago

+1 Seeing the same problem, where jobs are just sitting there in the available state and don't get processed unless I drain_queue :(

sorentwo commented 1 month ago

How many nodes are you running on? Do you have a heterogeneous setup with separate web and worker nodes? Please share your Oban config.

jozuas commented 1 month ago

We have:

Shared compile-time config

config :tp, Oban,
  engine: Oban.Engines.Basic,
  queues: [
    apa_sources_queue: [limit: 4],
    reference_sources_queue: [limit: 3],
    bond_pipeline_queue: [limit: 1],
    sitemap_builder_queue: [limit: 1],
    docx_jobs_queue: [limit: 10]
  ],
  repo: TpData.Repo

Run-time config for the job processing node

# Modules listing minimised to not leak business information
overnight_apa_modules = [
  TpData.TraxApa
]

# Modules listing minimised to not leak business information
overnight_reference_modules = [
  TpData.Gleif.Level1,
  TpData.Gleif.Level2,
  TpData.Gleif.LeiToIsin
]

# Modules listing minimised to not leak business information
overnight_document_modules = [
  TpData.DocCollect.Esma,
  TpData.DocCollect.Edgar
]

apa_source_cron = "30 2 * * *"
reference_source_cron = "00 2 * * *"
bond_pipeline_cron = "00 4 * * *"
sitemap_builder_cron = "00 */4 * * *"

config :tp, Oban,
  plugins: [
    {Oban.Plugins.Cron,
     crontab:
       Enum.concat([
         Enum.map(overnight_apa_modules, fn mod ->
           {apa_source_cron, TpData.Workers.OvernightApaCollection, args: %{module: mod}}
         end),
         Enum.map(overnight_reference_modules, fn mod ->
           {reference_source_cron, TpData.Workers.OvernightReferenceCollection,
            args: %{module: mod}}
         end),
         Enum.map(overnight_document_modules, fn mod ->
           {reference_source_cron, TpData.Workers.OvernightDocumentsCollection,
            args: %{module: mod}}
         end),
         [{bond_pipeline_cron, TpData.Workers.BondPipeline}]
       ])},
    {Oban.Plugins.Pruner, max_age: 60 * 60 * 24 * 90},
    {Oban.Plugins.Lifeline, rescue_after: :timer.minutes(60)}
  ]

Runtime config for web node (should be irrelevant)

config :tp, Oban,
  plugins: [
    {Oban.Plugins.Cron,
     crontab:
       Enum.concat([
         [{sitemap_builder_cron, TpWeb.Workers.SitemapBuilder}]
       ])},
    {Oban.Plugins.Pruner, max_age: 60 * 60 * 24 * 90},
    {Oban.Plugins.Lifeline, rescue_after: :timer.minutes(30)}
  ]

sorentwo commented 1 month ago

The typical reasons that jobs won't process are:

  1. A node that isn't running plugins is the leader (doesn't seem to be the case here, you're running plugins)
  2. The leader can't communicate with other nodes (notifier issues, may be the case)
  3. Fetch queries are failing because of database issues (doesn't seem to be the case, draining worked)

Do you use the default logger? If so, do you see any stager:switch (or notifier:switch) events logged?
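
If it helps, here's a rough sketch of what to check from IEx. Oban.Peer.leader?/1 and Oban.check_queue/1 are standard Oban calls; the Ecto query is just an assumed way to count jobs per state against your repo:

# Only the leader node's stager notifies queues about available jobs.
Oban.Peer.leader?(Oban)

# What the producer for this queue is doing on the current node.
Oban.check_queue(queue: :docx_jobs_queue)

# Count jobs per state for the stuck queue.
import Ecto.Query

TpData.Repo.all(
  from j in Oban.Job,
    where: j.queue == "docx_jobs_queue",
    group_by: j.state,
    select: {j.state, count(j.id)}
)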

jozuas commented 1 month ago

Do you use the default logger? Here's my Application.start/2

:ok = Oban.Telemetry.attach_default_logger()

If so, do you see any stager:switch events or (notifier:switch events) logged?

Oh, I found the following in my journalctl

{"message":"notifier only receiving messages from its own node, functionality may be degraded","status":"solitary","source":"oban","event":"notifier:switch"}

sorentwo commented 1 month ago

If the notifier is switching without the stager switching, then that could mean the leader node (the one issuing notices to queues that they have jobs) can't reach other nodes. Since you're using the default postgres notifier, that means it can't send messages even to itself.

Are you using a pooler like pg_bouncer by chance? Are you able to switch to the PG notifier?
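
For reference, switching notifiers is a one-line config change. A sketch against the config shared above (note the PG notifier only works across nodes when distributed Erlang is connected):

config :tp, Oban,
  engine: Oban.Engines.Basic,
  repo: TpData.Repo,
  notifier: Oban.Notifiers.PG,
  queues: [
    docx_jobs_queue: [limit: 10]
    # ... other queues unchanged
  ]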

feld commented 1 month ago

I've been receiving reports from users of Pleroma that their jobs sometimes stop executing since the update to Oban v2.17.

I have not experienced it yet so I have not been able to diagnose it.

edit: Single node, default notifier. The v12 migration was run to disable the triggers.

sorentwo commented 1 month ago

@feld Any particular version? Oban v2.17.0 came out in December 2023. Could it be related to Elixir/Erlang changes?

feld commented 1 month ago

@sorentwo The first reports came after the update from Oban 2.13.6 to 2.17.10 (within the last month or so) as we were increasing our oldest supported Elixir version from 1.12 to 1.13. I'm observing no issues on Erlang 26.2.5 and Elixir 1.17.1, but the versions deployed by the users will vary. I'll start collecting them for you.

sorentwo commented 1 month ago

I'm observing no issues on Erlang 26.2.5 and Elixir 1.17.1, but the versions deployed by the users will vary. I'll start collecting them for you.

Great, thanks!

jozuas commented 1 month ago

@sorentwo

If the notifier is switching without the stager switching, then that could mean the leader node (the one issuing notices to queues that they have jobs) can't reach other nodes. Since you're using the default postgres notifier, that means it can't send messages even to itself.

👀

Are you using a pooler like pg_bouncer by chance?

We are not using a pooler like pg_bouncer, only Ecto pools.

Are you able to switch to the PG notifier?

I did, but it doesn't seem like anything changed. Here's an example dev config (PostgreSQL on localhost, without middleware like PGBouncer) and the message captured in a single IEx session:

12:13:28.542 pid=<0.893.0>  [info] {"message":"job staging switched to local mode. local mode polls for jobs for every queue; restore global mode with a functional notifier","mode":"local","source":"oban","event":"stager:switch"}
iex(tp@127.0.0.1)1> Oban.config
%Oban.Config{
  dispatch_cooldown: 5,
  engine: Oban.Engines.Basic,
  get_dynamic_repo: nil,
  insert_trigger: true,
  log: false,
  name: Oban,
  node: "tp@127.0.0.1",
  notifier: {Oban.Notifiers.PG, []},
  peer: {Oban.Peers.Postgres, []},
  plugins: [],
  prefix: "public",
  queues: [
    apa_sources_queue: [limit: 4],
    reference_sources_queue: [limit: 3],
    bond_pipeline_queue: [limit: 1],
    docx_jobs_queue: [limit: 20]
  ],
  repo: TpData.Repo,
  shutdown_grace_period: 15000,
  stage_interval: 1000,
  testing: :disabled
}

sorentwo commented 1 month ago

I did, but it doesn't seem like anything changed.

If you don't have a functional distributed Erlang cluster then the PG notifier won't actually work. The fact that it switched to polling mode implies it definitely can't exchange messages now. However, while polling mode isn't as efficient, it should get your jobs processing. Have your jobs been processing since this issue was opened?
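
As a quick sanity check (plain Erlang/Elixir calls, nothing Oban-specific), you can confirm whether the nodes are actually clustered before expecting the PG notifier to behave differently:

# On either node; an empty Node.list() means there is no cluster, so the
# PG notifier (built on :pg process groups) cannot reach the other node.
Node.self()
Node.list()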

There doesn't appear to be an Oban bug behind all of this, more of an environmental/configuration issue, and I'd like to move continued support to the Forum.

jozuas commented 1 month ago

I think yesterday I managed to understand what was going on. We had an Oban queue with a max capacity of 10. Only 7 jobs were running at the time, but those 7 jobs were enough to cause 100% CPU utilisation (we are doing some CPU-intensive OCR work). A colleague of mine thought there was some problem because only 7 jobs were running, so he ran delete from oban_jobs where state = 'available' followed by delete from oban_jobs where state = 'executing' and reinserted all the jobs from a Livebook. These deletes most likely caused the problem. Since I explained to him that it does not matter to us if the number of running jobs is not equal to the max queue size - we simply want the hardware to be maximally utilised - he hasn't run these deletes and we haven't had any problems with jobs not being run.

sorentwo commented 1 month ago

🤦 Deleting jobs straight out of the database will certainly do it. Let your colleague know they can use Oban.scale_queue/2 at runtime to reduce concurrency if needed.
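
A minimal sketch of that, using the queue name from this thread (the limits are just illustrative):

# Temporarily lower concurrency for the CPU-heavy OCR queue...
Oban.scale_queue(queue: :docx_jobs_queue, limit: 5)

# ...and raise it again once the backlog clears.
Oban.scale_queue(queue: :docx_jobs_queue, limit: 10)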