sorentwo / oban

💎 Robust job processing in Elixir, backed by modern PostgreSQL and SQLite3
https://getoban.pro
Apache License 2.0

Oban Pro v1.3.1 Workflows jobs not moving to executing state #1019

Closed. omaralsoudanii closed this issue 6 months ago.

omaralsoudanii commented 6 months ago

Environment

Current Behavior

Hi! After upgrading to the latest Oban Pro v1.3.1 and using Oban.Pro.Workers.Workflow with ack_async: false, jobs stay in the available state and hang there.

Without ack_async: false, the same jobs move from the available state to the executing state correctly.

I added ack_async: false because I'm using recorded output in the jobs.

Expected Behavior

The ability to use Oban.Pro.Workers.Workflow with ack_async: false, with jobs moving from available to executing as usual.
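
For context, a sketch of where ack_async: false is set in the Oban configuration. The app, repo, and queue names below are hypothetical; the Smart engine is what provides queue options such as ack_async and global_limit.

import Config

# config/config.exs (hypothetical names; only the relevant options shown)
config :my_app, Oban,
  engine: Oban.Pro.Engines.Smart,
  repo: MyApp.Repo,
  queues: [
    workflow_queue: [local_limit: 50, ack_async: false]
  ]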

vhf commented 6 months ago

We're observing the same thing with Chunk workers.

sorentwo commented 6 months ago

Do you have global limits or rate limits configured? Will you share more about your configuration?

vhf commented 6 months ago

In our case (please say so if you'd rather have me open another issue):

my_queue: [global_limit: 10, paused: true],

use Oban.Pro.Workers.Chunk,
  queue: :my_queue,
  size: 75,
  timeout: :timer.seconds(10),
  max_attempts: 10

my_queue is automatically unpaused after startup. What we're seeing is the unpaused queue with hundreds of thousands of jobs accumulating in the "available" state and no jobs going through. Sometimes a burst of jobs goes through, then not a single one for more than 30 minutes, then another short burst, and so on. This is a single-node setup, so global_limit could just as well be local.
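
For anyone following along, the two Oban calls below are a minimal sketch of resuming and inspecting such a queue; how vhf's app actually unpauses it after startup isn't shown in this thread.

# Resume the queue that was configured with paused: true.
Oban.resume_queue(queue: :my_queue)

# Inspect the producer: the returned map includes :paused and a :running
# list of executing job ids, useful for confirming whether jobs flow at all.
Oban.check_queue(queue: :my_queue)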

sorentwo commented 6 months ago

@vhf Thanks! That's very helpful.

The original issue mentions that they've explicitly set ack_async: false. Is that true in your case as well?

vhf commented 6 months ago

The original issue mentions that they've explicitly set ack_async: false. Is that true in your case as well?

It is not. All we did was upgrade oban_pro from 1.2.2 to 1.3.0 without any code or config change and we ran into this problem. We downgraded to 1.2.2 (which works as expected), upgraded from 1.2.2 to 1.3.1, hit the problem again, and solved it by downgrading to 1.2.2 once more.

sorentwo commented 6 months ago

The original issue was from the combination of a global limit and ack_async: false, not due to workflows.

@vhf Your issue is from the combination of a globally limited queue and the chunk worker. Fixing that one now.

sorentwo commented 6 months ago

@vhf I may have spoken too soon there. I can't recreate your issue so far. We're also running chunks in a globally limited queue on v1.3.1 and it's working as expected. Please reach out on Slack so I can gather additional details.

omaralsoudanii commented 6 months ago

Hey @sorentwo, we do use a combination of global limit and ack_async: false - here's a sample config for a queue:

[
  ack_async: false,
  local_limit: 50,
  global_limit: [
    allowed: 10,
    partition: [fields: [:args], keys: [:some_key]]
  ],
  rate_limit: [allowed: 5_000, period: {1, :minute}]
]

I'm not sure if there are any alternatives to make the Workflow work with the above config. I tried again, and the same behaviour happens: jobs stall in the available state.
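
For reference, the partition above is meant to cap execution at 10 concurrent jobs per distinct some_key value in args. A hedged illustration follows, leaving workflow wiring aside and using a hypothetical worker module:

# MyApp.WorkflowStep is hypothetical; any worker enqueued into this queue
# works the same way. Jobs with some_key "tenant_a" and "tenant_b" land in
# separate partitions, so each group should run at most 10 jobs at a time.
jobs =
  for key <- ["tenant_a", "tenant_b"], n <- 1..50 do
    MyApp.WorkflowStep.new(%{some_key: key, n: n})
  end

Oban.insert_all(jobs)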

sorentwo commented 6 months ago

@omaralsoudanii The issue you encountered is fixed in v1.3.2. Thanks for the report!

omaralsoudanii commented 6 months ago

Thanks for jumping on this quickly, @sorentwo! I tested it, and the jobs are getting executed now. However, I noticed that the global limit partitioning doesn't work anymore. In the example below, Oban Pro v1.2 executes 10 jobs at a time, while with Oban Pro v1.3 all the jobs execute at the same time 🤔

global_limit: [
  allowed: 10,
  partition: [fields: [:args], keys: [:some_key]]
]

sorentwo commented 6 months ago

@omaralsoudanii Is that also with ack_async: false? Side question, what prompted you to run with ack_async: false initially?

omaralsoudanii commented 6 months ago

@sorentwo Yeah, it is also with ack_async: false. This is the full config for the queue :)

[
  ack_async: false,
  local_limit: 50,
  global_limit: [
    allowed: 10,
    partition: [fields: [:args], keys: [:some_key]]
  ],
  rate_limit: [allowed: 5_000, period: {1, :minute}]
]

Side question, what prompted you to run with ack_async: false initially?

I'm using the recorded jobs feature in Oban. As soon as the job finishes executing, I retrieve the data via the Worker hooks. This is not possible with the newly introduced async tracking due to the slight lag documented here: https://getoban.pro/docs/pro/1.3.2/Oban.Pro.Engines.Smart.html#module-async-tracking
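
As a rough sketch of that pattern, a recorded workflow worker might look like the following. The module, queue, and args are hypothetical, and recorded: true is assumed to store the return value of process/1 for later retrieval.

defmodule MyApp.RecordedStep do
  use Oban.Pro.Workers.Workflow, queue: :workflow_queue, recorded: true

  def process(%Oban.Job{args: %{"some_key" => key}}) do
    # The {:ok, value} tuple is what gets recorded once the job completes.
    {:ok, %{processed_key: key}}
  end
end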

sorentwo commented 6 months ago

Edit: Scratch that. The recording is removed before the hook fires. It should be available without pulling it back from the database though.

omaralsoudanii commented 5 months ago

@sorentwo The issue I'm noticing now is that the global partition limiting doesn't work. In the example I sent, Oban Pro v1.2 executes at most 10 jobs in the queue at a time; right now all the jobs execute at the same time with no limit applied.

sorentwo commented 5 months ago

@omaralsoudanii v1.3.3 is out after extensive testing and an overhaul of sync acking to force serialized updates.

In addition, there's a new after_process/3 callback so you can get the return value in the hook without fetching. Now you don't need to use ack_async: false 🙂

def after_process(state, job, result) do
  ...
end
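
As a usage sketch built on that snippet (module and queue names hypothetical, and the :ok return from the hook assumed), the recorded result can be read directly from the third argument:

defmodule MyApp.RecordedStep do
  use Oban.Pro.Workers.Workflow, queue: :workflow_queue, recorded: true

  require Logger

  def process(%Oban.Job{args: args}), do: {:ok, args}

  # state is the execution outcome (for example :complete) and result carries
  # the value returned from process/1, so no database fetch is needed here.
  def after_process(state, %Oban.Job{id: id}, result) do
    Logger.info("workflow job #{id} finished as #{state}: #{inspect(result)}")
    :ok
  end
end
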
omaralsoudanii commented 4 months ago

Thank you @sorentwo! It works perfectly now without the need to use ack_async: false. Love the addition of the new after_process/3 hook!