sorentwo / oban

💎 Robust job processing in Elixir, backed by modern PostgreSQL and SQLite3
https://getoban.pro
Apache License 2.0
3.18k stars 297 forks

Potential bug: available jobs not being run #1050

Closed · jordi-chacon closed this 3 months ago

jordi-chacon commented 4 months ago

Environment

Current Behavior

Our production environment has an Oban queue called quickbooks with available jobs that aren't being run.

distru-worker@172.16.10.9> Oban.check_queue(queue: :quickbooks)
%{
  global_limit: %Oban.Pro.Producer.Meta.GlobalLimit{
    allowed: 1,
    tracked: %{},
    partition: %Oban.Pro.Producer.Meta.GlobalLimit.Partition{
      fields: ["args"],
      keys: ["company_id"]
    }
  },
  local_limit: 1,
  name: "Oban",
  node: "distru-worker@172.16.10.9",
  paused: false,
  queue: "quickbooks",
  rate_limit: nil,
  retry_attempts: 5,
  retry_backoff: 1000,
  running: [],
  started_at: ~U[2024-03-01 04:43:46.110714Z],
  updated_at: ~U[2024-03-01 11:58:25.977869Z],
  uuid: "018df853-503e-7fc4-993c-f77f58518225"
}
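
That partitioning corresponds to a queue configured roughly as follows (a sketch reconstructed from the check_queue output above; the OTP app name and repo module are assumptions):

    # Sketch of an Oban Pro config matching the output above: a local_limit of 1
    # and a global_limit of 1 partitioned by args["company_id"].
    config :distru, Oban,
      engine: Oban.Pro.Engines.Smart,
      repo: Distru.Repo,
      queues: [
        quickbooks: [
          local_limit: 1,
          global_limit: [allowed: 1, partition: [fields: [:args], keys: [:company_id]]]
        ]
      ]
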
select * from oban_producers where queue = 'quickbooks';
-[ RECORD 1 ]------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
uuid        | 018df853-503e-7fc4-993c-f77f58518225
name        | Oban
node        | distru-worker@172.16.10.9
queue       | quickbooks
meta        | {"paused": false, "rate_limit": null, "local_limit": 1, "global_limit": {"allowed": 1, "tracked": {"55156538": {"args": {}, "count": 1, "worker": null}}, "partition": {"keys": ["company_id"], "fields": ["args"]}}, "retry_backoff": 1000, "retry_attempts": 5}
running_ids | {}
started_at  | 2024-03-01 04:43:46.110714
updated_at  | 2024-03-01 11:51:28.61792
-[ RECORD 2 ]------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
uuid        | 018df856-c203-7e5d-865d-d237a11b7fa9
name        | Oban
node        | distru-worker@172.16.6.4
queue       | quickbooks
meta        | {"paused": false, "rate_limit": null, "local_limit": 1, "global_limit": {"allowed": 1, "tracked": {"55156538": {"args": {}, "count": 1, "worker": null}}, "partition": {"keys": ["company_id"], "fields": ["args"]}}, "retry_backoff": 1000, "retry_attempts": 5}
running_ids | {}
started_at  | 2024-03-01 04:47:31.843039
updated_at  | 2024-03-01 11:51:32.06871

The meta field on those two oban_producer rows has not changed for more than 45 minutes.

There are only 6 oban_job rows in that queue, all of them scheduled 20-45 minutes ago, still waiting to be picked up:

distru_prod=> select * from oban_jobs where queue = 'quickbooks';
-[ RECORD 1 ]+----------------------------------------------------------------------------------------------
id           | 1436546800
state        | available
queue        | quickbooks
worker       | Distru.Workers.Quickbooks.Invoice
args         | {"activity_id": 2515744}
errors       | {}
attempt      | 0
max_attempts | 1
inserted_at  | 2024-03-01 11:33:11.648113
scheduled_at | 2024-03-01 11:33:11.648113
attempted_at |
completed_at |
attempted_by |
discarded_at |
priority     | 0
tags         | {}
meta         | {"uniq_key": 21778305}
cancelled_at |
-[ RECORD 2 ]+----------------------------------------------------------------------------------------------
id           | 1436526012
state        | available
queue        | quickbooks
worker       | Distru.Workers.Quickbooks.Cache.UpdateAllVendorCaches
args         | {}
errors       | {}
attempt      | 0
max_attempts | 2
inserted_at  | 2024-03-01 11:09:00.680848
scheduled_at | 2024-03-01 11:09:00.680848
attempted_at |
completed_at |
attempted_by |
discarded_at |
priority     | 0
tags         | {}
meta         | {"uniq_key": 7130876, "cron_name": "Distru.Workers.Quickbooks.Cache.UpdateAllVendorCaches"}
cancelled_at |
-[ RECORD 3 ]+----------------------------------------------------------------------------------------------
id           | 1436526013
state        | available
queue        | quickbooks
worker       | Distru.Workers.Quickbooks.Cache.UpdateAllCustomerCaches
args         | {}
errors       | {}
attempt      | 0
max_attempts | 2
inserted_at  | 2024-03-01 11:09:00.680848
scheduled_at | 2024-03-01 11:09:00.680848
attempted_at |
completed_at |
attempted_by |
discarded_at |
priority     | 0
tags         | {}
meta         | {"uniq_key": 8143479, "cron_name": "Distru.Workers.Quickbooks.Cache.UpdateAllCustomerCaches"}
cancelled_at |
-[ RECORD 4 ]+----------------------------------------------------------------------------------------------
id           | 1436537052
state        | available
queue        | quickbooks
worker       | Distru.Workers.Quickbooks.Invoice
args         | {"activity_id": 2515742}
errors       | {}
attempt      | 0
max_attempts | 1
inserted_at  | 2024-03-01 11:21:44.338426
scheduled_at | 2024-03-01 11:21:44.338426
attempted_at |
completed_at |
attempted_by |
discarded_at |
priority     | 0
tags         | {}
meta         | {"uniq_key": 17555946}
cancelled_at |
-[ RECORD 5 ]+----------------------------------------------------------------------------------------------
id           | 1436537053
state        | available
queue        | quickbooks
worker       | Distru.Workers.Quickbooks.Invoice
args         | {"activity_id": 2515743}
errors       | {}
attempt      | 0
max_attempts | 1
inserted_at  | 2024-03-01 11:21:44.338426
scheduled_at | 2024-03-01 11:21:44.338426
attempted_at |
completed_at |
attempted_by |
discarded_at |
priority     | 0
tags         | {}
meta         | {"uniq_key": 21873225}
cancelled_at |
-[ RECORD 6 ]+----------------------------------------------------------------------------------------------
id           | 1436526006
state        | available
queue        | quickbooks
worker       | Distru.Workers.Quickbooks.Invoice
args         | {"activity_id": 2515741}
errors       | {}
attempt      | 0
max_attempts | 1
inserted_at  | 2024-03-01 11:08:04.661139
scheduled_at | 2024-03-01 11:11:53.863781
attempted_at | 2024-03-01 11:11:43.802447
completed_at |
attempted_by | {distru-worker@172.16.6.4,018df856-c203-7e5d-865d-d237a11b7fa9}
discarded_at |
priority     | 0
tags         | {}
meta         | {"snoozed": 1, "uniq_key": 102483006}
cancelled_at |

Searching for oban and quickbooks in our logs yields these results:

time=2024-03-01T11:54:47.662 level=info message=oban:producer:stop dispatched_count=0 queue=quickbooks
time=2024-03-01T11:54:47.642 level=info message=oban:producer:stop dispatched_count=0 queue=quickbooks
time=2024-03-01T11:54:47.632 level=info message=oban:producer:start queue=quickbooks
time=2024-03-01T11:54:47.631 level=info message=oban:producer:start queue=quickbooks
time=2024-03-01T11:54:46.647 level=info message=oban:producer:stop dispatched_count=0 queue=quickbooks
time=2024-03-01T11:54:46.630 level=info message=oban:producer:stop dispatched_count=0 queue=quickbooks
time=2024-03-01T11:54:46.620 level=info message=oban:producer:start queue=quickbooks
time=2024-03-01T11:54:46.619 level=info message=oban:producer:start queue=quickbooks
time=2024-03-01T11:54:45.633 level=info message=oban:producer:stop dispatched_count=0 queue=quickbooks
time=2024-03-01T11:54:45.619 level=info message=oban:producer:stop dispatched_count=0 queue=quickbooks
time=2024-03-01T11:54:45.608 level=info message=oban:producer:start queue=quickbooks
time=2024-03-01T11:54:45.608 level=info message=oban:producer:start queue=quickbooks
time=2024-03-01T11:54:44.623 level=info message=oban:producer:stop dispatched_count=0 queue=quickbooks
time=2024-03-01T11:54:44.609 level=info message=oban:producer:stop dispatched_count=0 queue=quickbooks
time=2024-03-01T11:54:44.598 level=info message=oban:producer:start queue=quickbooks
time=2024-03-01T11:54:44.597 level=info message=oban:producer:start queue=quickbooks
time=2024-03-01T11:54:43.615 level=info message=oban:producer:stop dispatched_count=0 queue=quickbooks
time=2024-03-01T11:54:43.597 level=info message=oban:producer:stop dispatched_count=0 queue=quickbooks  
time=2024-03-01T11:54:43.586 level=info message=oban:producer:start queue=quickbooks
time=2024-03-01T11:54:43.586 level=info message=oban:producer:start queue=quickbooks
time=2024-03-01T11:54:42.608 level=info message=oban:producer:stop dispatched_count=0 queue=quickbooks
time=2024-03-01T11:54:42.586 level=info message=oban:producer:stop dispatched_count=0 queue=quickbooks

Expected Behavior

We would expect Oban to run those 6 jobs.

Please let us know if you need any more info from us to help you debug this potential problem. Thank you so much!!

sorentwo commented 4 months ago

Thank you for all of the details, that's very helpful! I'll investigate.

Kalaww commented 4 months ago

Hello, I have encountered the same issue twice. I am using Oban Pro with a queue that has a partition key based on args, and one partition got stuck.

Kalaww commented 4 months ago

My versions are:

Kalaww commented 4 months ago

My queue is set up like this:

    {:my_queue, [
      local_limit: 3,
      global_limit: [
        allowed: 1,
        partition: [fields: [:args], keys: [:match_id, :company_id]]
      ]
    ]}
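
For context, jobs in this queue carry their partition keys inside args, so an insert looks roughly like the sketch below (the worker module and argument values are hypothetical, for illustration only):

    # Hypothetical worker; the Smart engine reads the partition keys
    # (match_id, company_id) from args, so at most one job per
    # {match_id, company_id} pair runs at a time across the cluster.
    defmodule MyApp.Workers.ProcessMatch do
      use Oban.Worker, queue: :my_queue

      @impl Oban.Worker
      def perform(%Oban.Job{args: %{"match_id" => _match_id, "company_id" => _company_id}}) do
        :ok
      end
    end

    %{match_id: 123, company_id: 456}
    |> MyApp.Workers.ProcessMatch.new()
    |> Oban.insert()
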

All jobs with partition keys match_id=932200 and company_id=414 were stuck in the available state. I checked the oban_producers table and found one producer whose meta still includes that partition key:

{
  "paused": false,
  "rate_limit": null,
  "local_limit": 3,
  "global_limit": {
    "allowed": 1,
    "tracked": {
      "102888629": {
        "args": {
          "match_id": 932200,
          "company_id": 414
        },
        "count": 1,
        "worker": null
      },
      "109020306": {
        "args": {
          "match_id": 932225,
          "company_id": 54
        },
        "count": 1,
        "worker": null
      },
      "116150022": {
        "args": {
          "match_id": 932225,
          "company_id": 95
        },
        "count": 1,
        "worker": null
      }
    },
    "partition": {
      "keys": [
        "match_id",
        "company_id"
      ],
      "fields": [
        "args"
      ]
    }
  },
  "retry_backoff": 1000,
  "retry_attempts": 5
}
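
To see which partitions a producer still believes are occupied, the tracked map can be pulled straight out of oban_producers with a schemaless query, roughly like this (a sketch; MyApp.Repo and the queue name are placeholders):

    import Ecto.Query

    # List each producer's tracked partitions alongside its last update time;
    # a stale entry here corresponds to a stuck partition like the one above.
    MyApp.Repo.all(
      from p in "oban_producers",
        where: p.queue == "my_queue",
        select: %{
          node: p.node,
          updated_at: p.updated_at,
          tracked: fragment("?->'global_limit'->'tracked'", p.meta)
        }
    )
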
jordi-chacon commented 1 month ago

Hey @sorentwo, we are still seeing this issue in our production environment.

Environment

Current Behavior

Our production environment has an Oban queue called quickbooks with available jobs that aren't being run.

distru-worker@172.16.10.9> Oban.check_queue(queue: :quickbooks)
%{
  global_limit: %Oban.Pro.Producer.Meta.GlobalLimit{
    allowed: 1,
    tracked: %{},
    partition: %Oban.Pro.Producer.Meta.GlobalLimit.Partition{
      fields: [:args],
      keys: ["company_id"]
    }
  },
  local_limit: 1,
  name: "Oban",
  node: "distru-worker@172.16.9.136",
  paused: false,
  queue: "quickbooks",
  rate_limit: nil,
  retry_attempts: 5,
  retry_backoff: 1000,
  running: [],
  started_at: ~U[2024-05-24 05:03:44.209758Z],
  updated_at: ~U[2024-05-24 08:39:42.510683Z],
  uuid: "018fa8fb-c851-776c-9a5c-bf8f29c7cdc7"
}
select * from oban_producers where queue = 'quickbooks';
-[ RECORD 1 ]------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
uuid        | 018fa8fb-c851-776c-9a5c-bf8f29c7cdc7
name        | Oban
node        | distru-worker@172.16.9.136
queue       | quickbooks
meta        | {"paused": false, "rate_limit": null, "local_limit": 1, "global_limit": {"allowed": 1, "tracked": {"55156538": {"args": {}, "count": 1, "worker": null}}, "partition": {"keys": ["company_id"], "fields": ["args"]}}, "retry_backoff": 1000, "retry_attempts": 5}
running_ids | {}
started_at  | 2024-05-24 05:03:44.209758
updated_at  | 2024-05-24 08:42:02.817981
-[ RECORD 2 ]------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
uuid        | 018fa9a0-cf30-7b48-bdf2-f6e08e3d4f00
name        | Oban
node        | distru-worker@172.16.8.136
queue       | quickbooks
meta        | {"paused": false, "rate_limit": null, "local_limit": 1, "global_limit": {"allowed": 1, "tracked": {"55156538": {"args": {}, "count": 1, "worker": null}}, "partition": {"keys": ["company_id"], "fields": ["args"]}}, "retry_backoff": 1000, "retry_attempts": 5}
running_ids | {}
started_at  | 2024-05-24 08:03:59.408379
updated_at  | 2024-05-24 08:42:10.702454

There are only 2 oban_job rows in the quickbooks queue, both of them scheduled 30 minutes ago, still waiting to be picked up:

select now();
-[ RECORD 1 ]----------------------
now | 2024-05-24 08:43:23.494669+00

select * from oban_jobs where queue = 'quickbooks';
-[ RECORD 1 ]+----------------------------------------------------------------------------------------------------------------------------------------
id           | 1557360114
state        | available
queue        | quickbooks
worker       | Distru.Workers.Quickbooks.Cache.UpdateAllCustomerCaches
args         | {}
errors       | {}
attempt      | 0
max_attempts | 2
inserted_at  | 2024-05-24 08:15:00.178444
scheduled_at | 2024-05-24 08:15:00.178444
attempted_at |
completed_at |
attempted_by |
discarded_at |
priority     | 0
tags         | {}
meta         | {"cron": true, "uniq_key": 8143479, "cron_expr": "*/3 * * * *", "cron_name": "Distru.Workers.Quickbooks.Cache.UpdateAllCustomerCaches"}
cancelled_at |
-[ RECORD 2 ]+----------------------------------------------------------------------------------------------------------------------------------------
id           | 1557360115
state        | available
queue        | quickbooks
worker       | Distru.Workers.Quickbooks.Cache.UpdateAllVendorCaches
args         | {}
errors       | {}
attempt      | 0
max_attempts | 2
inserted_at  | 2024-05-24 08:15:00.178444
scheduled_at | 2024-05-24 08:15:00.178444
attempted_at |
completed_at |
attempted_by |
discarded_at |
priority     | 0
tags         | {}
meta         | {"cron": true, "uniq_key": 7130876, "cron_expr": "*/3 * * * *", "cron_name": "Distru.Workers.Quickbooks.Cache.UpdateAllVendorCaches"}
cancelled_at |

Searching for oban and quickbooks in our logs yields these results:

2024-05-24 08:45:01 
time=2024-05-24T08:45:01.889 level=info message=oban:producer:stop dispatched_count=0 queue=quickbooks
2024-05-24 08:45:01 
time=2024-05-24T08:45:01.876 level=info message=oban:producer:start queue=quickbooks
2024-05-24 08:45:00 
time=2024-05-24T08:45:00.936 level=info message=oban:producer:stop dispatched_count=0 queue=quickbooks
2024-05-24 08:45:00 
time=2024-05-24T08:45:00.921 level=info message=oban:producer:start queue=quickbooks
2024-05-24 08:45:00 
time=2024-05-24T08:45:00.916 level=info message=oban:producer:stop dispatched_count=0 queue=quickbooks
2024-05-24 08:45:00 
time=2024-05-24T08:45:00.892 level=info message=oban:producer:stop dispatched_count=0 queue=quickbooks
2024-05-24 08:45:00 
time=2024-05-24T08:45:00.881 level=info message=oban:producer:start queue=quickbooks
2024-05-24 08:45:00 
time=2024-05-24T08:45:00.875 level=info message=oban:producer:stop dispatched_count=0 queue=quickbooks
2024-05-24 08:45:00 
time=2024-05-24T08:45:00.864 level=info message=oban:producer:start queue=quickbooks
2024-05-24 08:45:00 
time=2024-05-24T08:45:00.863 level=info message=oban:producer:start queue=quickbooks
2024-05-24 08:44:59 
time=2024-05-24T08:44:59.879 level=info message=oban:producer:stop dispatched_count=0 queue=quickbooks
2024-05-24 08:44:59 
time=2024-05-24T08:44:59.864 level=info message=oban:producer:stop dispatched_count=0 queue=quickbooks
2024-05-24 08:44:59 
time=2024-05-24T08:44:59.852 level=info message=oban:producer:start queue=quickbooks
2024-05-24 08:44:59 
time=2024-05-24T08:44:59.852 level=info message=oban:producer:start queue=quickbooks
2024-05-24 08:44:58 
time=2024-05-24T08:44:58.864 level=info message=oban:producer:stop dispatched_count=0 queue=quickbooks
2024-05-24 08:44:58 
time=2024-05-24T08:44:58.850 level=info message=oban:producer:stop dispatched_count=0 queue=quickbooks
...

Expected Behavior

We would expect Oban to run those 2 jobs.

Please let us know if you need any more info from us to help you debug this potential problem. Thank you so much!!

sorentwo commented 1 month ago

Truly surprised that this is happening with Pro v1.4.6. I'll look into whether there's an edge case where the producer isn't updated somehow, as it seems to be a recurring issue for your app in particular.

sorentwo commented 1 month ago

@jordi-chacon Tracked it down! It'll be fixed in v1.4.9, to be released today.

jordi-chacon commented 1 month ago

@sorentwo Really great news!! Excited to give it a try. I'll let you know the outcome. Thank you so much for your work!

axelson commented 4 days ago

@sorentwo could a similar bug affect the OSS version? I am noticing a similar issue on OSS 2.17.10, where only 3 jobs were running even though the queue's limit is 50. After restarting the server I'm not seeing the behavior again so far (though it may reappear as we work through the backlog of jobs).
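
For reference, the queue in question uses a plain concurrency limit, configured roughly like this (a sketch; the app and repo names are assumptions):

    # Plain OSS queue with a concurrency limit of 50 (no Pro engine involved).
    config :felt, Oban,
      repo: Felt.Repo,
      queues: [bulk_reprocess: 50]
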

sorentwo commented 4 days ago

could a similar bug affect the OSS version?

No, this was specifically a race that prevented proper tracking in the database. OSS doesn't do that at all.

If that happens again, please run check_queue/1 and share the output. It will show how many tracked jobs the producer had.
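
For reference, that check looks like this from a remote console (a sketch; pass the instance name first only if you run a named Oban supervisor):

    # Default Oban instance:
    Oban.check_queue(queue: :bulk_reprocess)

    # Named instance (the name here is an assumption, for illustration):
    Oban.check_queue(MyApp.Oban, queue: :bulk_reprocess)
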

axelson commented 4 days ago

Okay, it's happened again; now it's only using 39 of the 50 available job slots (screenshot: Screenshot 2024-07-14 at 21 45 17@2x).

iex(felt@10.104.228.146)2> Oban.check_queue(queue: :bulk_reprocess)
%{
  name: Oban,
  node: "felt@10.104.228.146",
  running: [5919573, 5914058, 5921656, 5912884, 5915958, 5913434, 5923039,
   5922405, 5923052, 5923054, 5923031, 5922996, 5915958, 5918199, 5923048,
   5909406, 5923046, 5914058, 5912884, 5923045, 5915095, 5923051, 5922859,
   5923053, 5922999, 5923047, 5922839, 5916356, 5923050, 5916356, 5923032,
   5921264, 5918199, 5923042, 5921414, 5921444, 5913078, 5911265, 5921445,
   5923043, 5913078, 5920784, 5922869, 5915095, 5921586, 5913434, 5923044, ...],
  queue: "bulk_reprocess",
  started_at: ~U[2024-07-15 00:34:16.545960Z],
  limit: 50,
  uuid: "8b30b35f-e0e3-4131-bb6b-32c89aae12d7",
  updated_at: ~U[2024-07-15 02:44:39.954415Z],
  paused: false,
  refresh_interval: 30000
}

And the list of running job ids does contain 50 entries.
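
A quick sanity check from IEx is to compare the raw length of that list with the number of distinct ids, since a mismatch would mean some job ids are tracked more than once (a sketch):

    # Compare total vs. distinct running ids reported by the producer.
    %{running: running} = Oban.check_queue(queue: :bulk_reprocess)
    {length(running), length(Enum.uniq(running))}
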

Also let me know if I should open this as a separate issue since it seems likely to be unrelated.

sorentwo commented 3 days ago

@axelson It seems like a down/exit message isn't handled properly, leaking running ids that never get cleaned up. This has affected some others as well, but they haven't captured it in the moment.

I'll contact you to see if we can get to the bottom of what's happening here.