Closed. jordi-chacon closed this issue 3 months ago.
Thank you for all of the details, that's very helpful! I'll investigate.
Hello, I have encountered the same issue twice. I am using Oban Pro with a queue that has a partition key on args, and one partition got stuck.
My versions are:
Oban 2.17.3
Oban Pro 1.3.4
PostgreSQL 15
Elixir 1.14.3
Erlang/OTP 25.2.2
My queue is set up like this:
{:my_queue, [
  local_limit: 3,
  global_limit: [
    allowed: 1,
    partition: [fields: [:args], keys: [:match_id, :company_id]]
  ]
]}
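For reference, that tuple sits in the Oban configuration roughly like this. This is a sketch rather than a copy of our config: my_app and MyApp.Repo are placeholder names, and the engine entry assumes Pro 1.x's Smart engine, which global partitioning requires.

config :my_app, Oban,
  engine: Oban.Pro.Engines.Smart,
  repo: MyApp.Repo,
  queues: [
    my_queue: [
      local_limit: 3,
      global_limit: [
        allowed: 1,
        partition: [fields: [:args], keys: [:match_id, :company_id]]
      ]
    ]
  ]

With this setup each node may run up to 3 jobs from the queue locally, but only 1 job per {match_id, company_id} pair should run across the whole cluster.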
All jobs with partition key match_id=932200 and company_id=414 were stuck in the available state. I checked the oban_producers table and found one producer tracking that partition key:
{
  "paused": false,
  "rate_limit": null,
  "local_limit": 3,
  "global_limit": {
    "allowed": 1,
    "tracked": {
      "102888629": {
        "args": {
          "match_id": 932200,
          "company_id": 414
        },
        "count": 1,
        "worker": null
      },
      "109020306": {
        "args": {
          "match_id": 932225,
          "company_id": 54
        },
        "count": 1,
        "worker": null
      },
      "116150022": {
        "args": {
          "match_id": 932225,
          "company_id": 95
        },
        "count": 1,
        "worker": null
      }
    },
    "partition": {
      "keys": [
        "match_id",
        "company_id"
      ],
      "fields": [
        "args"
      ]
    }
  },
  "retry_backoff": 1000,
  "retry_attempts": 5
}
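For anyone hitting the same thing, a query along these lines can confirm which jobs are stuck in the partition from iex (a sketch; MyApp.Repo stands in for the application's repo module):

import Ecto.Query

# Jobs sitting in the available state for the wedged partition (match_id 932200).
# Oban stores args as jsonb, so ->> compares the value as text.
MyApp.Repo.all(
  from j in Oban.Job,
    where: j.queue == "my_queue" and j.state == "available",
    where: fragment("?->>'match_id' = ?", j.args, "932200"),
    select: %{id: j.id, args: j.args, inserted_at: j.inserted_at}
)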
Hey @sorentwo, we are still seeing this issue in our production environment.
Oban 2.17.10
Oban Pro 1.4.6
PostgreSQL 14.9 on x86_64-pc-linux-gnu, compiled by Debian clang version 12.0.1, 64-bit
Elixir 1.15.4
Erlang/OTP 25.3.2.5
Our production environment has an Oban queue called quickbooks with available jobs that aren't being run.
distru-worker@172.16.10.9> Oban.check_queue(queue: :quickbooks)
%{
  global_limit: %Oban.Pro.Producer.Meta.GlobalLimit{
    allowed: 1,
    tracked: %{},
    partition: %Oban.Pro.Producer.Meta.GlobalLimit.Partition{
      fields: [:args],
      keys: ["company_id"]
    }
  },
  local_limit: 1,
  name: "Oban",
  node: "distru-worker@172.16.9.136",
  paused: false,
  queue: "quickbooks",
  rate_limit: nil,
  retry_attempts: 5,
  retry_backoff: 1000,
  running: [],
  started_at: ~U[2024-05-24 05:03:44.209758Z],
  updated_at: ~U[2024-05-24 08:39:42.510683Z],
  uuid: "018fa8fb-c851-776c-9a5c-bf8f29c7cdc7"
}
select * from oban_producers where queue = 'quickbooks';
-[ RECORD 1 ]------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
uuid | 018fa8fb-c851-776c-9a5c-bf8f29c7cdc7
name | Oban
node | distru-worker@172.16.9.136
queue | quickbooks
meta | {"paused": false, "rate_limit": null, "local_limit": 1, "global_limit": {"allowed": 1, "tracked": {"55156538": {"args": {}, "count": 1, "worker": null}}, "partition": {"keys": ["company_id"], "fields": ["args"]}}, "retry_backoff": 1000, "retry_attempts": 5}
running_ids | {}
started_at | 2024-05-24 05:03:44.209758
updated_at | 2024-05-24 08:42:02.817981
-[ RECORD 2 ]------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
uuid | 018fa9a0-cf30-7b48-bdf2-f6e08e3d4f00
name | Oban
node | distru-worker@172.16.8.136
queue | quickbooks
meta | {"paused": false, "rate_limit": null, "local_limit": 1, "global_limit": {"allowed": 1, "tracked": {"55156538": {"args": {}, "count": 1, "worker": null}}, "partition": {"keys": ["company_id"], "fields": ["args"]}}, "retry_backoff": 1000, "retry_attempts": 5}
running_ids | {}
started_at | 2024-05-24 08:03:59.408379
updated_at | 2024-05-24 08:42:10.702454
There are only 2 oban_job rows in the quickbooks queue, both of them scheduled 30 minutes ago, still waiting to be picked up:
select now();
-[ RECORD 1 ]----------------------
now | 2024-05-24 08:43:23.494669+00
select * from oban_jobs where queue = 'quickbooks';
-[ RECORD 1 ]+----------------------------------------------------------------------------------------------------------------------------------------
id | 1557360114
state | available
queue | quickbooks
worker | Distru.Workers.Quickbooks.Cache.UpdateAllCustomerCaches
args | {}
errors | {}
attempt | 0
max_attempts | 2
inserted_at | 2024-05-24 08:15:00.178444
scheduled_at | 2024-05-24 08:15:00.178444
attempted_at |
completed_at |
attempted_by |
discarded_at |
priority | 0
tags | {}
meta | {"cron": true, "uniq_key": 8143479, "cron_expr": "*/3 * * * *", "cron_name": "Distru.Workers.Quickbooks.Cache.UpdateAllCustomerCaches"}
cancelled_at |
-[ RECORD 2 ]+----------------------------------------------------------------------------------------------------------------------------------------
id | 1557360115
state | available
queue | quickbooks
worker | Distru.Workers.Quickbooks.Cache.UpdateAllVendorCaches
args | {}
errors | {}
attempt | 0
max_attempts | 2
inserted_at | 2024-05-24 08:15:00.178444
scheduled_at | 2024-05-24 08:15:00.178444
attempted_at |
completed_at |
attempted_by |
discarded_at |
priority | 0
tags | {}
meta | {"cron": true, "uniq_key": 7130876, "cron_expr": "*/3 * * * *", "cron_name": "Distru.Workers.Quickbooks.Cache.UpdateAllVendorCaches"}
cancelled_at |
Searching for oban and quickbooks in our logs yields these results:
time=2024-05-24T08:45:01.889 level=info message=oban:producer:stop dispatched_count=0 queue=quickbooks
time=2024-05-24T08:45:01.876 level=info message=oban:producer:start queue=quickbooks
time=2024-05-24T08:45:00.936 level=info message=oban:producer:stop dispatched_count=0 queue=quickbooks
time=2024-05-24T08:45:00.921 level=info message=oban:producer:start queue=quickbooks
time=2024-05-24T08:45:00.916 level=info message=oban:producer:stop dispatched_count=0 queue=quickbooks
time=2024-05-24T08:45:00.892 level=info message=oban:producer:stop dispatched_count=0 queue=quickbooks
time=2024-05-24T08:45:00.881 level=info message=oban:producer:start queue=quickbooks
time=2024-05-24T08:45:00.875 level=info message=oban:producer:stop dispatched_count=0 queue=quickbooks
time=2024-05-24T08:45:00.864 level=info message=oban:producer:start queue=quickbooks
time=2024-05-24T08:45:00.863 level=info message=oban:producer:start queue=quickbooks
time=2024-05-24T08:44:59.879 level=info message=oban:producer:stop dispatched_count=0 queue=quickbooks
time=2024-05-24T08:44:59.864 level=info message=oban:producer:stop dispatched_count=0 queue=quickbooks
time=2024-05-24T08:44:59.852 level=info message=oban:producer:start queue=quickbooks
time=2024-05-24T08:44:59.852 level=info message=oban:producer:start queue=quickbooks
time=2024-05-24T08:44:58.864 level=info message=oban:producer:stop dispatched_count=0 queue=quickbooks
time=2024-05-24T08:44:58.850 level=info message=oban:producer:stop dispatched_count=0 queue=quickbooks
...
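As an aside on the log format: lines in that shape can be produced by attaching a handler to Oban's producer telemetry events, roughly as in the sketch below. The event names and the dispatched_count key are inferred from the log lines themselves rather than from our code, so treat them as assumptions and adjust for the Oban version in use.

defmodule ObanProducerLogger do
  require Logger

  # Attach to the producer start/stop events (event names assumed from the log lines above).
  def attach do
    :telemetry.attach_many(
      "oban-producer-logger",
      [[:oban, :producer, :start], [:oban, :producer, :stop]],
      &__MODULE__.handle_event/4,
      nil
    )
  end

  def handle_event([:oban, :producer, :start], _measure, meta, _config) do
    # :queue is assumed to be present in the event metadata.
    Logger.info("oban:producer:start queue=#{meta.queue}")
  end

  def handle_event([:oban, :producer, :stop], measure, meta, _config) do
    # dispatched_count may live in the metadata or the measurements depending on version.
    count = meta[:dispatched_count] || measure[:dispatched_count]
    Logger.info("oban:producer:stop dispatched_count=#{count} queue=#{meta.queue}")
  end
end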
We would expect Oban to run those jobs.
Please let us know if you need any more info from us to help you debug this potential problem. Thank you so much!!
Truly surprised that is happening with Pro v1.4.6. I'll look into whether there's an edge case where the producer isn't updated somehow, as it seems to be a repeated issue for your app in particular.
@jordi-chacon Tracked it down! It'll be fixed in v1.4.9, to be released today.
@sorentwo Really great news!! Excited to give it a try. I'll let you know the outcome. Thank you so much for your work!
@sorentwo could a similar bug affect the OSS version? I am noticing a similar issue on 2.17.10 OSS where only 3 jobs were running even though the limit for the queue was 50. After restarting the server I'm not seeing the behavior again so far (but maybe it'll reappear as we work through the backlog of jobs).
could a similar bug affect the OSS version?
No, this was specifically a race that prevented proper tracking in the database. OSS doesn't do that at all.
If that happens again, please run check_queue/1 and share the output. It will show how many tracked jobs the producer had.
Okay it's happened again, now it's only using 39 of the 50 jobs available:
iex(felt@10.104.228.146)2> Oban.check_queue(queue: :bulk_reprocess)
%{
  name: Oban,
  node: "felt@10.104.228.146",
  running: [5919573, 5914058, 5921656, 5912884, 5915958, 5913434, 5923039,
    5922405, 5923052, 5923054, 5923031, 5922996, 5915958, 5918199, 5923048,
    5909406, 5923046, 5914058, 5912884, 5923045, 5915095, 5923051, 5922859,
    5923053, 5922999, 5923047, 5922839, 5916356, 5923050, 5916356, 5923032,
    5921264, 5918199, 5923042, 5921414, 5921444, 5913078, 5911265, 5921445,
    5923043, 5913078, 5920784, 5922869, 5915095, 5921586, 5913434, 5923044, ...],
  queue: "bulk_reprocess",
  started_at: ~U[2024-07-15 00:34:16.545960Z],
  limit: 50,
  uuid: "8b30b35f-e0e3-4131-bb6b-32c89aae12d7",
  updated_at: ~U[2024-07-15 02:44:39.954415Z],
  paused: false,
  refresh_interval: 30000
}
And the list of running jobs does contain 50 entries.
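A quick frequency count over that running list (sketched below; same queue as above) shows whether any ids are duplicated, which would line up with 50 tracked entries but only 39 distinct jobs executing:

# Count how often each id appears in the producer's running list and keep
# only the ids that show up more than once.
%{running: running} = Oban.check_queue(queue: :bulk_reprocess)

running
|> Enum.frequencies()
|> Enum.filter(fn {_id, count} -> count > 1 end)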
Also let me know if I should open this as a separate issue since it seems likely to be unrelated.
@axelson It seems like a down/exit message isn't handled properly, and it's leaking running ids because they aren't cleaned up. This has affected some others as well, but they haven't captured it in the moment.
I'll contact you to see if we can get to the bottom of what's happening here.
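For anyone following along, the failure class looks roughly like this generic producer sketch (deliberately not Oban's actual implementation): entries keyed by monitor ref have to be pruned on every :DOWN message, otherwise the ids leak and their slots are never released.

defmodule DemoProducer do
  use GenServer

  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, opts)

  @impl true
  def init(_opts), do: {:ok, %{running: %{}}}

  # Track a job process: remember its id under the monitor ref.
  @impl true
  def handle_call({:track, job_id, pid}, _from, state) do
    ref = Process.monitor(pid)
    {:reply, :ok, %{state | running: Map.put(state.running, ref, job_id)}}
  end

  # The crucial clause: if a :DOWN message is missed or matched incorrectly,
  # the job id stays in `running` forever, much like the leaked ids above.
  @impl true
  def handle_info({:DOWN, ref, :process, _pid, _reason}, state) do
    {_job_id, running} = Map.pop(state.running, ref)
    {:noreply, %{state | running: running}}
  end
end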
Environment
Oban 2.17.5
Oban Pro 1.3.5
PostgreSQL 14.9 on x86_64-pc-linux-gnu, compiled by Debian clang version 12.0.1, 64-bit
Elixir 1.15.4
Erlang/OTP 25.3.2.5
Current Behavior
Our production environment has an Oban queue called quickbooks with available jobs that aren't being run. The meta field on those two oban_producer rows has not changed for more than 45 minutes. There are only 6 oban_job rows in that queue, all of them scheduled 20-45 minutes ago, still waiting to be picked up. Searching for oban and quickbooks in our logs yields these results:
Expected Behavior
We would expect Oban to run those 6 jobs.
Please let us know if you need any more info from us to help you debug this potential problem. Thank you so much!!