laravel / ideas

Issues board used for Laravel internals discussions.

Laravel Job Batching's atomic locks limit scaling #2591

Closed bkuhl closed 3 years ago

bkuhl commented 3 years ago

Problem

DatabaseBatchRepository::decrementPendingJobs() and a few other methods in this repository select a record from the database, lock the row, do some work, and then write the updated values back. The more workers there are processing a single batch, the more these row locks block the queue from doing work and tie up database resources.
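
Roughly, each finished job triggers a read-lock-modify-write cycle like the following (a simplified sketch of the repository's behaviour, not the exact framework code; $batchId and $jobId stand in for the values passed to the method):

use Illuminate\Support\Facades\DB;

// Every worker that finishes a job in the batch runs something like this,
// so concurrent workers queue up behind the row lock.
DB::transaction(function () use ($batchId, $jobId) {
    $batch = DB::table('job_batches')
        ->where('id', $batchId)
        ->lockForUpdate()   // issues SELECT ... FOR UPDATE
        ->first();

    DB::table('job_batches')->where('id', $batchId)->update([
        'pending_jobs' => $batch->pending_jobs - 1,
        'failed_job_ids' => json_encode(array_values(array_diff(
            json_decode($batch->failed_job_ids, true), [$jobId]
        ))),
    ]);
});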

Feature Request

Ability to configure queue batches to not record the IDs of failed jobs and instead keep only simple counts, so that a single UPDATE query is sufficient when updating pending_jobs, failed_jobs, etc. on the batch's database record, as sketched below.
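
For example, a counter-only path could look something like this (a hypothetical sketch of the requested behaviour, not something the framework offers today; $batchId stands in for the batch's UUID):

use Illuminate\Support\Facades\DB;

// No SELECT ... FOR UPDATE round trip: the arithmetic happens inside a
// single atomic UPDATE statement, so the row lock is held only briefly.
DB::table('job_batches')
    ->where('id', $batchId)
    ->update([
        'pending_jobs' => DB::raw('pending_jobs - 1'),
        'failed_jobs'  => DB::raw('failed_jobs + 1'),
    ]);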

My Experience

Recently I began digging into high CPU usage on my application's database. We're deployed using Docker and operate 6 Horizon containers, which translates to 48 processes operating on different queues throughout the app, with a cap of 42 processes on any single queue. After enabling AWS RDS Performance Insights, I was able to get some insight into what's taking so long. The culprit appears to be this simple query:

SELECT * FROM `job_batches` WHERE `id` = "91b0f7a8-0693-4cb4-8647-e7509318fbbc" LIMIT 1 FOR UPDATE

In this graph, the blue area shows that this query has the largest impact.

[Screenshot: AWS RDS Performance Insights graph, 2021-04-28]

Using Laravel Horizon, we can watch the Jobs Per Minute metric dwindle in these scenarios as the locks block workers from performing their tasks, while the number of jobs in the queue stays in the tens of thousands.

To resolve this issue, we stopped using Laravel batches.

paras-malhotra commented 3 years ago

Interesting, it does seem that decrementPendingJobs really has no need to update failed_job_ids. So a performance improvement definitely looks possible.

bkuhl commented 3 years ago

It does need to update it, because it effectively removes the pending job's ID from the failed list if it was there. The underlying performance limitation comes from maintaining the list of job IDs in the batch data.

themsaid commented 3 years ago

Why do you think the performance is impacted by storing the failed jobs IDs?

bkuhl commented 3 years ago

If the failed job IDs weren't stored, the database record could be updated with a simple incrementing expression such as x = (x + 7), which is a much faster update than pulling the value out, performing a diff on the array to ensure the failed job IDs are unique, and writing the new value back to the database. When there are a lot of workers trying to do that at the same time, it creates a backup. I'm not sure whether MySQL's JSON_ARRAY_APPEND would be any more performant.
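
For reference, a single-statement variant of the failed-job path could look roughly like this (an untested sketch; it assumes MySQL 5.7+ for JSON_ARRAY_APPEND and, unlike the current read-modify-write, does not deduplicate IDs):

use Illuminate\Support\Facades\DB;

// Append the failed job ID inside the UPDATE itself instead of reading the
// JSON column out, diffing it in PHP, and writing it back under a row lock.
DB::update(
    'UPDATE job_batches
        SET failed_jobs = failed_jobs + 1,
            failed_job_ids = JSON_ARRAY_APPEND(failed_job_ids, \'$\', ?)
      WHERE id = ?',
    [$jobId, $batchId]
);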

themsaid commented 3 years ago

We need the lock in all cases to ensure the operation is atomic.

bkuhl commented 3 years ago

Yep, and a simple UPDATE ... SET pending_jobs = (pending_jobs - 1), failed_jobs = (failed_jobs -|+ X) WHERE id = ... would suffice, if we weren't trying to track the ID of each failed job.

bkuhl commented 3 years ago

Performing updates using a primary key is very common, so I'm operating under the assumption that isn't the issue. SELECT ... FOR UPDATE is a bit less common, and bypassing that functionality in my case resolved the CPU issues. So I'm assuming that technique is the performance bottleneck.

themsaid commented 3 years ago

You can override the Illuminate\Bus\BatchRepository in your app and disable updating the failed job IDs list if you want. We can't simply remove this functionality, and we haven't received any similar complaints about it, even from high-load applications.
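
For anyone exploring that route, swapping in a custom repository might look roughly like this (a sketch assuming Laravel 8's container wiring; CountOnlyBatchRepository is a hypothetical class you would write yourself to skip the failed_job_ids bookkeeping):

use App\Bus\CountOnlyBatchRepository; // hypothetical custom implementation
use Illuminate\Bus\BatchFactory;
use Illuminate\Bus\BatchRepository;
use Illuminate\Support\ServiceProvider;

class AppServiceProvider extends ServiceProvider
{
    public function register()
    {
        // Replace the default DatabaseBatchRepository binding with the custom one.
        $this->app->singleton(BatchRepository::class, function ($app) {
            return new CountOnlyBatchRepository(
                $app->make(BatchFactory::class),
                $app['db']->connection(config('queue.batching.database')),
                config('queue.batching.table', 'job_batches')
            );
        });
    }
}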

fedeisas commented 3 years ago

@bkuhl Example implementation using JSON_ARRAY_APPEND and JSON_REMOVE.

https://gist.github.com/fedeisas/a81a772b2f8fc63d52f8dde701fbf102