laravel/framework

The Laravel Framework.
https://laravel.com
MIT License

Deadlocks on high-volume queues with database job batching #36478

Closed · murphatron closed this issue 3 years ago

murphatron commented 3 years ago

Description:

Much like issue #7046, we're encountering deadlocks on job batches with roughly 30 jobs per batch. The linked issue describes a similar case in queues using the database queue driver, which was solved for MySQL 8 in PR #31287. I'm unsure whether that is a suitable fix here, given that the framework is selecting a specific batch row for update rather than popping the next available job. The deadlock originates in this method of the DatabaseBatchRepository. Since the usual advice is to take advantage of the transaction retry argument to handle deadlocks, that may be a suitable solution here.
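For illustration, here is a minimal sketch of what that could look like, assuming the counter update is wrapped in Laravel's DB::transaction() helper with its retry argument. $batchId is a placeholder and the queries are a simplification, not the framework's actual repository code:

```php
use Illuminate\Support\Facades\DB;

// Sketch only: the second argument to DB::transaction() re-runs the closure
// when the transaction fails because of a deadlock.
DB::transaction(function () use ($batchId) {
    // Lock the single job_batches row for this batch.
    DB::table('job_batches')
        ->where('id', $batchId)
        ->lockForUpdate()
        ->first();

    // Decrement the pending counter while the row lock is held.
    DB::table('job_batches')
        ->where('id', $batchId)
        ->decrement('pending_jobs');
}, 5); // retry up to 5 times on deadlock instead of failing the job
```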

Steps To Reproduce:

Simply dispatch a batch of jobs to a queue with a moderate number of workers. The higher the job count, the higher the probability of deadlocks. My environment has about 40 workers, though that many is likely unnecessary to reproduce the issue.
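As a rough, hypothetical reproduction (ExampleJob stands in for any queued job class using the Illuminate\Bus\Batchable trait):

```php
use App\Jobs\ExampleJob;
use Illuminate\Support\Facades\Bus;

// Build a large batch; the bigger the batch, the more concurrent
// updates hit the single job_batches row as jobs complete.
$jobs = [];

for ($i = 0; $i < 1000; $i++) {
    $jobs[] = new ExampleJob();
}

Bus::batch($jobs)->dispatch();
```

Then run a few dozen workers against the queue (for example, many parallel `php artisan queue:work` processes) and watch for deadlocks as the jobs complete.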

TheLevti commented 3 years ago

Maybe next time it happens, execute SHOW ENGINE INNODB STATUS and give us a dump of the deadlock section (please make sure to remove any sensitive data before posting it here). This could help debug the deadlock situation.

driesvints commented 3 years ago

@murphatron can you provide the info @TheLevti requested?

murphatron commented 3 years ago

Thanks for the quick reply. The latest captured deadlock unfortunately isn't related to this. I'll follow up when it occurs again.

driesvints commented 3 years ago

@murphatron feel free to reopen an issue with the info when it occurs again 👍

gm-lunatix commented 3 years ago

Not sure a comment on a closed ticket will be seen, but... We're using PostgreSQL 11 and Horizon with Redis for our queues. Using batched jobs still forces the database overhead on us, since batch progress is tracked in a database table (job_batches).

If I have, let's say, 1000 jobs in the batch and those are processed by 4 supervisors with 100 workers each on two servers, then 2 x 4 x 100 = 800 concurrent jobs are trying to update the relevant fields (total_jobs, pending_jobs or failed_jobs). Since every update starts with a "SELECT ... FOR UPDATE", the concurrency causes a long row-level lock for most of the jobs (https://www.postgresql.org/docs/11/explicit-locking.html#LOCKING-ROWS). In our case, ~800 concurrent jobs create severe lock contention on the database server, and the problem snowballs back to the queues: jobs normally processed in under 1 second now take 3-4+ seconds, creating a huge backlog and stressing the database server and the queue worker servers even further. Here's a snapshot of our database server's top SQL queries showing the severity of the tuple (batch id) and transaction (update incrementing a field) locks.