@taylorotwell Any chance we could get the docs updated (and/or force only a single DB worker) to avoid these issues? I've had discussions with a lot of users who have been using the DB driver with these issues.
I've come across this issue in 5.4 but sadly I don't have the luxury of moving away from the DB driver.
I'm getting a deadlock on the delete query but I don't understand why. By logging to a file at the start and end of job processing, I can see that only one queue worker is picking up the job. Using show engine innodb status
I can see that there is a lock on the row (as expected) and not on the table (also as expected), so I'm at a loss as to why the deadlock is occurring, as there doesn't appear to be anything conflicting with the delete statement.
When running queue:work it's been very easy to reproduce an issue where the deadlock occurs and then the queue hangs on 3 of the 4 workers. The other worker continues processing the queue, but when it is cancelled, 1 of the 3 others becomes active and carries on processing. Likewise, when you cancel the new active worker, another inactive worker becomes active.
I've been able to work around this issue by adding retries to the deleteReserved() function:
public function deleteReserved($queue, $id)
{
    $this->database->transaction(function () use ($queue, $id) {
        if ($this->database->table($this->table)->lockForUpdate()->find($id)) {
            $this->database->table($this->table)->where('id', $id)->delete();
        }
    }, 5);
}
I was able to process 200k jobs across 6 workers in 1 hour without any problems, whereas previously I'd receive lock issues with 4 workers and that would make the queues hang after only a couple of seconds.
@taylorotwell do you think it's worth me putting this into a PR?
I'm having the same issue.
PDOException: SQLSTATE[40001]: Serialization failure: 1213 Deadlock found when trying to get lock; try restarting transaction in /home/forge/default/releases/20171113162416/vendor/doctrine/dbal/lib/Doctrine/DBAL/Driver/PDOConnection.php:77
I do run on a cluster with 3 MySQL servers, if that has anything to do with it.
Dan
@daniel-farina does the workaround I posted solve your problem?
@micmania1 I tried your workaround and it seems to work, thanks! Edit: not entirely, see below...
@micmania1 Did you make the changes in the vendor folder?
@micmania1 if that works I would love a pull request to fix deadlock issues. I have to process 23K jobs.
Hi guys, I got another deadlock on the system recently even with this new fix.
After running innotop to inspect the deadlock I got this:
_____________________________________________________________________________________________ Deadlock Transactions _____________________________________________________________________________________________
ID Timestring User Host Victim Time Undo LStrcts Query Text
4 2017-12-09 09:54:02 keychest localhost Yes 00:00 0 3 select * from `jobs` where `queue` = ? and ((`reserved_at` is null and `available_at` <= ?) or (`reserved_at` <= ?)) order by `id` as
5 2017-12-09 09:54:02 keychest localhost No 00:00 1 3 delete from `jobs` where `id` = ?
______________________________________ Deadlock Locks _______________________________________
ID Waiting Mode DB Table Index Special Ins Intent
4 1 X keychest jobs PRIMARY rec but not gap 0
5 0 X keychest jobs PRIMARY rec but not gap 0
5 1 X keychest jobs jobs_queue_reserved_at_index rec but not gap 0
So apparently the delete transaction got the lock on the PRIMARY index and later tried to acquire another index, jobs_queue_reserved_at_index, maybe due to lockForUpdate() on the previous query.
I am wondering why the select itself is needed there. So I changed this:
if ($this->database->table($this->table)->lockForUpdate()->find($id)) {
    $this->database->table($this->table)->where('id', $id)->delete();
}
To this:
$this->database->table($this->table)->where('id', $id)->delete();
And testing it on the dev system.
UPDATE: I am getting different kind of deadlock now:
______________________________________________________________________________________________________ Deadlock Transactions ______________________________________________________________________________________________________
ID Timestring User Host Victim Time Undo LStrcts Query Text
8767 2017-12-11 18:08:03 keychest localhost No 00:00 1 5 update `jobs` set `reserved_at` = ?, `attempts` = ? where `id` = ?
8768 2017-12-11 18:08:03 keychest localhost Yes 00:00 0 4 select * from `jobs` where `queue` = ? and ((`reserved_at` is null and `available_at` <= ?) or (`reserved_at` <= ?)) order by `id` asc limit 1 for up
_______________________________________ Deadlock Locks ________________________________________
ID Waiting Mode DB Table Index Special Ins Intent
8767 0 X keychest jobs PRIMARY 0
8767 1 X keychest jobs jobs_queue_reserved_at_index gap before rec 1
8768 1 X keychest jobs PRIMARY rec but not gap 0
So two concurrent job selections are locked. One tries to update the selected job while the other one blocks it.
DatabaseQueue.php uses manual transaction management (beginTransaction(), commit()). I refactored DatabaseQueue::pop so it uses closure-based transactions with retry: $this->database->transaction(closure). That does a correct rollback() when an exception occurs (e.g. the database detects a deadlock).
Deadlock still happens obviously but workers are able to recover.
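Roughly, the closure-based pop looks like this (a sketch, not the exact patch; getNextAvailableJob() and marshalJob() are the existing DatabaseQueue helpers, and the second argument to transaction() is the number of attempts):

// Sketch of a closure-based DatabaseQueue::pop with retries.
// transaction() rolls back when the closure throws and, for detected
// deadlocks, retries the closure up to the given number of attempts.
public function pop($queue = null)
{
    $queue = $this->getQueue($queue);

    return $this->database->transaction(function () use ($queue) {
        if ($job = $this->getNextAvailableJob($queue)) {
            return $this->marshalJob($queue, $job);
        }
    }, 5);
}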
For those following this issue; @ph4r05 is a fantastic provider of debugging information. This is the first time someone has checked where the deadlock actually occurs. In his case it's a gap lock on the jobs_queue_reserved_at_index index. This index was removed in the default 5.5 jobs table in https://github.com/laravel/framework/commit/acd933fd6b793b0c79a2ba4eb462892cd6c460e3
So, if anyone still has deadlock issues, check your show engine innodb status output, check whether it is a gap lock on the index, and if so, try removing it.
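If you do go down that road, a migration roughly along these lines is one way to drop it (the index name is assumed to match the pre-5.5 default jobs migration):

// Sketch of a migration that drops the composite index blamed for the gap locks.
use Illuminate\Database\Migrations\Migration;
use Illuminate\Database\Schema\Blueprint;
use Illuminate\Support\Facades\Schema;

class DropReservedAtIndexFromJobsTable extends Migration
{
    public function up()
    {
        Schema::table('jobs', function (Blueprint $table) {
            // Index name as created by the older default jobs migration.
            $table->dropIndex('jobs_queue_reserved_at_index');
        });
    }

    public function down()
    {
        Schema::table('jobs', function (Blueprint $table) {
            $table->index(['queue', 'reserved_at'], 'jobs_queue_reserved_at_index');
        });
    }
}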
I should probably say something like "make sure you know your database before randomly removing indexes" or "take backups before drinking" or something. This is an internet forum, I'm talking about changes in important system tables. You know the deal. ;)
Thanks to @sisve for the discussion. I didn't notice the index had been removed, as I started the project on an older Laravel and the database migrations did not remove the index during the Laravel upgrade (as far as I am aware) - this may be the same case for others in this discussion. After removing the index there are no more deadlocks.
@taylorotwell @20TRIES on a discussion about lock contention in DB workers - for research and testing purposes I reimplemented the pessimistic locking of the database queue as optimistic locking. Take a look at my commit if interested: https://github.com/ph4r05/framework/commit/60fcddd9057af5f534fe19eed6336d51bf2c9257
There are no explicit transactions involved and no DB-level locking. DB workers compete for jobs on their version columns, using the affected rows information from the update call.
I don't think this is production-ready code (don't use it without proper testing) and some benchmarking of pessimistic vs optimistic locking would be nice (I can get to that in a few days).
I can submit PR later (once we have benchmarks) if you deem it interesting.
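For those curious how the optimistic approach works in principle, here is a minimal sketch (not ph4r05's actual implementation; it assumes an extra integer version column on the jobs table and a hypothetical claimJob() helper):

// Sketch of an optimistic claim: no SELECT ... FOR UPDATE, no explicit
// transaction. Whoever bumps the version first wins the job.
protected function claimJob($job)
{
    $affected = $this->database->table($this->table)
        ->where('id', $job->id)
        ->where('version', $job->version)       // row unchanged since we read it
        ->update([
            'reserved_at' => time(),
            'attempts'    => $job->attempts + 1,
            'version'     => $job->version + 1, // bump so competing workers fail
        ]);

    // 1 affected row => this worker claimed the job; 0 => another worker did.
    return $affected === 1;
}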
This is only a theory, but this is what I think is causing the deadlock on delete.
Step  Queue Worker 1                 Queue Worker 2
1     Select (lock)
2     Update
3     Process Job                    Select (lock)
4     Delete Job (locked - fail)     Update
5     Delete Job (success)           Process Job
6                                    Delete Job (success - 0 rows affected)
This would mean that a job could get processed twice. I am running this with the assumption of a transaction retry (ie. the workaround I provided).
Whilst the delete retry solved my problem locally, when we began testing in what will be our prod environment we ran into further issues. We're using a galera cluster with a master-master db setup. Locking doesn't work the same way and the default queue simply can't be run concurrently with that setup.
I've solved this in our project by separating the queue into multiple streams where each worker is responsible for a single stream. When a job is saved it will pick one of the available streams at random. You could potentially do this by using different queues too (one per environment) but I never tried that. This introduces other potential issues (mostly operational), but essentially avoids locking/conflicts completely, which was our biggest concern, and provided good speed and some scalability.
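The author didn't use named queues for this, but as a rough illustration of the same idea with queue names (all names here are made up):

// Sketch: spread jobs over N "streams", each served by exactly one worker,
// e.g. `php artisan queue:work --queue=stream-2` for the second worker.
function dispatchToRandomStream($job, int $streamCount = 4)
{
    $stream = 'stream-' . random_int(1, $streamCount);

    dispatch($job)->onQueue($stream);
}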
If you do have the luxury of moving to a different queue system, you should do that.
There was a bug in the database queue transaction handling (no rollback) which caused the freezing of the workers. Take a look at pull request #22394 (and maybe check whether you have the jobs_queue_reserved_at_index index).
The second worker should not pick up the job as it is being reserved. If a job is being processed twice, consider tuning the retryAfter parameter.
I am currently using the queue with PR merged and it works perfectly for me.
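For reference, the retryAfter value for the database driver comes from retry_after in config/queue.php; it should stay comfortably above the runtime of your slowest job so a reserved job is not handed out a second time while it is still running:

// config/queue.php (excerpt)
'connections' => [
    'database' => [
        'driver' => 'database',
        'table' => 'jobs',
        'queue' => 'default',
        // Seconds before a reserved job becomes visible to workers again.
        'retry_after' => 90,
    ],
],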
@micmania1 Use show engine innodb status to debug your deadlock / lock timeout issues. Your theory does not include the scope of the transactions, and that is important information. There is a SELECT+UPDATE and a SELECT+DELETE that have their own transactions. Activities outside transactions are probably not causing locking issues. (There are technically implicit transactions with auto-commit, so locks held by others are still respected, but the transaction's own locks are released after every statement.)
@sisve @micmania1 I have a new example of deadlocks on the DatabaseQueue.
With #22394 and #22433 merged the DatabaseQueue workers will recover from the deadlock (no freeze) but some jobs will get processed more than once. Here is why and how:
I created a benchmark project for Laravel queueing for testing of various job processing modifications. The project is fresh Laravel 5.5 (created 1 day ago with composer). I generate a synthetic load and measure correctness and performance. The scenario:
- Workers started with --queue=high,default,low --sleep=1 --tries=3
- Jobs are inserted to the default queue with available_at set to the future.
- UPDATE jobs SET available_at=0; is then issued, which causes job processing by workers.
- innodb_deadlock_detect=1, MySQL 5.7.20, InnoDB
CREATE TABLE `jobs` (
`id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`queue` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
`payload` longtext COLLATE utf8mb4_unicode_ci NOT NULL,
`attempts` tinyint(3) unsigned NOT NULL,
`reserved_at` int(10) unsigned DEFAULT NULL,
`available_at` int(10) unsigned NOT NULL,
`created_at` int(10) unsigned NOT NULL,
PRIMARY KEY (`id`),
KEY `jobs_queue_index` (`queue`)
) ENGINE=InnoDB AUTO_INCREMENT=353855 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci
Deadlock:
------------------------
LATEST DETECTED DEADLOCK
------------------------
2017-12-15 02:57:39 0x7efd583f3700
*** (1) TRANSACTION:
TRANSACTION 3183282, ACTIVE 0 sec fetching rows
mysql tables in use 1, locked 1
LOCK WAIT 4 lock struct(s), heap size 1136, 8 row lock(s)
MySQL thread id 1426, OS thread handle 139626575173376, query id 15379859 localhost 127.0.0.1 laravel Sending data
select * from `jobs` where `queue` = ? and ((`reserved_at` is null and `available_at` <= ?) or (`reserved_at` <= ?)) order by `id` asc limit 1 for update
*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 27 page no 3630 n bits 104 index PRIMARY of table `laravel`.`jobs` trx id 3183282 lock_mode X locks rec but not gap waiting
Record lock, heap no 5 PHYSICAL RECORD: n_fields 9; compact format; info bits 32
0: len 8; hex 00000000000575bb; asc u ;;
*** (2) TRANSACTION:
TRANSACTION 3183286, ACTIVE 0 sec updating or deleting
mysql tables in use 1, locked 1
3 lock struct(s), heap size 1136, 2 row lock(s), undo log entries 1
MySQL thread id 1437, OS thread handle 139626572363520, query id 15379876 localhost 127.0.0.1 laravel updating
delete from `jobs` where `id` = ?
*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 27 page no 3630 n bits 104 index PRIMARY of table `laravel`.`jobs` trx id 3183286 lock_mode X locks rec but not gap
Record lock, heap no 5 PHYSICAL RECORD: n_fields 9; compact format; info bits 32
0: len 8; hex 00000000000575bb; asc u ;;
*** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 27 page no 4 n bits 824 index jobs_queue_index of table `laravel`.`jobs` trx id 3183286 lock_mode X locks rec but not gap waiting
Record lock, heap no 719 PHYSICAL RECORD: n_fields 2; compact format; info bits 0
0: len 7; hex 64656661756c74; asc default;;
1: len 8; hex 00000000000575bb; asc u ;;
*** WE ROLL BACK TRANSACTION (2)
Both transactions are locking the same record. The problem is that the delete transaction is rolled back. The exception propagates to the worker, which continues to the next task. The culprit is an index, this time jobs_queue_index I guess.
The job was already processed but not removed from the database so it will be picked up after it expires. This can happen multiple times.
It would be preferable for transaction 1 (the select) to be rolled back instead of the delete, but I am not aware of a way to configure that.
The funny thing is the following hack solves the problem:
public function deleteReserved($queue, $id)
{
    $this->database->transaction(function () use ($queue, $id) {
        $this->database->table($this->table)->where('id', $id)->delete();
    }, 5);
}
The worker really insists on deleting that job. With the magic constant 5 (retries) I had no deadlocks anymore. Some kind of back-off mechanism could be implemented to try to remove the job several times and eventually fail - this could significantly decrease the error rate, but it is not a clean approach. (200 jobs per second).
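A back-off along those lines could look roughly like this (purely illustrative; deleteWithBackoff() is not an existing framework method):

// Illustrative back-off around the delete; sleeps a little longer on
// every failed attempt before finally giving up.
protected function deleteWithBackoff($id, $maxAttempts = 5)
{
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        try {
            $this->database->table($this->table)->where('id', $id)->delete();

            return;
        } catch (\Illuminate\Database\QueryException $e) {
            if ($attempt === $maxAttempts) {
                throw $e; // give up and let the worker handle the failure
            }

            usleep($attempt * 50000); // 50ms, 100ms, 150ms, ...
        }
    }
}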
Another way out is adding a new flag field delete_mark - a soft delete indicator / 2-stage delete:
public function deleteReserved($queue, $id)
{
    $this->database->transaction(function () use ($queue, $id) {
        $this->database->table($this->table)->where('id', $id)->update(['delete_mark' => 1]);
    });

    $this->database->table($this->table)->where('id', $id)->delete();
}
If a worker picks up a job with delete_mark > 0 it deletes/ignores it, so the job will be processed only once. The update will succeed as it does not modify the index (the delete modifies the index). I will test this approach more and if it works correctly I will submit a PR (170 jobs per second).
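The matching change on the pop side would be something like this sketch (delete_mark is the assumed column; isAvailable() and isReservedButExpired() are the existing availability checks in the 5.5 DatabaseQueue):

// Sketch: job selection that ignores rows already soft-deleted.
$job = $this->database->table($this->table)
    ->lockForUpdate()
    ->where('queue', $this->getQueue($queue))
    ->where('delete_mark', 0)   // skip jobs already marked for deletion
    ->where(function ($query) {
        $this->isAvailable($query);
        $this->isReservedButExpired($query);
    })
    ->orderBy('id', 'asc')
    ->first();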
Another solution is optimistic locking, which I am testing right now - no deadlocks. I will publish a blog post with benchmarks soon, but the performance is not very shiny with the first implementation (slower than pessimistic, 100 jobs per second).
Another solution is removing the index altogether (slow, 90 jobs per second).
If innodb_deadlock_detect=0 then the only options are a) optimistic locking, b) removing the index, it seems.
Just for comparison, a beanstalkd queue with the same parameters gives 1080 jobs per second.
I'm looking at the innodb status you've provided, but I do not see the actual culprit.
Your status does not tell us who has already acquired a conflicting lock on jobs_queue_index.
These locks are not gap locks, so it's not the same problem as with the jobs_queue_reserved_at_index. Do you perhaps have an open connection from your desktop where you've accidentally started a long-running transaction and caused some locks?
@sisve thanks for your time! If you want to take a look at this complete report: https://github.com/ph4r05/laravel-queueing-benchmark/blob/master/.github/deadlock07.txt
I think the index is held by select queries. I've disabled deadlock detection and increased timeouts so it's easier to debug:
set global innodb_deadlock_detect=0;
set global innodb_lock_wait_timeout=1000;
set innodb_lock_wait_timeout=1000;
We have two different transactions going on.
Fetch a new job (DatabaseQueue::pop):
SELECT * FROM jobs WHERE queue = ? AND (...) ORDER BY id ASC LIMIT 1 FOR UPDATE
UPDATE jobs SET reserved_at = ? WHERE id = ?
Delete a job after it is processed (DatabaseQueue::deleteReserved):
SELECT * FROM jobs WHERE id = ? FOR UPDATE
DELETE FROM jobs WHERE id = ?
I'll be guessing more than usual from here on.
It seems like the select that fetches the jobs issues locks on both the jobs_queue_index and the primary key (guessing that it does it in that order), while the select that exists in deleteReserved only locks the primary key. This allows the fetch-select to grab half (the index) of the locks it requires and block waiting for the primary key lock, while the deletion transaction has grabbed the primary key lock and is now blocking on the index lock.
So, while I am still guessing wildly, can we trick the database into issuing index locks? Can we modify the locking in deleteReserved so it's based on SELECT * FROM jobs WHERE queue = ? AND id = ? FOR UPDATE, which would hopefully issue the same locks as the fetch-select?
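In code, that suggestion would amount to something like this sketch of deleteReserved (filtering on queue as well as id so the locking select touches the same index as the fetch-select):

// Sketch: acquire the row lock via (queue, id) so the lock order matches
// the fetch-select before deleting by primary key.
public function deleteReserved($queue, $id)
{
    $this->database->transaction(function () use ($queue, $id) {
        $locked = $this->database->table($this->table)
            ->lockForUpdate()
            ->where('queue', $this->getQueue($queue))
            ->where('id', $id)
            ->first();

        if ($locked) {
            $this->database->table($this->table)->where('id', $id)->delete();
        }
    });
}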
All this is extremely dependent on the table structure. Any other indexes would probably also add to the chaos and madness. Mmm madness...
I ran some benchmarks of the Laravel queueing methods (redis, beanstalkd, database) on an Amazon c4.large:
I also implemented an optimistic locking strategy for the database driver (no deadlocks), blog post + repo:
Hey guys,
I've managed to run 500,000 jobs with 15 workers without any deadlocks using the PR (https://github.com/laravel/framework/pull/31287). Thoughts please :)
I'm experiencing similar deadlock behavior with Laravel 8 + MySQL 8 with job batching. From what I can tell, it seems the DatabaseBatchRepository could benefit from the same fix applied in @themsaid's PR (#31287), due to job_batches being so transaction+lock heavy in its interactions. ~Should I open a new issue for this?~ I added issue 36478 for my situation.
Laravel 8 with MySQL 5.7. 100 jobs with 8 workers and I still run into the same error. However, I noticed that the job/worker ratio is crucial here. When I had 10 jobs on average with 8 workers, I had more deadlock errors. Thought it might help.
Hi everyone, in my project I have two servers with Laravel 10.48.17, MariaDB, and supervisor: the first server has around 32 cores and 32 workers, the second server has around 8 cores and 8 workers.
The problem is that deadlocks on deleting jobs happen only on the second server.
The only difference between their queue configurations is the number of workers.
I already experimented with the number of workers, but that didn't work; maybe I should test with only 1 worker, or upgrade the database CPU and memory.
I use QUEUE_DRIVER=database. When I run more than 10 workers, I get the following error: