preaction / Minion-Backend-mysql

MySQL backend for the 🐙 Minion job runner

acquiring locks fails - Deadlock found when trying to get lock #28

Closed · srchulo closed this 3 years ago

srchulo commented 4 years ago

I get this error frequently from my jobs when trying to acquire a lock:

DBD::MariaDB::st execute failed: Deadlock found when trying to get lock; try restarting transaction at /home/admin/local/lib/perl5/Minion/Backend/mysql.pm line 403.

I wonder if there's some way to prevent this, or to retry when it occurs?

https://dev.mysql.com/doc/refman/8.0/en/innodb-deadlocks.html
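
In the meantime I'm considering a retry wrapper along these lines (a hypothetical helper, not part of Minion or this backend; `$minion` and the lock name are placeholders, but `guard` is Minion's documented API):

```perl
use strict;
use warnings;
use Time::HiRes qw(sleep);

# Hypothetical helper: retry a block when MySQL reports a deadlock
# ("Deadlock found when trying to get lock; try restarting transaction").
sub retry_on_deadlock {
    my ( $code, $max_tries ) = @_;
    $max_tries //= 3;
    for my $try ( 1 .. $max_tries ) {
        my @result = eval { $code->() };
        return wantarray ? @result : $result[0] unless $@;
        die $@ if $@ !~ /Deadlock found when trying to get lock/;
        sleep 0.05 * $try;    # brief backoff before restarting the transaction
    }
    die "still deadlocking after $max_tries attempts\n";
}

# Usage (lock name is made up):
# my $guard = retry_on_deadlock( sub { $minion->guard( 'nightly-report', 3600 ) } );
```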

preaction commented 4 years ago

Hey, thanks for the report! This doesn't seem to be referring to the Minion lock, but to an underlying MySQL lock. The backend doesn't explicitly lock any tables, though, so I'm going to need some information about how this occurs.

The only thing I can think of right now is to make sure that everything that does work on multiple tables does that work in the same order, but that doesn't really make sense here. Maybe I need to be more aggressive about commits and/or autocommit?

srchulo commented 4 years ago

I don't think it's referring to a literal lock command; I think it has to do with some kind of deadlock related to the minion_lock function:

https://github.com/preaction/Minion-Backend-mysql/blob/8f50b6796a02b4df11ab2c31cee9e0f70649dcce/lib/Minion/Backend/mysql.pm#L1078

> What kind of thing are you doing when this happens? Could you make a minimum reproduction case?

I have jobs running that call guard to acquire a lock. These jobs are not calling for the same lock (the lock name is different). I will work on making a minimum reproduction case. I suspect that putting a sleep somewhere in the minion_lock function and then calling for two locks may help.
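
For context, the acquisition pattern looks roughly like this (a sketch; the lock name is invented, but `guard` itself is Minion's documented API):

```perl
# One lock per job type, held for the job's duration; guard() returns
# undef if the lock is already taken, otherwise a guard object.
my $guard = $minion->guard( "job-type-$name", 3600 );
if ($guard) {
    # ... do the work; the lock is released when $guard goes out of scope
}
```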

> How many workers do you have? How many enqueued jobs? How many pending jobs trying to obtain a lock on the same Minion resource?

I'm actually just using Minion locks for distributed locking; my jobs run on AWS Batch. However, I may have 10-15 jobs running at once, each trying to acquire a lock, but with different names (the same type of job only runs once, so no simultaneous jobs ask for the same lock).

> Can your environment use DBD::mysql? If so, does the same thing happen?

I can use DBD::mysql. I'm going to switch to it and see if the problem persists (I only see this happen a few times a day, so I may not know whether it helps until I have my repro case).

One thing I should note is that for several months before this I was using the exact same code with Minion::Backend::Pg and didn't see this problem (I'm not sure how similar the code is between that and Minion::Backend::mysql, and obviously MySQL and Pg are not the same!) :)

> Would you be able to run your workers under a module like Carp::Always or Devel::Confess so it creates a full stack trace from the error?

I will see if I can get my jobs to run using one of those modules.
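
One low-friction way to do that, assuming the workers are started from a shell, is PERL5OPT (standard Perl; the worker command below is only an example):

```perl
# From the shell, without modifying the worker code:
#
#   PERL5OPT=-MCarp::Always ./myapp.pl minion worker
#
# Or enable it inside the worker script itself:
use Carp::Always;    # every warn/die now carries a full stack trace
```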

srchulo commented 4 years ago

Can now confirm that I get the same error using DBD::mysql:

DBD::mysql::st execute failed: Deadlock found when trying to get lock; try restarting transaction at /home/admin/local/lib/perl5/Minion/Backend/mysql.pm line 403.

Still working on the repro case.

preaction commented 4 years ago

Thanks! Also, which version/fork of mysqld are you running? I can add it to the Travis tests to try to reproduce.

srchulo commented 4 years ago

It's 5.7; however, it's Amazon Aurora's version, which could make reproduction on Travis hard (and which could technically be causing the issue, although I'd think not).

The exact version is 5.7.mysql_aurora.2.07.2

preaction commented 3 years ago

We started hitting this same problem at work, so I was able to get the InnoDB locking information. It appears to be the job.id = @dequeued_job_id := job.id hack that is causing the deadlock: it requires an exclusive lock on the index for the job.id column, which cannot be acquired while a shared lock is in place (which appeared to be created by the NOT EXISTS (...) subquery).
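
For anyone following along, the LATEST DETECTED DEADLOCK section of the InnoDB monitor output is where this shows up. A quick way to pull it through DBI, assuming $dbh is a connected handle:

```perl
# SHOW ENGINE INNODB STATUS returns a single row whose Status column
# holds the monitor text, including the LATEST DETECTED DEADLOCK section.
my $row = $dbh->selectrow_hashref('SHOW ENGINE INNODB STATUS');
print $row->{Status};
```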

Doing a proper SELECT ... FOR UPDATE should fix this by removing the need for the exclusive lock, which should prevent any deadlocks.
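
A sketch of the shape of that fix (illustrative only, not the actual patch; the column names are guesses at the minion_jobs schema, and $dbh, $queue, and $worker_id are assumed to be in scope):

```perl
# Lock the candidate row up front with SELECT ... FOR UPDATE, then
# update it, instead of capturing the id mid-UPDATE via the
# @dequeued_job_id variable hack.
$dbh->begin_work;
my ($id) = $dbh->selectrow_array( <<'SQL', undef, $queue );
SELECT id FROM minion_jobs
WHERE state = 'inactive' AND queue = ? AND delayed <= NOW()
ORDER BY priority DESC, id ASC
LIMIT 1
FOR UPDATE
SQL
if ( defined $id ) {
    $dbh->do(
        q{UPDATE minion_jobs SET state = 'active', started = NOW(), worker = ? WHERE id = ?},
        undef, $worker_id, $id,
    );
}
$dbh->commit;
```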

srchulo commented 3 years ago

Awesome!! Thanks for fixing/figuring this out!