quartz-scheduler / quartz

Code for Quartz Scheduler
http://www.quartz-scheduler.org
Apache License 2.0

Deadlock of TRIGGER table #1244

Open MonarchNing opened 3 weeks ago

MonarchNing commented 3 weeks ago

We have the Quartz scheduler running in production in cluster mode and noticed that the row-level lock acquired on the TRIGGERS table is causing a deadlock.

Setup

DB2 v11.1, Java 8, Quartz 2.3.2

Exception:

[10/26/24 23:10:05:946 HKT] 0000010d SystemOut O 2024-10-26 23:10:05 ERROR [org.quartz.core.ErrorLogger]: An error occurred while scanning for the next triggers to fire.
org.quartz.JobPersistenceException: Couldn't acquire next trigger: The current transaction has been rolled back because of a deadlock or timeout. Reason code "2".. SQLCODE=-911, SQLSTATE=40001, DRIVER=4.25.1301 [See nested exception: com.ibm.db2.jcc.am.SqlTransactionRollbackException: The current transaction has been rolled back because of a deadlock or timeout. Reason code "2".. SQLCODE=-911, SQLSTATE=40001, DRIVER=4.25.1301]
    at org.quartz.impl.jdbcjobstore.JobStoreSupport.acquireNextTrigger(JobStoreSupport.java:2923)
    at org.quartz.impl.jdbcjobstore.JobStoreSupport$41.execute(JobStoreSupport.java:2805)
    at org.quartz.impl.jdbcjobstore.JobStoreSupport$41.execute(JobStoreSupport.java:2803)
    at org.quartz.impl.jdbcjobstore.JobStoreSupport.executeInNonManagedTXLock(JobStoreSupport.java:3864)
    at org.quartz.impl.jdbcjobstore.JobStoreSupport.acquireNextTriggers(JobStoreSupport.java:2802)
    at org.quartz.core.QuartzSchedulerThread.run(QuartzSchedulerThread.java:287)
    at hk.com.mtrc.etms.batch.scheduler.DelegatingWork.run(WorkManagerThreadExecutor.java:83)
    at com.ibm.ws.asynchbeans.J2EEContext$RunProxy.run(J2EEContext.java:277)
    at java.security.AccessController.doPrivileged(AccessController.java:716)
    at javax.security.auth.Subject.doAs(Subject.java:490)
    at com.ibm.websphere.security.auth.WSSubject.doAs(WSSubject.java:133)
    at com.ibm.websphere.security.auth.WSSubject.doAs(WSSubject.java:91)
    at com.ibm.ws.asynchbeans.J2EEContext$DoAsProxy.run(J2EEContext.java:348)
    at java.security.AccessController.doPrivileged(AccessController.java:746)
    at com.ibm.ws.asynchbeans.J2EEContext.run(J2EEContext.java:1042)
    at com.ibm.ws.asynchbeans.WorkWithExecutionContextImpl.go(WorkWithExecutionContextImpl.java:199)
    at com.ibm.ws.asynchbeans.CJWorkItemImpl.run(CJWorkItemImpl.java:237)
    at com.ibm.ws.util.ThreadPool$Worker.run(ThreadPool.java:1909)
Caused by: com.ibm.db2.jcc.am.SqlTransactionRollbackException: The current transaction has been rolled back because of a deadlock or timeout. Reason code "2".. SQLCODE=-911, SQLSTATE=40001, DRIVER=4.25.1301
    at com.ibm.db2.jcc.am.b6.a(b6.java:797)
    at com.ibm.db2.jcc.am.b6.a(b6.java:66)
    at com.ibm.db2.jcc.am.b6.a(b6.java:140)
    at com.ibm.db2.jcc.am.k3.c(k3.java:2824)
    at com.ibm.db2.jcc.t4.ab.x(ab.java:1827)
    at com.ibm.db2.jcc.t4.ab.n(ab.java:950)
    at com.ibm.db2.jcc.t4.ab.a(ab.java:120)
    at com.ibm.db2.jcc.t4.p.a(p.java:50)
    at com.ibm.db2.jcc.t4.aw.b(aw.java:220)
    at com.ibm.db2.jcc.am.k4.bm(k4.java:3599)
    at com.ibm.db2.jcc.am.k4.a(k4.java:4644)
    at com.ibm.db2.jcc.am.k4.b(k4.java:4182)
    at com.ibm.db2.jcc.am.k4.be(k4.java:827)
    at com.ibm.db2.jcc.am.k4.executeUpdate(k4.java:801)
    at com.ibm.ws.rsadapter.jdbc.WSJdbcPreparedStatement.pmiExecuteUpdate(WSJdbcPreparedStatement.java:1304)
    at com.ibm.ws.rsadapter.jdbc.WSJdbcPreparedStatement.executeUpdate(WSJdbcPreparedStatement.java:845)
    at org.quartz.impl.jdbcjobstore.StdJDBCDelegate.updateTriggerStateFromOtherState(StdJDBCDelegate.java:1439)
    at org.quartz.impl.jdbcjobstore.JobStoreSupport.acquireNextTrigger(JobStoreSupport.java:2901)
    ... 17 more

Based on the updateTriggerStateFromOtherState function, I found the UPDATE SQL statement and added an index on the QRTZ_TRIGGERS table (SCHED_NAME, TRIGGER_NAME, TRIGGER_GROUP, TRIGGER_STATE), but this did not fix the issue.
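The message in the exception above covers two different situations ("deadlock or timeout"), and the reason code is what separates them. As I understand IBM's documentation for SQL0911N, reason code 2 indicates a deadlock and reason code 68 indicates a lock timeout. The following is a minimal sketch of a helper that classifies a `java.sql.SQLException` on that basis. The class name `Db2RollbackReason`, the `Kind` enum, and the message-parsing approach are assumptions made for illustration. They are not part of Quartz or the DB2 driver, and the sketch assumes the driver reports the SQLCODE through `getErrorCode()`.

```java
import java.sql.SQLException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Quartz or the DB2 JDBC driver.
class Db2RollbackReason {

    enum Kind { DEADLOCK, LOCK_TIMEOUT, OTHER }

    // Matches the 'Reason code "2"' fragment seen in the logged message.
    private static final Pattern REASON = Pattern.compile("Reason code \"(\\d+)\"");

    // Classifies an exception carrying SQLCODE -911 by its reason code.
    // Assumption: reason code 2 = deadlock, 68 = lock timeout (IBM SQL0911N).
    static Kind classify(SQLException e) {
        if (e.getErrorCode() != -911) {
            return Kind.OTHER;
        }
        String message = e.getMessage();
        if (message == null) {
            return Kind.OTHER;
        }
        Matcher m = REASON.matcher(message);
        if (!m.find()) {
            return Kind.OTHER;
        }
        switch (m.group(1)) {
            case "2":
                return Kind.DEADLOCK;
            case "68":
                return Kind.LOCK_TIMEOUT;
            default:
                return Kind.OTHER;
        }
    }

    public static void main(String[] args) {
        // Message text copied from the log above; SQLSTATE and SQLCODE set to match.
        SQLException e = new SQLException(
            "The current transaction has been rolled back because of a deadlock or timeout. "
                + "Reason code \"2\".. SQLCODE=-911, SQLSTATE=40001, DRIVER=4.25.1301",
            "40001",
            -911);
        System.out.println(classify(e)); // prints DEADLOCK
    }
}
```

Logging the classified kind next to the Quartz error would show whether every occurrence has the same reason code or whether the two cases alternate.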

jhouserizer commented 3 weeks ago

"The current transaction has been rolled back because of a deadlock or timeout. Reason code "2"." -- you're almost certainly hitting a timeout, not a deadlock. If there was a possible deadlock there would be lots of users reporting it (since there are millions of apps using Quartz).

If you've already added indexes, I'm not sure what else you can do to speed up your DB. Sounds like it's pretty overwhelmed. How many nodes in your cluster?

MonarchNing commented 3 weeks ago

Thanks a lot for your comments. There are just 2 nodes, in hot standby. Yes, maybe it is a timeout issue. No user has reported it, but at 23:00, when some jobs run and the issue arises, DB connections increase and users report that functions run slowly. Usually this deadlock or timeout occurs on the hour (when jobs start to run), and sometimes even every 20 or 30 minutes. In the quartz.properties file there is no configuration for txIsolationLevelSerializable or acquireTriggersWithinLock.

Would setting org.quartz.jobStore.txIsolationLevelSerializable=true or org.quartz.jobStore.acquireTriggersWithinLock=true help with this? Thanks for your help. By the way, this issue appeared after our Quartz upgrade from 2.0.2 to 2.3.2. Maybe some new configuration is needed in the quartz.properties file? I do not know what causes this issue.
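For reference, here is a minimal sketch of how the two settings asked about would look in properties-file syntax, loaded through `java.util.Properties` so the example stays self-contained. The class name `QuartzLockSettings` and the chosen values are illustrative assumptions. To my knowledge both properties default to false, and nothing in this thread establishes that either one resolves the error.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical holder class for illustration only.
class QuartzLockSettings {

    // Same syntax as a quartz.properties file. Values are illustrative, not a recommendation.
    static final String SNIPPET =
        "org.quartz.jobStore.isClustered=true\n"
            // Acquire the next triggers inside an explicit database lock.
            + "org.quartz.jobStore.acquireTriggersWithinLock=true\n"
            // Left at false here; true asks Quartz to use SERIALIZABLE isolation.
            + "org.quartz.jobStore.txIsolationLevelSerializable=false\n";

    static Properties load() {
        Properties props = new Properties();
        try {
            props.load(new StringReader(SNIPPET));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        Properties props = load();
        System.out.println(props.getProperty("org.quartz.jobStore.acquireTriggersWithinLock"));
    }
}
```

In a real deployment these keys would sit alongside the existing job store and data source settings in quartz.properties. Changing one setting at a time makes it easier to tell which one, if any, affects the error rate.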