quartznet / quartznet

Quartz Enterprise Scheduler .NET
http://www.quartz-scheduler.net/
Apache License 2.0
6.46k stars 1.68k forks source link

ClusterManager GetConnection() intermittently fails #2037

Open petertirrell opened 1 year ago

petertirrell commented 1 year ago

Describe the bug

Twice now this week we have experienced this issue, after quartz has generally run for a long time without any problems. It causes our application jobs to stop and requires a restart of the application to get the quartz jobs processing again. I've pasted the stack trace of the error below, it looks like there is some sort of SQL Server execution timeout in trying to establish the connection, though it's not clear to me what is executing that would time out. Each time that this has happened we haven't observed anything that looks out of the ordinary on the SQL Server logs so I'm looking for some guidance as to what might be happening during this error or ways we might be able to mitigate it. Is there some sort of retry logic as an option I should be enabling? Quartz has its own dedicated database and we only have a few jobs, and we are only running a single node as well. It seems like there shouldn't be a ton of activity that would be causing the database itself to hang.

I'm exploring updating to the latest library version but I also didn't see anything obvious in the change logs that lined up with what I'm seeing.

Version used

v3.4

To Reproduce

n/a

Expected behavior

I'm either looking for it to retry after the error happens since it seems to be intermittent, or provide a better description of what the problem was.

Additional context Stack trace I'm getting:

[2023-06-07 13:57:24.760 UTC]|[ERROR]|Quartz.Core.ErrorLogger|[149]|An error occurred while scanning for the next trigger to fire.|
Quartz.JobPersistenceException: Failure setting up connection. ---> System.Data.SqlClient.SqlException: Execution Timeout Expired.  The timeout period elapsed prior to completion of the operation or the server is not responding.
The request failed to run because the batch is aborted, this can be caused by abort signal sent from client, or another request is running in the same session, which makes the session busy.
Operation cancelled by user. ---> System.ComponentModel.Win32Exception: The wait operation timed out
   --- End of inner exception stack trace ---
   at System.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
   at System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)
   at System.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady)
   at System.Data.SqlClient.TdsParser.Run(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj)
   at System.Data.SqlClient.TdsParser.TdsExecuteTransactionManagerRequest(Byte[] buffer, TransactionManagerRequestType request, String transactionName, TransactionManagerIsolationLevel isoLevel, Int32 timeout, SqlInternalTransaction transaction, TdsParserStateObject stateObj, Boolean isDelegateControlRequest)
   at System.Data.SqlClient.SqlInternalConnectionTds.ExecuteTransactionYukon(TransactionRequest transactionRequest, String transactionName, IsolationLevel iso, SqlInternalTransaction internalTransaction, Boolean isDelegateControlRequest)
   at System.Data.SqlClient.SqlInternalConnection.BeginSqlTransaction(IsolationLevel iso, String transactionName, Boolean shouldReconnect)
   at System.Data.SqlClient.SqlConnection.BeginTransaction(IsolationLevel iso, String transactionName)
   at System.Data.SqlClient.SqlConnection.BeginDbTransaction(IsolationLevel isolationLevel)
   at Quartz.Impl.AdoJobStore.JobStoreSupport.GetConnection()
   --- End of inner exception stack trace ---
   at Quartz.Impl.AdoJobStore.JobStoreSupport.GetConnection()
   at Quartz.Impl.AdoJobStore.JobStoreSupport.<ExecuteInNonManagedTXLock>d__269`1.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at Quartz.Impl.AdoJobStore.JobStoreSupport.<ExecuteInNonManagedTXLock>d__269`1.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at Quartz.Core.QuartzSchedulerThread.<Run>d__28.MoveNext() [See nested exception: System.Data.SqlClient.SqlException (0x80131904): Execution Timeout Expired.  The timeout period elapsed prior to completion of the operation or the server is not responding.
The request failed to run because the batch is aborted, this can be caused by abort signal sent from client, or another request is running in the same session, which makes the session busy.
Operation cancelled by user. ---> System.ComponentModel.Win32Exception (0x80004005): The wait operation timed out
   at System.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
   at System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)
   at System.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady)
   at System.Data.SqlClient.TdsParser.Run(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj)
   at System.Data.SqlClient.TdsParser.TdsExecuteTransactionManagerRequest(Byte[] buffer, TransactionManagerRequestType request, String transactionName, TransactionManagerIsolationLevel isoLevel, Int32 timeout, SqlInternalTransaction transaction, TdsParserStateObject stateObj, Boolean isDelegateControlRequest)
   at System.Data.SqlClient.SqlInternalConnectionTds.ExecuteTransactionYukon(TransactionRequest transactionRequest, String transactionName, IsolationLevel iso, SqlInternalTransaction internalTransaction, Boolean isDelegateControlRequest)
   at System.Data.SqlClient.SqlInternalConnection.BeginSqlTransaction(IsolationLevel iso, String transactionName, Boolean shouldReconnect)
   at System.Data.SqlClient.SqlConnection.BeginTransaction(IsolationLevel iso, String transactionName)
   at System.Data.SqlClient.SqlConnection.BeginDbTransaction(IsolationLevel isolationLevel)
   at Quartz.Impl.AdoJobStore.JobStoreSupport.GetConnection()
ClientConnectionId:5994d099-837e-492f-9c8d-faf690624478
Error Number:-2,State:0,Class:11]
[2023-06-07 13:57:36.642 UTC]|[ERROR]|Quartz.Impl.AdoJobStore.ClusterManager|[145]|Error managing cluster: Failure setting up connection.|
Quartz.JobPersistenceException: Failure setting up connection. ---> System.Data.SqlClient.SqlException: Execution Timeout Expired.  The timeout period elapsed prior to completion of the operation or the server is not responding.
Operation cancelled by user. ---> System.ComponentModel.Win32Exception: The wait operation timed out
lahma commented 1 year ago

Usually this is a sign of something blocking the server, like running backups. This error should be caught and should not stop cluster manager from operating.