michaelklishin / quartz-mongodb

A MongoDB-based store for the Quartz scheduler. This fork strives to be as feature complete as possible. Originally by MuleSoft.

Non-clustered Quartz never relocks when a lock expires and does not execute the job when TriggerAndJobPersister.storeTrigger is briefly unable to connect to MongoDB #192

Closed parashivamurthykr closed 4 years ago

parashivamurthykr commented 4 years ago

The Quartz scheduler is running in non-clustered mode with a single instanceId. While retrieving documents from the quartz_jobs collection, it throws com.mongodb.MongoSocketReadException: Prematurely reached end of stream, which is caught in the QuartzSchedulerThread.run method.

QuartzScheduler_QuartzSchedulerThread] trace: trnx: ERROR o.quartz.core.QuartzSchedulerThread o.quartz.core.QuartzSchedulerThread.run 435 - Runtime error occurred in main trigger firing loop.

    at com.mongodb.internal.connection.SocketStream.read(SocketStream.java:92)
    at com.mongodb.internal.connection.InternalStreamConnection.receiveMessage(InternalStreamConnection.java:425)
    at com.mongodb.internal.connection.InternalStreamConnection.sendAndReceive(InternalStreamConnection.java:255)
    at com.mongodb.internal.connection.DefaultServer$DefaultServerProtocolExecutor.execute(DefaultServer.java:200)
    at com.mongodb.internal.connection.DefaultServerConnection.command(DefaultServerConnection.java:131)
    at com.mongodb.internal.connection.DefaultServerConnection.command(DefaultServerConnection.java:123)
    at com.mongodb.operation.CommandOperationHelper.executeWrappedCommandProtocol(CommandOperationHelper.java:233)
    at com.mongodb.operation.FindOperation$1.call(FindOperation.java:701)
    at com.mongodb.operation.OperationHelper.withConnectionSource(OperationHelper.java:457)
    at com.mongodb.operation.FindOperation.execute(FindOperation.java:695)
    at com.mongodb.client.internal.MongoClientDelegate$DelegateOperationExecutor.execute(MongoClientDelegate.java:179)
    at com.mongodb.client.internal.FindIterableImpl.first(FindIterableImpl.java:198)
    at com.novemberain.quartz.mongodb.TriggerAndJobPersister.storeTrigger(TriggerAndJobPersister.java:118)
    at com.novemberain.quartz.mongodb.MongoDBJobStore.triggersFired(MongoDBJobStore.java:403)
    at org.quartz.core.QuartzSchedulerThread.run(QuartzSchedulerThread.java:353)

c.n.q.mongodb.cluster.CheckinTask checks in successfully, as it is able to connect to MongoDB again about 8 seconds after the scheduler threw the MongoSocketReadException.

A lock had already been created when the scheduler threw the exception (due to the MongoDB connectivity issue), so entries were still present in the quartz_triggers and quartz_locks collections when the scheduler restarted. The scheduler therefore failed while inserting a new lock, because one already existed.

TriggerRunner.acquireNextTriggers -> LockManager.tryLock(key)

c.n.quartz.mongodb.LockManager c.n.quartz.mongodb.LockManager.tryLock 65 - Failed to lock trigger database_provisioner.database_provisioner, reason: WriteError{code=11000, message='E11000 duplicate key error collection: analytics_scheduler.quartz_locks index: keyGroup_1_keyName_1_type_1 dup key: { : "sample", : "sample", : "t" }', details={ }}
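
For context, here is a minimal sketch of why tryLock fails at this point, assuming a lock collection shaped like the error above (quartz_locks with a unique index on keyGroup, keyName, type); the field names and the helper class are illustrative, not the library's actual LocksDao code:

    import com.mongodb.MongoWriteException;
    import com.mongodb.client.MongoCollection;
    import org.bson.Document;

    public class LockInsertSketch {
        // Returns true if a lock document was inserted, false if a lock for the same
        // trigger already exists (unique index on keyGroup/keyName/type -> code 11000).
        public static boolean tryLock(MongoCollection<Document> locks, String group, String name) {
            Document lock = new Document("keyGroup", group)
                    .append("keyName", name)
                    .append("type", "t")
                    .append("time", new java.util.Date());
            try {
                locks.insertOne(lock);
                return true;
            } catch (MongoWriteException e) {
                if (e.getError().getCode() == 11000) {
                    return false; // a stale lock left over from before the exception still holds the key
                }
                throw e;
            }
        }
    }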

It then tries to execute LockManager.relockExpired to re-acquire the lock, but it cannot, for the following reason: in ExpiryCalculator.isLockExpired, the calculated elapsed time (clock.millis() - lockTime.getTime()) is less than the timeout (triggerTimeoutMillis = 10 * 60 * 1000L, i.e. 10 minutes, initialised when the job store starts up).

So in this case the scheduler cannot relock, because the lock is not yet considered expired by the calculation above. The scheduler can relock only if both of the following conditions hold (see the sketch below):

1) isLockExpired: long elapsedTime = clock.millis() - lockTime.getTime(); elapsedTime > 10 * 60 * 1000L.

2) hasDefunctScheduler: scheduler.isDefunct(clock.millis()) && schedulerDao.isNotSelf(scheduler), which in turn requires:

   a) (lastCheckinTime + checkinInterval + TIME_EPSILON) < clock.millis(), where lastCheckinTime is the scheduler's last check-in time and checkinInterval and TIME_EPSILON both default to 7500 ms, so the scheduler must not have checked in for at least the last ~15 seconds to be considered defunct;

   b) schedulerDao.isNotSelf(scheduler): the instanceId on the lock must not be the same as that of the scheduler trying to relock.
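
A condensed sketch of these two checks, paraphrased from this issue rather than copied from the quartz-mongodb source (the constants use the default values quoted above; field and method names are illustrative):

    class RelockEligibilitySketch {
        static final long TRIGGER_TIMEOUT_MILLIS = 10 * 60 * 1000L; // 10 minutes
        static final long CHECKIN_INTERVAL = 7500L;                  // default check-in interval
        static final long TIME_EPSILON = 7500L;                      // default epsilon

        // Condition 1: the lock itself must be older than the trigger timeout.
        static boolean isLockExpired(long nowMillis, long lockTimeMillis) {
            long elapsedTime = nowMillis - lockTimeMillis;
            return elapsedTime > TRIGGER_TIMEOUT_MILLIS;
        }

        // Condition 2: the lock owner must have missed its check-in window AND be a
        // different scheduler instance than the one trying to relock.
        static boolean hasDefunctScheduler(long nowMillis, long lastCheckinTime, boolean isNotSelf) {
            boolean missedCheckin = (lastCheckinTime + CHECKIN_INTERVAL + TIME_EPSILON) < nowMillis;
            return missedCheckin && isNotSelf;
        }
    }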

In our case the scheduler has the same instanceId because it is running in non-clustered mode, and it failed to relock because the lock was created within the last 10 minutes. The lock must be older than 10 minutes before it is treated as expired and can be relocked.

So each time the trigger becomes eligible to run, the scheduler is unable to create a lock because one already exists (created just before the MongoSocketReadException was thrown), and when it tries to relock it fails because that lock is not yet 10 minutes old.

This loop continues until we reach a point in time where the lock's creation time is more than 10 minutes in the past.

Now the first condition for relocking (isLockExpired) passes, since the lock time is more than 10 minutes before the current time (long elapsedTime = clock.millis() - lockTime.getTime(); elapsedTime > 10 * 60 * 1000L). But it still fails the hasDefunctScheduler conditions:

1) (lastCheckinTime + ~15 s) can never be less than the current time, because the scheduler keeps running c.n.q.mongodb.cluster.CheckinTask, which updates lastCheckinTime;

2) schedulerDao.isNotSelf(scheduler) can never be true, because the lock carries the same instanceId as the scheduler's own instanceId (the scheduler is running in non-clustered mode with a single instanceId).
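
To make the deadlock concrete, here are the numbers from this scenario plugged into the sketch above (the timestamps are illustrative):

    long now = System.currentTimeMillis();
    long lockTime = now - 11 * 60 * 1000L;  // lock created 11 minutes ago, before the exception
    long lastCheckin = now - 7_000L;        // this same instance checked in 7 seconds ago

    boolean expired = RelockEligibilitySketch.isLockExpired(now, lockTime);            // true
    boolean defunct = RelockEligibilitySketch.hasDefunctScheduler(now, lastCheckin,
            /* isNotSelf = */ false);                                                  // false
    // expired && defunct is false, so relockExpired never succeeds and the trigger
    // stays locked until the scheduler is restarted.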

So the scheduler will never be able to execute the task, because it can never relock the expired lock. The lock can only be recovered by restarting the scheduler.

In the above scenario, even if the scheduler were running in clustered mode, the first hasDefunctScheduler condition could not be met, because the scheduler's check-in task keeps updating the last check-in time.

We have seen this issue in production, where com.novemberain.quartz.mongodb.TriggerAndJobPersister.storeTrigger(TriggerAndJobPersister.java:118) is unable to connect to MongoDB and throws the exception after about 7 seconds. About 15 seconds later the scheduler runs c.n.q.mongodb.cluster.CheckinTask, which is always able to connect to MongoDB and never throws an exception.

To reproduce this issue, bring MongoDB down while the following call inside com.novemberain.quartz.mongodb.TriggerAndJobPersister.storeTrigger (TriggerAndJobPersister.java:118) is executing: jobDao.getJob(Keys.toFilter(newTrigger.getJobKey()));

and bring MongoDB back up once the exception has been thrown.

In order to resolve this corner case, there should be a check for whether the scheduler is running in clustered mode. If it is not, the hasDefunctScheduler check should be skipped while relockExpired executes, because a non-clustered scheduler always has the same instanceId and the same instance keeps updating lastCheckinTime. hasDefunctScheduler has to be changed so that it returns true for a non-clustered scheduler.

Also, the socket connect and read timeouts currently default to 2000 ms. If we had control over these settings, we might not hit the above corner case at all, since it only occurs when MongoDB is unreachable for 10-20 seconds.

We should have Quartz properties with which we can set maxConnectionIdleTime, the socket connect timeout, and the socket read timeout.
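
For illustration, these are the underlying MongoDB Java driver settings such Quartz properties would need to map to. This sketch builds a client directly and does not claim the job store currently exposes these options (the timeout values are examples only):

    import com.mongodb.ConnectionString;
    import com.mongodb.MongoClientSettings;
    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import java.util.concurrent.TimeUnit;

    public class MongoTimeoutSettingsSketch {
        public static MongoClient buildClient(String uri) {
            MongoClientSettings settings = MongoClientSettings.builder()
                    .applyConnectionString(new ConnectionString(uri))
                    .applyToSocketSettings(b -> b
                            .connectTimeout(5, TimeUnit.SECONDS)            // socket connect timeout
                            .readTimeout(30, TimeUnit.SECONDS))             // socket read timeout
                    .applyToConnectionPoolSettings(b -> b
                            .maxConnectionIdleTime(60, TimeUnit.SECONDS))   // maxConnectionIdleTime
                    .build();
            return MongoClients.create(settings);
        }
    }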

@michaelklishin The above scenario is also mentioned in the issue comment linked below:

https://github.com/michaelklishin/quartz-mongodb/issues/114#issuecomment-366423026

One other example scenario that happened to me: my MongoDB went down and the main thread broke with the following error: [2018-02-13 21:24:48.819] [scheduler_QuartzSchedulerThread] [ERROR] [o.quartz.core.QuartzSchedulerThread] [Runtime error occurred in main trigger firing loop.] [com.mongodb.MongoSocketReadException: Exception receiving message.

This results in some locks not getting cleaned up (unlocked). When the main thread runs again (it is in a while(true) loop), it calls qsRsrcs.getJobStore().acquireNextTriggers, which then calls lockManager.relockExpired for these locks, and expiryCalculator.isTriggerLockExpired(existingLock) returns false even after triggerTimeoutMillis, because the hasDefunctScheduler implementation inside isTriggerLockExpired does not handle the case where the scheduler is non-clustered.

It can be solved if we replace the following:

    private boolean hasDefunctScheduler(String schedulerId) {
        Scheduler scheduler = schedulerDao.findInstance(schedulerId);
        if (scheduler == null) {
            log.debug("No such scheduler: {}", schedulerId);
            return false;
        }
        return scheduler.isDefunct(clock.millis()) && schedulerDao.isNotSelf(scheduler);
    }

with:

    private boolean hasDefunctScheduler(String schedulerId) {
        // Check first whether the scheduler is clustered; if it is not, return true
        // so that relockExpired can re-acquire the lock.
        // (An "isClustered" flag would have to be made available here, e.g. from the
        // job store configuration; it is not an existing field of this class.)
        if (!isClustered) {
            return true;
        }
        Scheduler scheduler = schedulerDao.findInstance(schedulerId);
        // . . . rest unchanged from the original method above . . .
    }