michaelklishin / quartz-mongodb

A MongoDB-based store for the Quartz scheduler. This fork strives to be as feature complete as possible. Originally by MuleSoft.

Mongo WriteConcern W parameter configurable in quartz.properties #185

Closed gregRzn closed 4 years ago

gregRzn commented 4 years ago

Long story short: with only 2 data-bearing nodes in a mongo replica set and the write concern hardcoded to 'majority', the whole of Quartz gets stuck when one mongo node is down.

Details: Hello, I had to create a fork to be able to change the Mongo WriteConcern W parameter. The default one is 'majority'. With my change it is possible to set a different W parameter in quartz.properties, for example: org.quartz.jobStore.mongoOptionWriteConcernW=W1
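A minimal sketch of how such a property could be resolved, assuming the fallback stays 'majority' when the key is absent. The class and method names here are hypothetical and are not the fork's actual code; only the property key and the W1 value come from the description above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper for illustration; not part of quartz-mongodb.
final class WriteConcernSetting {
    // Property key as given in the issue text.
    static final String KEY = "org.quartz.jobStore.mongoOptionWriteConcernW";
    // Assumed fallback, matching the hardcoded behaviour described in the issue.
    static final String DEFAULT_W = "majority";

    // Returns the configured W value, or the default when the key is absent or blank.
    static String resolveW(Properties props) {
        String value = props.getProperty(KEY);
        return (value == null || value.trim().isEmpty()) ? DEFAULT_W : value.trim();
    }

    // Parses quartz.properties-style text into a Properties object.
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        System.out.println(resolveW(parse(KEY + "=W1")));
        System.out.println(resolveW(parse("")));
    }
}
```

Keeping 'majority' as the fallback would leave existing deployments unchanged unless they opt in to a different value.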

Explanation: I have a case where a mongo replica set contains primary, secondary and arbiter nodes. The secondary node is used only as a backup; it's not used for reading. I set in quartz.properties: org.quartz.jobStore.checkInErrorHandler=com.novemberain.quartz.mongodb.cluster.NoOpErrorHandler

to prevent a JVM crash caused by KamikazeErrorHandler when one of the mongo nodes is down. In this case Spring and the mongo driver handle the loss of one mongo node properly. However, quartz-mongodb has the write concern hardcoded to 'majority'. This means that a save operation has to be confirmed by 2 nodes, but there is only one active data node. When one of the nodes is down the method: LockManager::tryLock

physically saves the lock to the working DB node, but it also throws MongoWriteConcernException. This exception leads to a problem where the lock is in the DB but the condition at TriggerRunner.java:124 is not met:

if (lockManager.tryLock(key)) {
    if (prepareForFire(noLaterThanDate, trigger)) {

So the prepareForFire method is never invoked and the whole of Quartz is stuck.
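The failure mode described above can be reproduced in a small simulation. This is only a sketch: every class below is a hypothetical stand-in rather than quartz-mongodb code, and it assumes the write-concern failure ends up reported as "lock not acquired" (the same outcome follows if the exception propagates instead, since the condition is still not met).

```java
import java.util.HashSet;
import java.util.Set;

// Simplified simulation; these classes are stand-ins, not quartz-mongodb code.
final class StuckLockDemo {
    // Stand-in for MongoWriteConcernException.
    static final class WriteConcernFailure extends RuntimeException {
        WriteConcernFailure(String message) { super(message); }
    }

    // Stand-in for the locks collection on the surviving data node.
    static final class FakeLockStore {
        final Set<String> locks = new HashSet<>();
        final boolean majorityReachable;

        FakeLockStore(boolean majorityReachable) {
            this.majorityReachable = majorityReachable;
        }

        void insert(String key) {
            locks.add(key); // the write is applied on the working node first
            if (!majorityReachable) {
                // ...but the required acknowledgements never arrive
                throw new WriteConcernFailure("majority not acknowledged");
            }
        }
    }

    // Assumption: a write-concern failure is treated as a failed lock attempt.
    static boolean tryLock(FakeLockStore store, String key) {
        try {
            store.insert(key);
            return true;
        } catch (WriteConcernFailure e) {
            return false;
        }
    }

    // Returns true if the trigger would reach the prepareForFire step.
    static boolean wouldPrepareForFire(FakeLockStore store, String key) {
        return tryLock(store, key);
    }

    public static void main(String[] args) {
        FakeLockStore degraded = new FakeLockStore(false);
        System.out.println("fires: " + wouldPrepareForFire(degraded, "trigger-1"));
        System.out.println("lock persisted: " + degraded.locks.contains("trigger-1"));
    }
}
```

In the degraded case the lock document exists but the caller believes locking failed, which is the inconsistent state the issue describes.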

A few lines below there is this code: } else if (lockManager.relockExpired(key)) { whose purpose is to remove expired locks. However, in my case this never happens, because once the lock time has passed there is another condition that is not met here: com.novemberain.quartz.mongodb.util.ExpiryCalculator#isTriggerLockExpired

return isLockExpired(lock, triggerTimeoutMillis) && hasDefunctScheduler(schedulerId);

I don't have a defunct quartz node, so this is always false and prevents relocking.
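A small sketch of why the conjunction above never becomes true in this scenario. The method bodies are assumptions made for illustration (the real ExpiryCalculator works on lock documents and scheduler check-in data); only the shape of the final condition is taken from the line quoted above.

```java
// Illustrative model of the expiry check; parameter types are simplified assumptions.
final class ExpiryDemo {
    // Assumed rule: a lock is expired once its age exceeds the timeout.
    static boolean isLockExpired(long lockAgeMillis, long timeoutMillis) {
        return lockAgeMillis > timeoutMillis;
    }

    // Same shape as the quoted condition: both parts must hold.
    static boolean isTriggerLockExpired(long lockAgeMillis, long timeoutMillis,
                                        boolean schedulerDefunct) {
        return isLockExpired(lockAgeMillis, timeoutMillis) && schedulerDefunct;
    }

    public static void main(String[] args) {
        // Lock is long past its timeout, but the scheduler that owns it is alive.
        System.out.println(isTriggerLockExpired(600_000L, 300_000L, false));
    }
}
```

However old the lock gets, a live scheduler keeps the second operand false, so the relock branch is never taken.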

When I set the write concern to W1, quartz-mongodb works with a single mongo node, and everything goes back to normal once the disabled mongo node is working again.

Bonus: Here's a link to the mongo documentation: https://docs.mongodb.com/manual/reference/write-concern/#calculating-majority-for-write-concern

If you scroll to the bottom of the page, there is a tip advising against the majority write concern in a Primary - Secondary - Arbiter (PSA) architecture:

Avoid using a "majority" write concern with a (P-S-A) or other topologies that require all data-bearing voting members to be available to acknowledge the writes. Customers who want the durability guarantees of using a "majority" write concern should instead deploy a topology that does not require all data bearing voting members to be available (e.g. P-S-S).

This is the architecture that I'm using, and that's why I need to override the default write concern in quartz-mongodb.
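To make the PSA problem concrete, here is a sketch of the majority calculation as the linked MongoDB page describes it: the smaller of the majority of all voting members (arbiters included) and the number of data-bearing voting members. The class and method names are hypothetical.

```java
// Illustrative arithmetic only; names are hypothetical.
final class MajorityCalc {
    // Smaller of: majority of all voting members, and count of data-bearing voting members.
    static int writeMajority(int votingMembers, int dataBearingVotingMembers) {
        int votingMajority = votingMembers / 2 + 1;
        return Math.min(votingMajority, dataBearingVotingMembers);
    }

    // A "majority" write can be acknowledged only if enough data-bearing nodes are up.
    static boolean canSatisfyMajority(int votingMembers, int dataBearingVotingMembers,
                                      int dataBearingUp) {
        return dataBearingUp >= writeMajority(votingMembers, dataBearingVotingMembers);
    }

    public static void main(String[] args) {
        // P-S-A: 3 voting members, 2 of them data-bearing, one data node down.
        System.out.println("PSA, one data node down: " + canSatisfyMajority(3, 2, 1));
        // P-S-S: 3 voting members, all data-bearing, one data node down.
        System.out.println("PSS, one data node down: " + canSatisfyMajority(3, 3, 2));
    }
}
```

With PSA the required count is 2 and only 2 data-bearing nodes exist, so losing either one makes 'majority' unsatisfiable, while PSS tolerates the loss of one node.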