Clustering: CheckinTask should not shut JVM down on failure

Gonzo17 commented 7 years ago

Hey guys,

we are having trouble with the CheckinTask because it shuts down the JVM on any failure within the task.

In our setup we use a MongoDB cluster to schedule tasks. Due to network problems, our microservice lost its connection to the MongoDB cluster, and after a timeout of 30 s the service was shut down:

2017-05-03 14:41:22.093 ERROR 50307 --- [pool-1-thread-1] c.n.quartz.mongodb.cluster.CheckinTask   : Node KAMPI000000951493815151972 could not check-in: Timed out after 30000 ms while waiting for a server that matches WritableServerSelector. Client view of cluster state is {type=STANDALONE, servers=[{address=localhost:27017, type=UNKNOWN, state=CONNECTING, exception={com.mongodb.MongoSocketOpenException: Exception opening socket}, caused by {java.net.ConnectException: Connection refused}}]

com.mongodb.MongoTimeoutException: Timed out after 30000 ms while waiting for a server that matches WritableServerSelector. Client view of cluster state is {type=STANDALONE, servers=[{address=localhost:27017, type=UNKNOWN, state=CONNECTING, exception={com.mongodb.MongoSocketOpenException: Exception opening socket}, caused by {java.net.ConnectException: Connection refused}}]
    at com.mongodb.connection.BaseCluster.createTimeoutException(BaseCluster.java:369)
    at com.mongodb.connection.BaseCluster.selectServer(BaseCluster.java:101)
    at com.mongodb.binding.ClusterBinding$ClusterBindingConnectionSource.<init>(ClusterBinding.java:75)
    at com.mongodb.binding.ClusterBinding$ClusterBindingConnectionSource.<init>(ClusterBinding.java:71)
    at com.mongodb.binding.ClusterBinding.getWriteConnectionSource(ClusterBinding.java:68)
    at com.mongodb.operation.OperationHelper.withConnection(OperationHelper.java:219)
    at com.mongodb.operation.MixedBulkWriteOperation.execute(MixedBulkWriteOperation.java:168)
    at com.mongodb.operation.MixedBulkWriteOperation.execute(MixedBulkWriteOperation.java:74)
    at com.mongodb.Mongo.execute(Mongo.java:781)
    at com.mongodb.Mongo$2.execute(Mongo.java:764)
    at com.mongodb.MongoCollectionImpl.executeSingleWriteRequest(MongoCollectionImpl.java:515)
    at com.mongodb.MongoCollectionImpl.update(MongoCollectionImpl.java:508)
    at com.mongodb.MongoCollectionImpl.updateOne(MongoCollectionImpl.java:355)
    at com.novemberain.quartz.mongodb.dao.SchedulerDao.checkIn(SchedulerDao.java:71)
    at com.novemberain.quartz.mongodb.cluster.CheckinTask.run(CheckinTask.java:46)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

2017-05-03 14:41:22.098  INFO 50307 --- [       Thread-4] org.quartz.core.QuartzScheduler          : Scheduler schedulerFactoryBean_$_KAMPI000000951493815151972 paused.
2017-05-03 14:41:22.104  INFO 50307 --- [       Thread-4] org.quartz.core.QuartzScheduler          : Scheduler schedulerFactoryBean_$_KAMPI000000951493815151972 shutting down.
2017-05-03 14:41:22.104  INFO 50307 --- [       Thread-4] org.quartz.core.QuartzScheduler          : Scheduler schedulerFactoryBean_$_KAMPI000000951493815151972 paused.
2017-05-03 14:41:50.194  INFO 50307 --- [       Thread-4] c.n.q.mongodb.cluster.CheckinExecutor    : Stopping CheckinExecutor for scheduler instance: KAMPI000000951493815151972
2017-05-03 14:41:50.195  INFO 50307 --- [       Thread-4] org.mongodb.driver.connection            : Closed connection [connectionId{localValue:4, serverValue:7}] to localhost:27017 because there was a socket exception raised on another connection from this pool.
2017-05-03 14:41:50.195  INFO 50307 --- [       Thread-4] org.mongodb.driver.connection            : Closed connection [connectionId{localValue:3, serverValue:6}] to localhost:27017 because there was a socket exception raised on another connection from this pool.
2017-05-03 14:41:50.196  INFO 50307 --- [       Thread-4] org.quartz.core.QuartzScheduler          : Scheduler schedulerFactoryBean_$_KAMPI000000951493815151972 shutdown complete.
2017-05-03 14:41:50.197  INFO 50307 --- [       Thread-4] org.mongodb.driver.connection            : Closed connection [connectionId{localValue:5, serverValue:8}] to localhost:27017 because there was a socket exception raised on another connection from this pool.

Process finished with exit code 1

So my problem is that I want my microservice to attempt to reconnect to the database instead of shutting down. If I don't use clustering, that works well. I understand that there is a need to prevent a job from being executed twice in a cluster. But what about pausing all triggers instead of shutting down the JVM or the scheduler?

Here is my configuration:

org.quartz.scheduler.instanceName=test-scheduler
org.quartz.scheduler.instanceId=AUTO
org.quartz.threadPool.threadCount=1
org.quartz.jobStore.class=com.novemberain.quartz.mongodb.MongoDBJobStore
org.quartz.jobStore.misfireThreshold=60000
org.quartz.jobStore.collectionPrefix=quartz_
org.quartz.jobStore.dbName=test
org.quartz.jobStore.isClustered=true
org.quartz.jobStore.clusterCheckinInterval=20000
org.quartz.jobStore.addresses: server1:27017,server2:27017
org.quartz.jobStore.username: user
org.quartz.jobStore.password: pw

michaelklishin commented 7 years ago

I don't see how this job store can shut down the entire JVM and there is no evidence of that in the log. There is an exception in a Quartz thread pool, that's it. Sorry but this is almost certainly a red herring.

Gonzo17 commented 7 years ago

I don't see how this job store can shut down the entire JVM

See line 26 in CheckinTask:

System.exit(1);

michaelklishin commented 7 years ago

Oh, that is fucked up :( @Gonzo17 I'm not sure how Quartz jobs are supposed to terminate cleanly, but definitely feel free to submit a PR that replaces the System.exit call with something more reasonable (such as logging).
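A rough sketch of what "something more reasonable" could look like, assuming the failure handling is pulled out behind a small callback. The names CheckinErrorHandler, LoggingErrorHandler and SafeCheckinTask are made up for illustration and are not part of quartz-mongodb; the real check-in call is represented by a plain Runnable.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical callback: decides what happens when a check-in fails.
interface CheckinErrorHandler {
    void handle(String instanceId, RuntimeException cause);
}

// Records and logs the failure instead of terminating the JVM.
final class LoggingErrorHandler implements CheckinErrorHandler {
    final List<String> messages = new ArrayList<>();

    @Override
    public void handle(String instanceId, RuntimeException cause) {
        String message = "Node " + instanceId + " could not check in: " + cause.getMessage();
        messages.add(message);
        System.err.println(message);
    }
}

// Stand-in for the check-in runnable; the actual database call is injected.
final class SafeCheckinTask implements Runnable {
    private final String instanceId;
    private final Runnable checkIn;
    private final CheckinErrorHandler errorHandler;

    SafeCheckinTask(String instanceId, Runnable checkIn, CheckinErrorHandler errorHandler) {
        this.instanceId = instanceId;
        this.checkIn = checkIn;
        this.errorHandler = errorHandler;
    }

    @Override
    public void run() {
        try {
            checkIn.run();
        } catch (RuntimeException e) {
            // Catch instead of rethrowing: a ScheduledExecutorService stops
            // re-running a periodic task once one execution throws.
            errorHandler.handle(instanceId, e);
        }
    }
}
```

With a handler like this, a failed check-in is only logged and the next scheduled run gets another chance to reach the database. It does nothing on its own to stop other nodes from treating this node as defunct, so it is not a complete answer to the clustering concern.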

michaelklishin commented 7 years ago

That said, the comment for that runnable suggests that this is not a rookie mistake. There is no obvious way to stop just Quartz.

Gonzo17 commented 7 years ago

Thanks for your quick reply, @michaelklishin. I don't know very much about Quartz yet, but my first idea was to pause the scheduler or put it in standby (difference discussed here). However, I don't know how to access the scheduler from there or when to resume it.

eonwhite commented 7 years ago

This issue unfortunately is a hard dealbreaker for using this library in production. It's just not tenable to shut down the whole JVM over the tiniest transient network hiccup.

I wish I understood Quartz better (or at all) or I'd submit a fix.

Rather than attempting to stop the Quartz scheduler (which the job store does not seem to have direct access to), maybe we could set a flag on the job store on a failed checkin that causes calls like acquireNextTriggers() to return empty results until the next successful checkin? Would something like that be sufficient to address the issue? I'm happy to work on a PR for this if someone can give me guidance on whether this is a viable approach.
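To make the flag idea concrete, here is a minimal sketch. CheckinGate and its methods are hypothetical names, not existing quartz-mongodb classes, and the Supplier stands in for whatever the job store really does to acquire triggers.

```java
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Supplier;

// Hypothetical helper: remembers whether the most recent check-in succeeded.
final class CheckinGate {
    // Start open so the node can work before its first check-in result arrives.
    private final AtomicBoolean lastCheckinOk = new AtomicBoolean(true);

    // Called by the check-in task after a successful check-in.
    void recordSuccess() {
        lastCheckinOk.set(true);
    }

    // Called by the check-in task when a check-in fails.
    void recordFailure() {
        lastCheckinOk.set(false);
    }

    // Runs the real acquisition only while the last check-in succeeded;
    // otherwise hands back an empty list so no triggers fire on this node.
    <T> List<T> acquireIfCheckedIn(Supplier<List<T>> acquire) {
        if (!lastCheckinOk.get()) {
            return Collections.<T>emptyList();
        }
        return acquire.get();
    }
}
```

The job store would call recordFailure() where the check-in task currently exits, and wrap the body of acquireNextTriggers() in acquireIfCheckedIn(...). This only stops new triggers from being acquired; jobs that are already running when the check-in fails are not affected, so whether it is enough to avoid duplicate execution is still an open question.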

Alternatively, if we don't know the correct fix, maybe we could add a config setting to turn off this shutdown behavior, perhaps combined with more configurability around how long other cluster members have to wait before declaring a scheduler "defunct".

@michaelklishin @pwojnowski thoughts?

Gonzo17 commented 6 years ago

Hey guys, is there any news on this topic? At the moment we need to restart our application after a connection loss to the database. We can live with that because our service does not work with critical data and this happens only once or twice a month. But as mentioned by @eonwhite, this is kind of a dealbreaker for really using it in production with critical data.

michaelklishin commented 6 years ago

Triggering (no pun intended) connection recovery is something I'd investigate.

@Gonzo17 this is open source software. If something is a dealbreaker for you, feel free to investigate a solution and submit a PR.

michaelklishin commented 6 years ago

I introduced a way to opt out. Properly pausing and unpausing Quartz is still TBD (and needs some research into the JDBC stores to see what they do).

michaelklishin / quartz-mongodb

Clustering: CheckinTask should not shut JVM down on failure #147