vsivsi / meteor-job-collection

A persistent and reactive job queue for Meteor, supporting distributed workers that can run anywhere.
https://atmospherejs.com/vsivsi/job-collection

Handling errors that occur within JobQueue #211

Open luixal opened 7 years ago

luixal commented 7 years ago

Hi,

I'm running JobCollection against a remote MongoDB (the app and the database run on different machines inside a local network) and, from time to time, this exception is thrown:

JobQueue: Received error from getWork():  { MongoError: connection 5 to 10.130.52.39:27017 timed out
    at Object.wait (/opt/servoscheduler/app/programs/server/node_modules/fibers/future.js:449:15)
    at SynchronousCursor._nextObject (packages/mongo/mongo_driver.js:1024:47)
    at SynchronousCursor.forEach (packages/mongo/mongo_driver.js:1058:22)
    at SynchronousCursor.map (packages/mongo/mongo_driver.js:1068:10)
    at Cursor.(anonymous function) [as map] (packages/mongo/mongo_driver.js:907:44)
    at JobCollectionBase._DDPMethod_getWork (packages/vsivsi_job-collection/src/shared.coffee:445:12)
    at JobCollection.<anonymous> (packages/vsivsi_job-collection/src/server.coffee:147:22)
    at packages/vsivsi_job-collection/src/server.coffee:80:50
    at withValue (packages/meteor.js:1122:17)
    at packages/meteor.js:445:45
    - - - - -
    at Function.MongoError.create (/opt/servoscheduler/app/programs/server/npm/node_modules/meteor/npm-mongo/node_modules/mongodb-core/lib/error.js:29:11)
    at Socket.<anonymous> (/opt/servoscheduler/app/programs/server/npm/node_modules/meteor/npm-mongo/node_modules/mongodb-core/lib/connection/connection.js:176:20)
    at Socket.g (events.js:291:16)
    at emitNone (events.js:86:13)
    at Socket.emit (events.js:185:7)
    at Socket._onTimeout (net.js:339:8)
    at ontimeout (timers.js:365:14)
    at tryOnTimeout (timers.js:237:5)
    at Timer.listOnTimeout (timers.js:207:5)
  name: 'MongoError',
  message: 'connection 5 to 10.130.52.39:27017 timed out' }
Exception in setInterval callback: MongoError: connection 5 to 10.130.52.39:27017 timed out
    at Object.wait (/opt/servoscheduler/app/programs/server/node_modules/fibers/future.js:449:15)
    at SynchronousCursor._nextObject (packages/mongo/mongo_driver.js:1024:47)
    at SynchronousCursor.forEach (packages/mongo/mongo_driver.js:1058:22)
    at Cursor.(anonymous function) [as forEach] (packages/mongo/mongo_driver.js:907:44)
    at JobCollection._promote_jobs (packages/vsivsi_job-collection/src/server.coffee:194:10)
    at withValue (packages/meteor.js:1122:17)
    at packages/meteor.js:445:45
    at runWithEnvironment (packages/meteor.js:1176:24)
    - - - - -
    at Function.MongoError.create (/opt/servoscheduler/app/programs/server/npm/node_modules/meteor/npm-mongo/node_modules/mongodb-core/lib/error.js:29:11)
    at Socket.<anonymous> (/opt/servoscheduler/app/programs/server/npm/node_modules/meteor/npm-mongo/node_modules/mongodb-core/lib/connection/connection.js:176:20)
    at Socket.g (events.js:291:16)
    at emitNone (events.js:86:13)
    at Socket.emit (events.js:185:7)
    at Socket._onTimeout (net.js:339:8)
    at ontimeout (timers.js:365:14)
    at tryOnTimeout (timers.js:237:5)
    at Timer.listOnTimeout (timers.js:207:5)

Any idea on how to solve or handle this?

thanks!

vsivsi commented 7 years ago

What environment is your worker code running in? Meteor (server) or node.js? Under Meteor, job-collection doesn't do anything special beyond what any other Meteor collection does regarding connection handling to MongoDB; in fact, it uses the same code. Under plain node.js, you are responsible for handling all MongoDB connection issues yourself, as you would with any other node.js app that connects to a database. Hope that helps.
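For what it's worth, the generic knob here (nothing specific to job-collection, and purely illustrative values) is the MongoDB connection string that Meteor reads from MONGO_URL, e.g.:

MONGO_URL="mongodb://<host>:27017/<db>?socketTimeoutMS=60000&connectTimeoutMS=30000"

socketTimeoutMS and connectTimeoutMS are standard MongoDB connection-string options; whether they help with your particular timeouts depends on your network setup.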

luixal commented 7 years ago

I'm running two different Meteor apps; these are the settings for the one running the job server:

{
  "public": {
    "version": "0.0.1",
    "environment": "production"
  },
  "env": {
    "PORT": "3333",
    "MONGO_URL": "mongodb://10.130.52.39:27017/app"
  }
}

I guess the problem is that, eventually, the connection to MongoDB breaks and causes that exception (which wouldn't be a problem in itself; I'd just prefer to catch the exception and write a custom log entry), but I think (can't assure this, but I'm 99% sure) that the job server (or the whole second Meteor app) doesn't try to reconnect.

Any ideas?

vsivsi commented 7 years ago

Hi, okay there are two issues here that I can see.

1) You are using a remote MongoDB instance for the backing DB for your job-collection "app". As you note, this is not a problem in itself, but connection management issues will eventually arise (as you are seeing). All I can really say about it is that a jobCollection instance under these circumstances is identical to a regular old Meteor Collection. Anything you would need to do to keep that DB connection healthy for a Meteor Collection, you will also need to do for a jobCollection. There is no difference.

2) Since you are using JobQueue to manage your workers (a level of abstraction above the jc.getWork() mechanism it is built upon), you give up some flexibility and control. Your JobQueue instance calls jc.getWork() internally (via a Meteor method call), and when that call is handled on the server, it is the method call that interacts with MongoDB via the underlying Meteor Collection.

Since the JobQueue (returned by processJobs()) is a long-running async process meant to survive transient errors, etc., and meant to run remotely from the server it is connected to (e.g. in a vanilla node.js app, without Meteor), it does not assume or have access to any of the Meteor error-handling features. Its only interaction with the server is via the aforementioned Meteor method generated by getWork(), which can fail in a number of ways.

Right now, JobQueue simply logs such failures using console.error(). The code for this is here: https://github.com/vsivsi/meteor-job/blob/b7b87d0d81f38f4eac5d67f0e8280277adc954a2/src/job_class.coffee#L154

This isn't ideal if you would like to capture these errors and log them elsewhere or otherwise act on them. I can think of a couple of ways to remedy this issue. One you can do without my help is to create a custom version of the JobQueue class that does something else at the linked line of code above. That's not great, because you would essentially be forking that class, but it would work.

I'd be happy to add some functionality to JobQueue to try to cover this case. Here are some ideas:

jc.processJobs() could accept an optional callback function that is invoked anytime there is an internal error, and if no callback is provided the current behavior will continue. This would be the simplest to implement, but isn't terribly flexible.
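As a rough sketch (the errorCallback option name and the job type are hypothetical here, since nothing has been implemented yet), usage might look like this:

var workers = jc.processJobs('myJobType', {
  pollInterval: 5000,
  // Hypothetical option: invoked instead of the default console.error call
  errorCallback: function (err) {
    // write to a custom log, alert, restart the queue, etc.
    console.log('JobQueue internal error:', err.message);
  }
}, function (job, callback) {
  // normal worker function
  job.done();
  callback();
});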

JobQueue (returned by jc.processJobs()) could become an event emitter, allowing you to register handlers for various events, the first of which could be an 'error' event covering this case. This is obviously much more flexible for handling other future types of events, etc., but it is a bit more work to implement.
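Sketched out the same way (again hypothetical; JobQueue does not currently emit events), that could look like:

var workers = jc.processJobs('myJobType', { pollInterval: 5000 }, function (job, callback) {
  // normal worker function
  job.done();
  callback();
});

// 'error' would be the first supported event; others could be added later
workers.on('error', function (err) {
  console.log('JobQueue internal error:', err.message);
});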

Thoughts on this?

vsivsi commented 7 years ago

@luixal I have prototyped the first idea above: adding an optional error callback to processJobs that receives any runtime errors from processJobs and silences the default console.error output. It is implemented on this branch, but I haven't yet integrated it into the meteor package.

https://github.com/vsivsi/meteor-job/tree/errorCallback#q--jobprocessjobsroot-type-options-worker

vsivsi commented 7 years ago

These changes are now also on the errorCallback branch of the main meteor-job-collection project. https://github.com/vsivsi/meteor-job-collection/tree/errorCallback

luixal commented 7 years ago

I'll try the errorCallback branch with a simple callback that logs the error details whenever the error happens (not sure when that will be) and report back here.

I guess the problem is that, when an error closes or hangs the connection, the job server gets stuck; I would need to re-establish the connection and restart the job server, which I could easily do from this callback.

Thanks!

vsivsi commented 7 years ago

Okay, let me know how it goes. Once you confirm that this solution works for you I can publish the changes to npm/Atmosphere within a day.