msavin / SteveJobs

A simple jobs queue that just works (for Meteor.js)

V3.1.1 - Job runner stopped, jobs are not running anymore #92

Closed · vparpoil closed this issue 3 years ago

vparpoil commented 4 years ago

Hello, we have a setup with one instance of Meteor running daily jobs: one job runs once a day, the other runs twice a day. I just figured out that after my last app restart one month ago the job runner stopped working, so no jobs have run during the last month :/ . This restart happened on May 24th at 15:38 CEST, and updates to the jobs_dominator_3 collection stopped a few hours later. No job ran in the period it was active. The jobs_dominator_3 collection contains this:

{ 
    "_id" : "vjAuiwYqsDxiggGoH", 
    "serverId" : "RwQroqTtD7DBJeDX5", 
    "created" : ISODate("2020-05-24T13:39:10.442+0000"), 
    "lastPing" : ISODate("2020-05-24T18:30:12.180+0000")
}

Since lastPing is stuck one month ago, about one hour after the last app restart, I think some kind of internal process crashed, but I cannot find any log about it. Any hint on what may have caused the crash? Any suggestions to avoid future crashes of this kind? Is it possible that a temporary MongoDB unavailability caused this? On our side we will implement external monitoring of running jobs. Thanks

msavin commented 4 years ago

Hey Victor, thanks for reporting this. I haven’t seen this kind of issue before and I’ve used the queue in many places.

If I did have issues with it before, it would typically be due to the database. Can you tell me more about your database setup and what kind of jobs code you have set up?

vparpoil commented 4 years ago

Thank you for your answer. The database setup is simple: a single MongoDB node (v4.2.1) running on a separate server, with replication in place (but no nodes to replicate to; we only set it up as a single-member replica set so we could use oplog tailing on the Meteor side). The jobs check data in the database and send emails to specific users. I did see some error logs on my server like this one, but I'm not sure how relevant it is:

Exception in setInterval callback: MongoError: Cache Reader No keys found for HMAC that is valid for time: { ts: Timestamp(1590604213, 1) } with id: 6797469174001041409
    at Connection.<anonymous> (/var/www/application/programs/server/npm/node_modules/meteor/npm-mongo/node_modules/mongodb/lib/core/connection/pool.js:450:61)

vparpoil commented 4 years ago

This happened again. Here is the log from the time when jobs_dominator_3 collection stopped updating:

Exception in setTimeout callback: Error: read ECONNRESET
    at TCP.onStreamRead (internal/stream_base_commons.js:205:27)
Exception in setInterval callback: MongoError: Cache Reader No keys found for HMAC that is valid for time: { ts: Timestamp(1593369058, 1) } with id: 6797469174001041409
    at Connection.<anonymous> (/var/www/application/programs/server/npm/node_modules/meteor/npm-mongo/node_modules/mongodb/lib/core/connection/pool.js:450:61)

msavin commented 4 years ago

Looks like a database issue. Are you running it with only one server?

vparpoil commented 4 years ago

Yes, only one server. I updated to v4 and I have the same issue. Any hint on where to look in your code to add logging in order to debug this ?

msavin commented 4 years ago

I think it just needs a try / catch statement around the Dominator query, that way the code will keep looping even if the database throws an error. I'll try to get to it soon!
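
For illustration, something along these lines (a rough sketch only: the interval and the exact helper calls are assumptions based on the dominator functions mentioned later in this thread, not the package's actual source):

Meteor.setInterval(function () {
    try {
        dominator.getActive();    // the query that currently throws during a Mongo hiccup
        dominator.setAsActive();  // refresh lastPing in jobs_dominator_3
    } catch (error) {
        // Swallow the transient database error so the loop keeps running;
        // the next tick will simply try again.
        console.warn("SteveJobs: dominator ping failed, will retry", error);
    }
}, 5 * 1000);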

vparpoil commented 4 years ago

I think I got it for my setup: the mongo unavailability arises when running backups of the database. I added some try/catch blocks to the dominator queries (find and remove); I will let you know if they fix the issue and propose the update.

msavin commented 4 years ago

Yep the error probably halts dominator. How’s it working for ya?

vparpoil commented 4 years ago

Thanks for following up 👍 I tried to add a few try/catch blocks in the code, but I couldn't make the jobs run again as expected (errors are now caught, and I don't see any other directly related uncaught errors in the logs...). Here are the functions where I added a try/catch and where I saw errors being caught:

dominator.getActive
queue.prototype.run
dominator.setAsActive

I still have the following uncaught errors, but I'm not sure how they are related to the current issue:

  '0': 'Exception in setInterval callback: MongoError: Cache Reader No keys found for HMAC that is valid for time: { ts: Timestamp(1594405818, 1) } with id: 6797469174001041409\n' +
    '    at Connection.<anonymous> (/var/www/app/programs/server/npm/node_modules/meteor/npm-mongo/node_modules/mongodb/lib/core/connection/pool.js:450:61)\n' +
    '    at Connection.emit (events.js:311:20)\n' +
    '    at Connection.EventEmitter.emit (domain.js:482:12)\n' +
    '    at processMessage (/var/www/app/programs/server/npm/node_modules/meteor/npm-mongo/node_modules/mongodb/lib/core/connection/connection.js:384:10)\n' +
    '    at Socket.<anonymous> (/var/www/app/programs/server/npm/node_modules/meteor/npm-mongo/node_modules/mongodb/lib/core/connection/connection.js:553:15)\n' +
    '    at Socket.emit (events.js:311:20)\n' +
    '    at Socket.EventEmitter.emit (domain.js:482:12)\n' +
    '    at addChunk (_stream_readable.js:294:12)\n' +
    '    at readableAddChunk (_stream_readable.js:275:11)\n' +
    '    at Socket.Readable.push (_stream_readable.js:209:10)\n' +
    '    at TCP.onStreamRead (internal/stream_base_commons.js:186:23)'

and this one, which seems to be related to zuuk:stale-session:

'0': 'Exception in setInterval callback: MongoError: Cache Reader No keys found for HMAC that is valid for time: { ts: Timestamp(1594405818, 1) } with id: 6797469174001041409\n' +
    '    at Connection.<anonymous> (/var/www/app/programs/server/npm/node_modules/meteor/npm-mongo/node_modules/mongodb/lib/core/connection/pool.js:450:61)\n' +
    '    at Connection.emit (events.js:311:20)\n' +
    '    at Connection.EventEmitter.emit (domain.js:482:12)\n' +
    '    at processMessage (/var/www/app/programs/server/npm/node_modules/meteor/npm-mongo/node_modules/mongodb/lib/core/connection/connection.js:384:10)\n' +
    '    at Socket.<anonymous> (/var/www/app/programs/server/npm/node_modules/meteor/npm-mongo/node_modules/mongodb/lib/core/connection/connection.js:553:15)\n' +
    '    at Socket.emit (events.js:311:20)\n' +
    '    at Socket.EventEmitter.emit (domain.js:482:12)\n' +
    '    at addChunk (_stream_readable.js:294:12)\n' +
    '    at readableAddChunk (_stream_readable.js:275:11)\n' +
    '    at Socket.Readable.push (_stream_readable.js:209:10)\n' +
    '    at TCP.onStreamRead (internal/stream_base_commons.js:186:23)\n' +
    ' => awaited here:\n' +
    '    at Function.Promise.await (/var/www/app/programs/server/npm/node_modules/meteor/promise/node_modules/meteor-promise/promise_server.js:56:12)\n' +
    '    at packages/mongo/mongo_driver.js:1049:14\n' +
    '    at /var/www/app/programs/server/npm/node_modules/meteor/promise/node_modules/meteor-promise/fiber_pool.js:43:40\n' +
    ' => awaited here:\n' +
    '    at Promise.await (/var/www/app/programs/server/npm/node_modules/meteor/promise/node_modules/meteor-promise/promise_server.js:60:12)\n' +
    '    at SynchronousCursor._nextObject (packages/mongo/mongo_driver.js:1098:38)\n' +
    '    at SynchronousCursor.forEach (packages/mongo/mongo_driver.js:1112:22)\n' +
    '    at SynchronousCursor.map (packages/mongo/mongo_driver.js:1122:10)\n' +
    '    at SynchronousCursor.fetch (packages/mongo/mongo_driver.js:1146:17)\n' +
    '    at Cursor.<computed> (packages/mongo/mongo_driver.js:894:44)\n' +
    '    at Cursor.cursorProto.<computed> (packages/lmachens_kadira.js:3353:32)\n' +
    '    at Cursor.kadira_Cursor_fetch [as fetch] (packages/lmachens_kadira.js:3747:32)\n' +
    '    at Object.<anonymous> (packages/matb33_collection-hooks.js:484:79)\n' +
    '    at Object.collection.<computed> [as update] (packages/matb33_collection-hooks.js:150:21)\n' +
    '    at Collection.update (packages/mongo/collection.js:588:31)\n' +
    '    at Collection.Mongo.Collection.<computed> [as update] (packages/aldeed:collection2/collection2.js:214:19)\n' +
    '    at packages/zuuk_stale-session.js:55:18\n' +
    '    at Meteor.EnvironmentVariable.EVp.withValue (packages/meteor.js:1234:12)\n' +
    '    at packages/meteor.js:550:25\n' +
    '    at runWithEnvironment (packages/meteor.js:1286:24)'

Once it stopped, I tried to run a job manually again (with a server-side method I call that runs Jobs.execute(id)) and got an error like Jobs: Unable to execute job - queue is busy: myjob/r2ecan5aACartW9Kz.
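
A minimal sketch of such a method (the method name here is hypothetical; Jobs.execute is the package call referenced above, assuming the package's Jobs export):

import { Meteor } from 'meteor/meteor';
import { check } from 'meteor/check';
import { Jobs } from 'meteor/msavin:sjobs';

// Hypothetical server-side method for manually re-running a job by id.
Meteor.methods({
    'admin.rerunJob'(jobId) {
        check(jobId, String);
        return Jobs.execute(jobId);
    }
});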

When the mongo connection is lost for some time (about one minute), the queue stops. I think it's not too difficult to reproduce with a local setup: run mongo on the side via MONGO_URL, then stop and restart it.

macrozone commented 3 years ago

@msavin We also observed this problem. The database seemed to have a hiccup and this broke the job queue. In our logs we find:

Exception in setInterval callback: MongoError: Pool was force destroyed
at Pool.destroy (/src/bundle/programs/server/npm/node_modules/meteor/npm-mongo/node_modules/mongodb/lib/core/connection/pool.js:679:21)
at Server.destroy (/src/bundle/programs/server/npm/node_modules/meteor/npm-mongo/node_modules/mongodb/lib/core/topologies/server.js:902:15)
at Timeout._onTimeout (/src/bundle/programs/server/npm/node_modules/meteor/npm-mongo/node_modules/mongodb/lib/core/topologies/replset.js:357:26)
at listOnTimeout (internal/timers.js:549:17)
at processTimers (internal/timers.js:492:7)
=> awaited here:
at Function.Promise.await (/src/bundle/programs/server/npm/node_modules/meteor/promise/node_modules/meteor-promise/promise_server.js:56:12)
at packages/mongo/mongo_driver.js:1073:14
at /src/bundle/programs/server/npm/node_modules/meteor/promise/node_modules/meteor-promise/fiber_pool.js:43:40
=> awaited here:
at Promise.await (/src/bundle/programs/server/npm/node_modules/meteor/promise/node_modules/meteor-promise/promise_server.js:60:12)
at SynchronousCursor._nextObject (packages/mongo/mongo_driver.js:1122:38)
at SynchronousCursor.forEach (packages/mongo/mongo_driver.js:1136:22)
at SynchronousCursor.map (packages/mongo/mongo_driver.js:1146:10)
at SynchronousCursor.fetch (packages/mongo/mongo_driver.js:1170:17)
at Cursor.<computed> [as fetch] (packages/mongo/mongo_driver.js:918:44)
...
at queue.grabDoc (packages/msavin:sjobs/server/imports/operator/queue/index.js:65:36)
at queue.run (packages/msavin:sjobs/server/imports/operator/queue/index.js:100:20)
at queue.trigger (packages/msavin:sjobs/server/imports/operator/queue/index.js:55:9)
at Meteor.EnvironmentVariable.EVp.withValue (packages/meteor.js:1234:12)

Can we catch this gracefully?

macrozone commented 3 years ago

@vparpoil did you manage to solve it? Any idea you can share? Or do you have a workaround?

macrozone commented 3 years ago

@msavin do you have any idea? It's unfortunately very urgent and I have to come up with a solution in the next few days.

macrozone commented 3 years ago

I think I found the problem:

Just catching the error in trigger is not enough; you also have to set self.available = true; afterwards.

I opened a PR: https://github.com/msavin/SteveJobs/pull/95

Edit: I initially set available = true in a finally block, but that would reset it unconditionally. We observed jobs running twice, so I think that's why.
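
Roughly, the shape of the fix (a simplified sketch, not the actual PR diff; queue is the package's internal constructor from server/imports/operator/queue/index.js seen in the stack trace above, and the availability guard at the top is assumed):

queue.prototype.trigger = function () {
    var self = this;

    // If another run is already in flight, skip this tick.
    if (!self.available) return;
    self.available = false;

    try {
        self.run();
    } catch (error) {
        // Without resetting the flag here, a thrown MongoError leaves
        // `available` stuck at false and the queue never triggers again.
        self.available = true;
        console.error("SteveJobs: queue.run threw", error);
    }
    // Deliberately not a finally block: resetting `available` unconditionally
    // caused jobs to run twice, presumably because the success path already
    // restores it elsewhere.
};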

By the way: we run jobs only on a dedicated instance (it's marked with an env var and is not part of the load balancer). I think distributing potentially CPU-bound work across web application instances is often not a good idea, as you might block user requests. It's better to run jobs on dedicated workers.
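
As a rough illustration of that setup (JOBS_WORKER and sendDailyEmails are made-up names, and whether gating registration alone is enough depends on how the package elects its active server):

import { Meteor } from 'meteor/meteor';
import { Jobs } from 'meteor/msavin:sjobs';

if (Meteor.isServer && process.env.JOBS_WORKER === 'true') {
    // Only the dedicated worker instance registers (and therefore runs) jobs;
    // the web instances behind the load balancer skip this block entirely.
    Jobs.register({
        sendDailyEmails: function () {
            // ... query users and send the emails ...
            this.success();
        }
    });
}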