vsivsi / meteor-job-collection

A persistent and reactive job queue for Meteor, supporting distributed workers that can run anywhere.
https://atmospherejs.com/vsivsi/job-collection

Let running jobs fail on process shutdown #262

Closed SimonSimCity closed 6 years ago

SimonSimCity commented 6 years ago

Today, I spent some time researching how the problem of zombie jobs could be solved. I read through #82, #86 and #233, but they were mostly about detecting zombie jobs. What about limiting the chance of a job turning into a zombie job in the first place?

I don't know how feasible this is, so here is my idea.

If you know the list of jobs currently running on your server/client, wouldn't it be possible to set these to failed using the beforeunload (or maybe also the error) event when running in the browser, or the Node.js uncaughtException event when running on the server?

These jobs could be marked as failed and, if they are allowed to, rerun when the server starts up again or another job worker becomes idle. This way, we could prevent these jobs from turning into zombie jobs.

The library should then maybe also distinguish between the different failure types. So far I see a difference between a job that crashed with an exception, one that failed via an internal call to job.fail(), and one that was simply unlucky enough to be running while another part of the codebase threw an uncaught exception.
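
For illustration, one way to approximate that distinction today would be to store a structured error when calling job.fail(), so the cause is recorded with the failure. This is only a rough sketch under assumptions: jc stands for your JobCollection instance, 'someJobType' and doTheWork() are placeholders, the kind field is arbitrary application data rather than anything the library interprets, and it assumes job.fail() accepts an object as its error value.

// A worker that tags exception crashes differently from other failures.
const q = jc.processJobs('someJobType', {}, (job, cb) => {
  try {
    doTheWork(job.data);   // placeholder for the actual worker logic
    job.done();
  } catch (err) {
    // The worker code itself threw: tag the failure as an exception crash.
    job.fail({ kind: 'exception', message: err.message });
  }
  cb();                    // always call cb() so the queue keeps polling
});

// ...while a shutdown handler could tag its failures differently, e.g.:
//   job.fail({ kind: 'shutdown', message: 'Worker process is shutting down' });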

I found and extended a snippet from a Stack Overflow question:

function exitHandler(options, err) {
  if (options.cleanup) console.log('clean');
  if (err) console.log(err.stack);
  if (options.exit) process.exit();
}

// do something when app is closing
process.on('exit', exitHandler.bind(null, { cleanup: true }));

// catches ctrl+c event
process.on('SIGINT', exitHandler.bind(null, { exit: true }));

process.on('SIGTERM', exitHandler.bind(null, { exit: true }));

// catches "kill pid" (for example: nodemon restart)
process.on('SIGUSR1', exitHandler.bind(null, { exit: true }));
process.on('SIGUSR2', exitHandler.bind(null, { exit: true }));

// catches uncaught exceptions
process.on('uncaughtException', exitHandler.bind(null, { exit: true }));

Source: https://stackoverflow.com/questions/14031763/doing-a-cleanup-action-just-before-node-js-exits#answers

vsivsi commented 6 years ago

Hi, if I’m understanding you correctly, there’s code showing how to do this in one of the sample apps:

https://github.com/vsivsi/meteor-job-collection-playground-worker/blob/master/work.coffee#L120
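
For reference, the pattern in that worker boils down to roughly the following JavaScript sketch. The job type and concurrency are placeholders, jc stands for the JobCollection instance, and the exact q.shutdown() options and level semantics should be checked against the linked work.coffee and the README.

const q = jc.processJobs('someJobType', { concurrency: 4 }, (job, cb) => {
  // ... do the actual work ...
  job.done();
  cb();
});

function shutdown(signal) {
  console.warn(`Caught ${signal}, shutting down the job queue...`);
  // A 'hard' shutdown stops the queue without waiting for running jobs
  // to finish; see the JobQueue shutdown() docs for the available levels.
  q.shutdown({ level: 'hard' }, () => process.exit(0));
}

process.on('SIGINT', () => shutdown('SIGINT'));
process.on('SIGTERM', () => shutdown('SIGTERM'));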

SimonSimCity commented 6 years ago

🎉 That looks quite promising. I was just wondering because the issues I referenced were not really coming to a solution, but this seems like a very good way to get close to avoiding zombie jobs.

Would it be an idea to include this in the README.md file? It tells you how to get started, and I think this is an important thing that is easy to forget about.

SimonSimCity commented 6 years ago

But there's also one piece of bad news here: none of the methods I've found allow making the change within the same JS cycle. They either use methods or call Meteor.setTimeout(), which again defers the work to a different cycle. I now did the following, and it seems to work as I expect it to:

const cleanup = Meteor.bindEnvironment(() => {
  // StripesJobs is this app's JobCollection instance.
  // Copied from the DDP method shutdownJobServer, but without setInterval().
  const cursor = StripesJobs.find({ status: 'running' }, { transform: null });

  const failedJobs = cursor.count();
  if (failedJobs !== 0) {
    console.warn(`Failing ${failedJobs} jobs on queue stop.`);
  }

  cursor.forEach(d => StripesJobs._DDPMethod_jobFail(d._id, d.runId, { message: 'Running at Job Server shutdown.' }));

  if (StripesJobs.logStream) {
    // Shutting down closes the logStream!
    StripesJobs.logStream.end();
    StripesJobs.logStream = null;
  }
});

// Runs when this app is terminated by process.exit(), which with the handlers below includes SIGINT, SIGTERM and SIGHUP.
process.on('exit', () => cleanup());

function exitHandler() {
  process.exit();
}

// catches ctrl+c event
process.on('SIGINT', exitHandler);

// catches SIGTERM, as sent by a POSIX OS
process.on('SIGTERM', exitHandler);

// catches the parent process closing (e.g. the console window)
process.on('SIGHUP', exitHandler);

Note: be aware that any handler registered after these process event handlers won't be executed, because I kill the process. That is why you should include this in your own project at a point where you are certain that all plugins which might also use these events have already been loaded. For more information, please read https://nodejs.org/api/process.html#process_signal_events
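
A tiny standalone illustration of that ordering pitfall:

process.on('SIGINT', () => {
  console.log('first handler: cleaning up');
  process.exit(); // runs the 'exit' listeners, then terminates the process
});

// Registered later, so it would only run after the handler above,
// which never returns control because it calls process.exit().
process.on('SIGINT', () => {
  console.log('second handler: never reached');
});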

I personally also like to include the uncaughtException event. Even though it's not strictly necessary, because it also triggers the exit event, it gives you the chance to do something with the exception (e.g. color it if your console supports it). But once you've registered it, you have to kill the process yourself. For this, you'd just add a listener and change exitHandler() as follows:

function exitHandler(err) {
  if (err) console.log(err.stack);
  process.exit();
}

// catches uncaught exceptions
process.on('uncaughtException', exitHandler);

SimonSimCity commented 6 years ago

I personally consider this issue solved, but I still see a need to include this information, or at least a link to it, in the README.md file, since this is something every developer using this library should be made aware of.

This by no means eliminates the problem of zombie jobs, but it reduces it to a very small remainder. This method will not stop you from creating zombie jobs when you:

vsivsi commented 6 years ago

Right, but don’t forget that the “zombie job” problem is also addressed by using the workTimeout option. That will ensure jobs assigned to a worker process that dies will be failed/retried and not left in a “zombie” running state.
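
Roughly, that contract looks like this on the worker side. The job type, payload field and step function below are made up for illustration; workTimeout and job.progress() are the documented option and method.

// Promise to report progress at least once per minute. If the worker dies
// and the updates stop, the server can fail the job once workTimeout has
// elapsed and, if retries remain, make it available to other workers again.
const q = jc.processJobs('resizeImage', { workTimeout: 60 * 1000 }, (job, cb) => {
  const steps = job.data.steps || 10;   // hypothetical payload field
  for (let i = 0; i < steps; i++) {
    doOneStep(job.data, i);             // hypothetical unit of work
    job.progress(i + 1, steps);         // each update restarts the workTimeout clock
  }
  job.done();
  cb();                                 // always call cb() so the queue keeps polling
});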

vsivsi commented 6 years ago

If you would like to add some text to the README discussing how to cleanly handle signals and exceptions in worker processes, please submit a PR for that. I'd be happy to review and merge it! Thanks.

SimonSimCity commented 6 years ago

But workTimeout can, in my eyes, only guess at zombies. Preventing zombie jobs from arising in the first place, as my code does, is the right approach, I think. Only for the cases I've noted here, which this method cannot cover, would I use the workTimeout option, and that is just a prediction, because you never know for sure.

I'll make a PR with a summary of why it occurs and ideas on how to prevent it.

vsivsi commented 6 years ago

In my eyes, workTimeout is a contract between the worker and the server/back-end. It says: "Once this job is running, you will hear progress/status updates from me at least every N seconds".

When more than N seconds have elapsed without an update, the job is by definition a zombie. It is so because that is what the contract says, and the worker needs to be written to hold up its end of the deal. Anyway, it's not a "guess"; it is true by definition.

You are absolutely correct that using workTimeout is not a proper substitute for rigorous signal/exception handling in the worker process. But such handling is likewise insufficient on its own to prevent zombie jobs on the server. Computers lose power, hardware fails, network connectivity drops, stuff happens.

So this isn't a matter of either/or, or which is better. Both are necessary to robustly ensure that all jobs are eventually resolved.

SimonSimCity commented 6 years ago

Is this a good description? #263 We can discuss the rest in this PR.