vsivsi / meteor-job-collection

A persistent and reactive job queue for Meteor, supporting distributed workers that can run anywhere.
https://atmospherejs.com/vsivsi/job-collection

How to restart a running job when the app crashes (pick-up work before the crash)? #142

Closed a4xrbj1 closed 7 years ago

a4xrbj1 commented 8 years ago

Hi Vaughn,

I had a running job (marked with that status) when the app crashed due to a type error in the middle of executing this job. After (hopefully) fixing that error, I want the app to pick up that job from the list of waiting jobs and continue work as if the error hadn't happened. Is that possible? What steps are needed to achieve this?

I tried updating job.status to "ready" and restarted the app, but to no avail. This job has a couple of other jobs chained after it, so deleting all those jobs and then manually restarting that function wouldn't be a nice solution.

Andreas

PS: Happy New Year!

vsivsi commented 8 years ago

Hi, the normal way to do this would be to just job.fail() the job. If the job was configured to automatically retry, then it should reschedule itself to "ready" and run again when a worker is available. If it was not configured to retry, then the job will go to status "failed" and can be rerun using job.restart(). See: https://github.com/vsivsi/meteor-job-collection#jobrestartoptions-callback---anywhere

a4xrbj1 commented 8 years ago

Thank you Vaughn!

a4xrbj1 commented 8 years ago

Sorry, have to reopen this, I just can't accomplish it, even with your explanations so far. Here's my code:

    res = myJobs.findOne({ status: 'failed' });
    job = res.data;
    job.restart();

Fails with the error message: TypeError: Object [object Object] has no method 'restart'

I can't figure out what job is. I've tried to assign job to the result of the query but it failed with the same error as above. There are only two references to job.restart() in the documentation and it's not explained at all how to use it (whereas all the rest is well documented).

I've manually changed the status of the job to failed and thought that it would restart automatically, but even after working on it for more than 1.5 hours, it has never restarted. I've also tried setting the status to ready or waiting, with no effect.

I've restarted my whole program many times but again it doesn't restart this job automatically.

I've also tried the alternative of restartJobs with this code:

    jobId = this.bodyParams.jobId;
    jobIdArray = [];
    jobIdArray.push(jobId);
    myJobs.restartJobs(jobIdArray);

But then I get the error message: jobRestart failed

This is the whole job as of now:

    {
      "_id" : "rExpzgmvNhoc5eke8",
      "runId" : "CSnTNhBmDPui8hDq3",
      "type" : "oneToOneDecodeOnce",
      "data" : { "kit1" : "M052624", "kit2" : "M060932" },
      "status" : "failed",
      "updated" : ISODate("2016-04-26T05:15:40.952Z"),
      "created" : ISODate("2016-04-26T04:50:55.145Z"),
      "priority" : -10,
      "retries" : 9007199254740991,
      "retryWait" : 5000,
      "retried" : 1,
      "retryBackoff" : "exponential",
      "retryUntil" : ISODate("275760-09-13T00:00:00Z"),
      "repeats" : 0,
      "repeatWait" : 300000,
      "repeated" : 0,
      "repeatUntil" : ISODate("275760-09-13T00:00:00Z"),
      "after" : ISODate("2016-04-26T04:54:25.991Z"),
      "progress" : { "completed" : 0, "total" : 1, "percent" : 0 },
      "depends" : [ ],
      "resolved" : [ "RYRFTYEXdfBAhQ6uv" ],
      "log" : [
        { "time" : ISODate("2016-04-26T04:50:47.086Z"), "runId" : null, "level" : "info", "message" : "Constructed" },
        { "time" : ISODate("2016-04-26T04:50:55.145Z"), "runId" : null, "message" : "Job Submitted", "level" : "info" },
        { "time" : ISODate("2016-04-26T04:54:25.984Z"), "runId" : null, "message" : "Dependency resolved", "level" : "info" },
        { "time" : ISODate("2016-04-26T04:54:25.991Z"), "runId" : null, "message" : "Promoted to ready", "level" : "info" },
        { "time" : ISODate("2016-04-26T04:54:30.979Z"), "runId" : "CSnTNhBmDPui8hDq3", "message" : "Job Running", "level" : "info" },
        { "time" : ISODate("2016-04-26T05:15:40.952Z"), "runId" : null, "message" : "Promoted to ready", "level" : "info" }
      ],
      "workTimeout" : 3600000,
      "expiresAfter" : ISODate("2016-04-26T05:54:30.978Z")
    }

vsivsi commented 8 years ago

Hi, this is really basic stuff. A document from a MongoDB lookup can't (automatically) be a full JavaScript object with methods, etc., which is what a Job object needs to be.

So you need to do something like:

    res = myJobs.findOne({ status: 'failed' });
    job = new Job(myJobs, res);
    job.restart();
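
To illustrate why the raw lookup result has no `restart` method, here is a stand-alone sketch that runs without Meteor. `SketchJob`, the in-memory `docs` array, and the `findOne` helper are hypothetical stand-ins, not the library's `Job` class or a real collection, and the status values are only illustrative. The point is that a plain document (and its `data` field) carries data only, while methods come from wrapping the document in a class:

```javascript
// Hypothetical stand-in for the library's Job class; NOT the real API.
class SketchJob {
  constructor(doc) {
    this.doc = doc; // keep the plain document as data
  }
  // Toy restart: only a failed job can be put back in the queue.
  restart() {
    if (this.doc.status !== 'failed') return false;
    this.doc.status = 'waiting'; // illustrative status value
    return true;
  }
}

// Hypothetical in-memory "collection" that returns plain documents.
const docs = [{ _id: 'a1', type: 'demo', status: 'failed', data: { kit1: 'x' } }];
const findOne = (query) => docs.find((d) => d.status === query.status);

const res = findOne({ status: 'failed' });

// Neither the document nor its data field has a restart method,
// which is what produces "has no method 'restart'".
const hasRestartOnDoc = typeof res.restart === 'function';
const hasRestartOnData = typeof res.data.restart === 'function';

// Wrapping the document is what provides the method.
const job = new SketchJob(res);
const restarted = job.restart();
```

With the real library, the wrapping step is the `new Job(myJobs, res)` call shown above.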

You can set up any Meteor collection to do this automatically using the transform parameter on the collection:

https://github.com/vsivsi/meteor-job-collection-playground/blob/master/play.coffee#L9
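
The idea behind a transform can be sketched without Meteor: the collection passes every document it returns through a function, so callers receive wrapped objects directly. In the sketch below, `makeCollection` and `SketchJob` are hypothetical stand-ins written for illustration; they are not Meteor's collection or the library's `Job` class, and the linked playground code remains the reference for the real setup.

```javascript
// Hypothetical wrapper class standing in for the library's Job; NOT the real API.
class SketchJob {
  constructor(collection, doc) {
    this.collection = collection;
    this.doc = doc;
  }
  type() {
    return this.doc.type;
  }
}

// Hypothetical in-memory collection that applies an optional transform in findOne.
function makeCollection(docs, options = {}) {
  return {
    findOne(query) {
      // Match documents whose fields equal every key in the query.
      const doc = docs.find((d) => Object.keys(query).every((k) => d[k] === query[k]));
      if (doc === undefined) return undefined;
      return options.transform ? options.transform(doc) : doc;
    },
  };
}

const rawDocs = [{ _id: 'a1', type: 'demo', status: 'failed' }];

// Without a transform, lookups return plain documents.
const plain = makeCollection(rawDocs);

// With a transform, every lookup result is wrapped automatically.
const jobs = makeCollection(rawDocs, {
  transform: (doc) => new SketchJob(jobs, doc),
});

const plainResult = plain.findOne({ status: 'failed' });
const jobResult = jobs.findOne({ status: 'failed' });
```

Here `plainResult` is the raw document, while `jobResult` is already a `SketchJob` with its methods available.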

a4xrbj1 commented 8 years ago

Thanks Vaughn. Sorry for overlooking the basic stuff, it was always unclear to me what exactly a job object needs to be. Maybe you can add this explanation in your documentation for other newbies like me?

The job now has the status ready, but it still doesn't start. What else might I be doing wrong?

vsivsi commented 8 years ago

This code appears in the "client sample" code in the README:

    // Any job document from myJobs can be turned into a Job object
    job = new Job(myJobs, myJobs.findOne({}));

Right at the top of the section on the Job object API there is a whole subsection on this: https://github.com/vsivsi/meteor-job-collection#j--new-jobjc-jobdoc---anywhere

I'm always open to documentation PRs that would have made things clearer for you.

a4xrbj1 commented 8 years ago

Can I suggest replacing the current, rather incomplete example with the more concrete code here:

    res = myJobs.findOne({ status: 'failed' });
    job = new Job(myJobs, res);
    job.restart();

I know you mention it in the existing sample (coming from a database source), but the more complete and less abstract the code examples are, the easier it is for newbies like me to just copy and paste and get the right code going.

Could you also comment as to why the job is still not executed (see my edited comment above)?

vsivsi commented 8 years ago

If it's ready but doesn't run, then there's probably some issue with the worker. It's impossible for me to say without seeing code. To continue helping with this, you should package up your (minimal) app into a repo and share it with me so I can run it myself. Working through things like this going back and forth on a GitHub issue is tedious (like playing "20 questions") and unproductive.