vsivsi / meteor-job-collection

A persistent and reactive job queue for Meteor, supporting distributed workers that can run anywhere.
https://atmospherejs.com/vsivsi/job-collection

Recover from server crash #233

Closed fadomire closed 7 years ago

fadomire commented 7 years ago

Hello, I'm playing with meteor-job-collection, which looks awesome.

It seems I need to find a way to handle cases where the server crashes (currently everything runs on Meteor).

My jobs are saved with the "repeat" option. Everything goes smoothly, but after a crash the "processJobs" method does not pick up the stored jobs, which presumably failed in some way during the crash.

Here is an example of a job that is not picked up:

{
    "_id": "zjZzi7Y6w55YD3uke",
    "runId": "eqGoKT9no4KoxCwgH",
    "type": "sendAlert",
    "data": {
        "userId": "4X96PWAPHQdd7fTyW"
    },
    "created": ISODate("2017-05-17T17:51:06.365Z"),
    "priority": 0,
    "retries": 5,
    "repeatRetries": 6,
    "retryWait": 1000,
    "retried": 1,
    "retryBackoff": "constant",
    "retryUntil": ISODate("275760-09-13T00:00:00Z"),
    "repeats": 9007199254740988,
    "repeatWait": {
        "schedules": [{
            "m": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59]
        }],
        "exceptions": []
    },
    "repeated": 4,
    "repeatUntil": ISODate("275760-09-13T00:00:00Z"),
    "progress": {
        "completed": 0,
        "total": 1,
        "percent": 0
    },
    "depends": [],
    "resolved": [],
    "status": "running",
    "updated": ISODate("2017-05-17T17:52:12.279Z"),
    "log": [{
        "time": ISODate("2017-05-17T17:51:06.365Z"),
        "runId": null,
        "message": "Rerunning job",
        "level": "info"
    }, {
        "time": ISODate("2017-05-17T17:52:12.215Z"),
        "runId": null,
        "message": "Promoted to ready",
        "level": "info"
    }, {
        "time": ISODate("2017-05-17T17:52:12.282Z"),
        "runId": "eqGoKT9no4KoxCwgH",
        "message": "Job Running",
        "level": "info"
    }],
    "after": ISODate("2017-05-17T17:52:00.004Z")
}

Any idea why it does not rerun even though it is saved as repeating?

vsivsi commented 7 years ago

It's probably stuck as 'running'. You need to set up some way to detect such jobs and cause them to "autofail" so they can be retried.

See for example the workTimeout option on processJobs: https://github.com/vsivsi/meteor-job-collection#jq--jcprocessjobstype-options-worker---anywhere

And this discussion: https://github.com/vsivsi/meteor-job-collection/search?p=2&q=workTimeout&type=Issues&utf8=✓

fadomire commented 7 years ago

Thanks for the quick and useful reply!

I now understand better how to fix my issue, especially thanks to this discussion: https://github.com/vsivsi/meteor-job-collection/issues/86

It is our duty as devs to build what's necessary to kill zombies ;)
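
For readers landing here, a minimal sketch of the "zombie" detection idea discussed above: find job documents that are still marked 'running' but have not been updated for longer than some timeout. It uses only the `status`, `updated` and `_id` fields visible in the job document posted earlier. The helper name `findStaleRunningJobs`, the second sample job and the one-minute timeout are assumptions made for illustration and are not part of meteor-job-collection. In many cases the `workTimeout` option on processJobs mentioned above may make a manual check like this unnecessary.

```javascript
// Hypothetical helper, not part of the meteor-job-collection API.
// Returns the _id of every job that is still 'running' but whose last
// update is older than timeoutMs, i.e. a likely zombie after a crash.
function findStaleRunningJobs(jobs, nowMs, timeoutMs) {
  return jobs
    .filter((job) => job.status === "running")
    .filter((job) => nowMs - job.updated.getTime() > timeoutMs)
    .map((job) => job._id);
}

// Sample data: the first entry mirrors the stuck job from this issue,
// the second is an invented job that should not be reported.
const jobs = [
  {
    _id: "zjZzi7Y6w55YD3uke",
    status: "running",
    updated: new Date("2017-05-17T17:52:12.279Z"),
  },
  {
    _id: "hypotheticalWaitingJob",
    status: "waiting",
    updated: new Date("2017-05-17T17:52:12.279Z"),
  },
];

// Pretend the server restarted a few minutes after the crash.
const now = new Date("2017-05-17T18:00:00.000Z");

// Assumed timeout: one minute without an update counts as stale.
const staleIds = findStaleRunningJobs(jobs, now.getTime(), 60 * 1000);
console.log(staleIds);
```

On a real server the same condition would presumably be expressed as a query against the job collection (status 'running' and `updated` older than a cutoff), run at startup or on a timer, with each matching job then failed so that its retry settings can take over. How exactly to fail those jobs is covered in the linked discussion and is not shown here.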