vsivsi / meteor-job-collection

A persistent and reactive job queue for Meteor, supporting distributed workers that can run anywhere.
https://atmospherejs.com/vsivsi/job-collection
Other
385 stars 68 forks source link

Retries can go below 0 and make validation fail #183

Closed huttarichard closed 8 years ago

huttarichard commented 8 years ago

First of all, great job with collection, awesome piece of sw!! I was working with job collection for a while, but I still unpredictable run into one issue:

Error: Match error: Failed Match.Where validation in field [0].retried
    at exports.check (packages/check/match.js:34:1)
    at JobCollectionBase._DDPMethod_getJob (packages/vsivsi_job-collection/src/shared.coffee:395:7)
    at JobCollection.<anonymous> (packages/vsivsi_job-collection/src/server.coffee:147:22)
    at packages/vsivsi_job-collection/src/server.coffee:85:40
    at methodCall (packages/vsivsi_job-collection/job/src/job_class.coffee:19:18)
    at Job.refresh (packages/vsivsi_job-collection/job/src/job_class.coffee:790:14)
    at Context.<anonymous> (imports/shared/_jobs.tests.js:178:13)
    at run (packages/practicalmeteor:mocha/meteor/src/setup/SetupRunnables.js:113:18)

Did not found the source (no coverage no specs) and partialy I was convinced that is probably my mistake. But now I think that I found a bug in shared.coffee

so stack is pointing to: https://github.com/vsivsi/meteor-job-collection/blob/master/src/shared.coffee#L395

Here is what I did:


class Job2 extends Job {

  // reset method written by me
  reset(hard = false) {
    if(_.contains(['waiting', 'ready'], this._doc.status))
      return false;
    var isRestarted = false;
    if (_.contains(['running'], this._doc.status) && hard)
      this.cancel();       
    if(_.contains(JobCollection.prototype.jobStatusRestartable, this._doc.status))
      isRestarted = this.restart();     
    return isRestarted;
  }
}

// running in context of should.js and mocha
let collection = new JobCollection('Test'); 
let job = new Job2(collection, 'test/job/running', {ok: 'testing'});
job.save();
job.reset().should.not.be.ok(); // .should.not.be.ok() -> same as === false
job.cancel();
job.refresh();
job.reset().should.be.ok();
job.refresh(); // <-- here start problem in a stack marked as imports/shared/_jobs.tests.js:178:13

I put scrub function in job collection which shows me that retries value is -1 which go againts gte 0.

vsivsi:job-collection@1.3.3 METEOR@1.3.2.4

vsivsi commented 8 years ago

Sorry for the delay, I'm looking into this now.

vsivsi commented 8 years ago

I have confirmed this as a bug:

    j = new Job(testColl, jobType, { foo: "bar" })
    j.save()
    j.refresh()
    j.cancel()
    j.refresh()
    j.restart()
    j.refresh()   # Check fails here.

I'll work on a fix over the weekend, shouldn't take too long.

huttarichard commented 8 years ago

@vsivsi sweet! I dig little bit around, just to help you out: https://github.com/vsivsi/meteor-job-collection/blob/master/src/shared.coffee#L728 here is problem, since if you restart job and job has 0 retried it will do an {$inc: -1}, which will make job retried -1 and on refresh action it will validate doc and throw match error.

here is test https://github.com/vsivsi/meteor-job-collection/pull/184

vsivsi commented 8 years ago

Great, thanks, I'll try to get this wrapped up today.

vsivsi commented 8 years ago

Hi, so I've worked up a solution, on the dev branch, but I'm not 100% satisfied with it.

The line that causes this issue is here: https://github.com/vsivsi/meteor-job-collection/blob/master/src/shared.coffee#L753

This bug is actually a little worse than it originally seemed. It's not just an "off-by-one" bug. If you pass, say 5 in options.retries, then unless job.retried > 5, the same failure you reported will occur.

The purpose of subtracting the new options.retries value (default 1) from job.retried is to keep job.retries + job.retried constant. This is important, because this information is used to calculate the new value of job.retries when a repeating job runs again. So, just removing L753 has the side-effect of increasing the number of retries on future repeating runs by the amount of options.retries (default 1).

Here are the possible options as I see them, with pros / cons:

  1. Remove the modification of job.retried at L753, and allow options.retries on job.restart() to have a value of zero (previously >= 1). This is what is currently implemented on the dev branch. Pros: Simple change, allow previous behavior for jobs cancelled before they run out of retries. Cons: Any non-zero value for options.retries (including the default) will cause the retry count of future runs of a repeating job to increase by options.retries. This is a behavior change that seems relatively harmless, but I don't love it.
  2. Change the job document model (and corresponding run-time checks) to permit the value of job.retried to be a negative integer. Pros: This is probably the simplest change, with the fewest side-effects. Cons: This is a subtle, but non-trivial change to the document model, which has been frozen for a long time, and so I don't want to do it lightly. Given the check that has long been in place (that you encountered) it is unlikely that any code out there will break by making this more permissive, but it introduces a set of possible states for job.retried that are semantically unclear (what does a negative value for retried actually mean? It means you retried, ultimately failed, and now are retrying some more... Edited: even I didn't get this right the first time... It means that the job was cancelled or failed, and is being restarted with more additional retries than had been used to this point. Not at all obvious.) On the other hand, that is just acknowledging the actual long time behavior, which has worked fine, so long as you never ran job.refresh().
  3. Simply remove the check from job.refresh(). Pro: Easy, Con: Super hacktastic. 😝
  4. Fix the whole issue by updating the job document model to add a repeatRetries attribute to explicitly capture the proper state of job.retries for the next repeat, without resorting to calculating it from the sum of job.retries and job.retried. Pro: Arguably the right way to fix this, how it should have been done in the first place. Cons: _By far_ the most disruptive. It involves a major change to the document model, and changes to the implementation of multiple methods and a fair amount of code. This requires more planning and testing effort to make sure nothing breaks. It is possibly a semver 2.0 level change, although perhaps not, if the method interfaces and behavior of the current attributes don't change.

I'm leaning towards 2) or 4), but I need to look into how much work option 4) will really require.

Thoughts?

huttarichard commented 8 years ago

I was actually really confused about the -options.retries (also reason why the I didn't make a pull request right away). For me (or others) there could be simple patch for now:

put scrub function into collection, this will prevent doc retried property to be an negative int.

scrub(doc) {
        if(doc.retried < 0) doc.retried = 0; 
        return doc;
}

and we can think about one of the options you wrote above. I believe that you know this source best, so its pointless for me to get deeper, because from 99% I will endup where you just did. So here is my suggestion.

4) sounds awesome, will require more work, but mostly more testing and thats the part which can be done from me side. I can write more specs, that is no problem. In a meantime you can implement option 2) which is 100% better then scrub function.

what does a negative value for retried actually mean?

I wouldn't be concerned about this, because chance that you will get in touch with retried negative value is really small :)

vsivsi commented 8 years ago

Alright, I'm going to switch the dev branch to option 2) because it's easy, while I continue to investigate 4). If necessary, I have no problem doing a release with option 2), it's basically the status quo without the check failure. Code-wise I'm convinced it's safe, I just don't like the semantics, but that 98% affects only me.

vsivsi commented 8 years ago

FYI, dev branch is now option 2), works fine.

vsivsi commented 8 years ago

Okay, 4) wasn't too bad. It's on the repeatRetries branch. Note that this also requires an update to the meteor-job sub-project in the job directory (which also now has a repeatRetries branch).

Please test with this branch and let me know how it goes. The new repeatRetries attribute of the job document is optional, so this is not a breaking change. Existing databases populated with old jobs should continue to work. And new job documents remain compatible with older servers and clients as well.

Not sure if you've done package development in Meteor before or not. To run this in your app, the following should work:

cd <yourAppDir>
mkdir packages
cd packages
git clone --recursive https://github.com/vsivsi/meteor-job-collection.git
cd meteor-job-collection
git pull
git checkout repeatRetries
cd job
git pull
git checkout repeatRetries
cd ../../..
meteor
huttarichard commented 8 years ago

@vsivsi awesome, really awesome. Cannot wait ... tomorrow I will write some specs about this (in my app). So going to let you know how it goes and if I find out something interesting. I really appreciate your effort. Thanks.

vsivsi commented 8 years ago

All of the changes proposed to go into 1.4.0 have now been merged into the dev branch (including repeatRetries). Please test against the dev branch now. I'll release as soon as I hear back that everything looks good on your end.

huttarichard commented 8 years ago

@vsivsi ok ok, Im on it...

huttarichard commented 8 years ago

@vsivsi I removed old version of job collection (+ scrub function), push new one. Run all specs and all passes (three specs just for this cancel / refresh issue)... Looks good for me...

vsivsi commented 8 years ago

Great! I'm just chasing a test failure in meteor-job that only occurs on node.js 0.10, (related to a different new feature) not sure how long that will take, but once I understand why node 0.10 is behaves differently it should be quick to fix the test. So, I should release within a few hours.

PS. When I wrote:

Please test against the dev branch now

By "now", I meant to now use the dev branch instead of the repeatRetries branch, not to command you to do it immediately! 😄

huttarichard commented 8 years ago

By "now", I meant to now use the dev branch instead of the repeatRetries branch, not to command you to do it immediately!

😄 😄 😄 ahh, doesn't matter anymore

vsivsi commented 8 years ago

Fix just published to Atmosphere as version 1.4.0. Thanks again for reporting this!

huttarichard commented 8 years ago

@vsivsi thank you man!