vsivsi / meteor-job-collection

A persistent and reactive job queue for Meteor, supporting distributed workers that can run anywhere.
https://atmospherejs.com/vsivsi/job-collection
Other
388 stars 68 forks source link

Strange behavior with repeating jobs of the same type #235

Open tcastelli opened 7 years ago

tcastelli commented 7 years ago

Hello @vsivsi , I am currently finding a strange behaviour and maybe you can shed some light on the situation:

These are all the configured settings: Meteor 1.4.3.2 job-collection 1.5.1 server: Galaxy

JobCollection promote: every 30 min

Queue processJobs concurrency: 1, payload: 2, pollInterval: 1 minute, prefetch: 2, callbackStrict: true, workTimeout: 1 hour

Jobs of type TT priority : 'normal' repeat: schedule value "every 24 hours" retry : wait value 15 minutes, backoff: "constant"

I create 3 Jobs of the same type "TT" and what I am expecting to see is that when they are created, they are set to run for the first time on the following day at 00:00 (that works correctly), and then, that they will be promoted to ready as late as 00:30 and than they'll be fetched by the queue in groups of 2 starting 1 minute later. In this case since all 3 should become ready at the same time (no dependencies) the prefetch would make the third one also be fetched while processing the first 2. After being processed and completed next job should be created and scheduled for next day at 00:00. Intead, this is what I see: jc

I don't get why:

Update: Ok, so I have checked and it seems that whenever the job changes to ready, the after value gets updated, so the problem is why the "after" date is set to an incorrect value whenever the first job is done. I assume this happened since two of the jobs("mubJZ.....", and "rWNED......" were processed before the 24 at 00:00:00 and that's when they were supposed to run again like "eX6Q......")

Update2: I have seen this problem #217 and the workaround seems to be implemented in job save,

schedule = @later?.schedule(doc.repeatWait)
 +      unless schedule and next = schedule.next(2, schedule.prev(1, doc.after))[1]

but if you look at job.done function later is also used but without that fix

          next = @later?.schedule(doc.repeatWait).next(2)
          if next and next.length > 0
            d = new Date(next[0])
            if (d - time > 500) or (next.length > 1)
              if d - time <= 500
                d = new Date(next[1])
              else
              wait = d - time
              if doc.repeatUntil - wait >= time
                jobId = @_rerun_job doc, doc.repeats - 1, wait

~~Not sure if that has anything to do with the problem, did you check that part too @mitar ? (the 500 value seems a little weird to me)~~

Update3: I have verified that the problem comes from the previous code at job.done. If the server/db is busy and you set it like me to "every 24 hours", it is possible that the time of next[0] gets translated simply into a new Date() because of a later.js bug (if it gets done during the same 00 hour in my case) and this value can have a difference of more than 500 ms with the time when the variable time is created (db is busy or slow). In that case you stay with the value of next[0] which is the wrong date :(

I think the job.save workaround can be used also here to avoid this kind of problem.

Thanks for any update on this matter.

vsivsi commented 7 years ago

Hi, just so you know, I'm about to drop off the internet for almost a month, so I won't be able to look into this until around June 20th. Sorry for any inconvenience.

mitar commented 7 years ago

@tcastelli: Care to make a PR with a followup to #218 and with a test for what are you reporting here?

tcastelli commented 7 years ago

I can try, but not really into coffeescript so a revision must be done by someone who is more fluent on that. I think by just copying and pasting your workaround would make it work for the other case too. For now, what we have done is just use "every 24hours at XX:XX" being xx:xx the time delimitation, making it harder for the error to happen (It should be done and finished within the same second)

mitar commented 7 years ago

CoffeeScript will not bite you. :-) Try copying as you are suggesting and see if it works.