vsivsi / meteor-job-collection

A persistent and reactive job queue for Meteor, supporting distributed workers that can run anywhere.
https://atmospherejs.com/vsivsi/job-collection

Proper clock synchronization is required when running multiple Meteor servers/workers #260

Open satyavh opened 6 years ago

satyavh commented 6 years ago

I'm a long-time user of this package. But recently I ran into a problem where a single repeating job clones itself hundreds of times. This is the code:

    checkLinkedInExpirationJob = new Job(Jobs, 'checkLinkedInExpiration', {})
    checkLinkedInExpirationJob.repeat(
      schedule: Jobs.later.parse.text('at 4:05 am')
    ).retry({ retries: 0 }).save()

It correctly creates a single job to run at 4:05am.

Then it starts running, and it creates hundreds of copies of itself. Each copy runs, gets status 'completed', and none of them failed (I checked the logs).

Here's the abbreviated code of the worker:

@checkLinkedInExpirationWorker = Jobs.processJobs 'checkLinkedInExpiration', (job, callback) ->
  Log.info "checkLinkedInExpirationWorker starting"

  users = AppUser.find({'providers.provider': 'linkedin'})

  for u in users
    linkedIn = _.find u.providers, (p) -> p.provider is 'linkedin'

    # find the admins of the user's account, exclude if the user is the admin
    accountAdmins = AppUser.find(
      _id: $ne: u._id
      roles: 'admin'
      accountId: u.accountId
    ).fetch()

    expDate = moment(linkedIn.access_token.date_expiration, 'MM/DD/YYYY H:m:s').add(1, 'days')
    expiresInFourteenDays = moment(linkedIn.access_token.date_expiration, 'MM/DD/YYYY H:m:s').subtract(14, 'days')
    # expiresIntwoDays = moment(linkedIn.access_token.date_expiration, 'MM/DD/YYYY H:m:s').subtract(2, 'days')
    now = new Date()

    # if expired
    if expDate.isSame(now, 'day')
      Log.info "checkLinkedInExpirationWorker -> user #{u._id} LinkedIn is expired"
      if u.emails[0]?.address
        # send an email

      for admin in accountAdmins
        # send an email

      Meteor.defer ->
        Meteor.call 'linkedin.remove', (err, res) ->
            Log.info "checkLinkedInExpirationWorker -> removed LinkedIn" if res
      continue

    # if expires in 14 days
    if expiresInFourteenDays.isSame(now, 'day')
      # send an email

      for admin in accountAdmins
        # send an email

  job.done()
  callback()

With another job I have exactly the same issue. It just clones itself about 40-50 times. I checked the Jobs collection and the cloning seems random.

I'm literally out of ideas, starting to think about replacing this package :-(

Any idea what could result in jobs cloning themselves?

satyavh commented 6 years ago

BTW, this is the DB JSON of the job that cloned itself hundreds of times:

{ 
    "_id" : "o9QevPZhtzr5NH24a", 
    "runId" : null, 
    "type" : "checkLinkedInExpiration", 
    "data" : {

    }, 
    "created" : ISODate("2017-10-16T04:57:08.589+0000"), 
    "priority" : NumberInt(0), 
    "retries" : 9007199254740992.0, 
    "repeatRetries" : 9007199254740992.0, 
    "retryWait" : NumberInt(20000), 
    "retried" : NumberInt(0), 
    "retryBackoff" : "constant", 
    "retryUntil" : ISODate("275760-09-13T00:00:00.000+0000"), 
    "repeats" : 9007199254736496.0, 
    "repeatWait" : {
        "schedules" : [
            {
                "t" : [
                    NumberInt(14700)
                ]
            }
        ], 
        "exceptions" : [

        ]
    }, 
    "repeated" : NumberInt(4496), 
    "repeatUntil" : ISODate("275760-09-13T00:00:00.000+0000"), 
    "depends" : [

    ], 
    "resolved" : [

    ], 
    "progress" : {
        "completed" : NumberInt(0), 
        "total" : NumberInt(1), 
        "percent" : NumberInt(0)
    }, 
    "status" : "waiting", 
    "updated" : ISODate("2017-10-16T04:57:08.589+0000"), 
    "log" : [
        {
            "time" : ISODate("2017-10-16T04:57:08.589+0000"), 
            "runId" : null, 
            "message" : "Rerunning job", 
            "level" : "info"
        }
    ], 
    "after" : ISODate("2017-10-17T04:05:00.050+0000")
}
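
(For reference, the `t: [14700]` value under `repeatWait.schedules` is just the Later.js encoding of the schedule as seconds past midnight: 4 × 3600 + 5 × 60 = 14700, i.e. 4:05 am, so the schedule itself appears to have parsed as expected.)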

And this one clones itself tens of times:

{ 
    "_id" : "RS3ek6XngHcJZ4TR5", 
    "runId" : null, 
    "type" : "dailyReviewEmail", 
    "data" : {

    }, 
    "created" : ISODate("2017-10-16T00:16:18.591+0000"), 
    "priority" : NumberInt(0), 
    "retries" : 9007199254740992.0, 
    "repeatRetries" : 9007199254740992.0, 
    "retryWait" : NumberInt(20000), 
    "retried" : NumberInt(0), 
    "retryBackoff" : "constant", 
    "retryUntil" : ISODate("275760-09-13T00:00:00.000+0000"), 
    "repeats" : 9007199254736386.0, 
    "repeatWait" : {
        "schedules" : [
            {
                "t" : [
                    NumberInt(300)
                ]
            }
        ], 
        "exceptions" : [

        ]
    }, 
    "repeated" : NumberInt(4606), 
    "repeatUntil" : ISODate("275760-09-13T00:00:00.000+0000"), 
    "depends" : [

    ], 
    "resolved" : [

    ], 
    "progress" : {
        "completed" : NumberInt(0), 
        "total" : NumberInt(1), 
        "percent" : NumberInt(0)
    }, 
    "status" : "waiting", 
    "updated" : ISODate("2017-10-16T00:16:18.591+0000"), 
    "log" : [
        {
            "time" : ISODate("2017-10-16T00:16:18.591+0000"), 
            "runId" : null, 
            "message" : "Rerunning job", 
            "level" : "info"
        }
    ], 
    "after" : ISODate("2017-10-17T00:05:00.054+0000")
}

vsivsi commented 6 years ago

Hi, obviously the behavior you are describing is not normal, and is not how the package typically performs.

To get to the bottom of this, you should produce a simplified Meteor app that reproduces this issue with as little code as possible beyond what is needed to trigger it. Ideally that means a stripped-down app that does nothing but schedule and run such a job. If you can isolate this behavior from the rest of your application logic (and any unnecessary package dependencies), then you can commit that reproduction code to a repo on GitHub. If I can reproduce the issue in running code, I will be able to diagnose what is happening very quickly.

Trying to do it any other way will be a huge time sink for both of us.

In my experience, weird problems like this are often the result of interactions with packages that monkey-patch core Meteor APIs in unknown ways. But it’s also possible there is something in your code that I won’t be able to see without you sharing a whole running app with me.

satyavh commented 6 years ago

Got it, thanks for your answer.

I'm currently stripping down the job to see when it stops showing this behaviour. I kind of suspect Meteor.defer; however, the other job that also clones itself does not use it.

And I forgot to mention: I only have this issue with recurring jobs. One-off jobs are doing just fine.

If I find something reproducible I will post it here.

satyavh commented 6 years ago

@vsivsi question: I think I found the issue:

We have 2 servers running. One server runs the job correctly (meaning it runs it once); the other one creates clones. There seems to be a time discrepancy between the 2 servers:

The job is scheduled at 04:05am.

The job never runs at the scheduled time, and both servers pick up the job; one of them clones it.

So I checked the server times, they were out of sync. I synced them, so both servers now run at exactly the same time. And...it seems to solve the issue! Plus the job is running correctly on time.

But this all looks very fragile to me. It's quite easy to have servers running out of sync. For your reference / reproduction, the server time difference was as follows:

Server 1: Universal time: Wed 2017-10-18 17:26:43 UTC

Server 2: Universal time: Wed 2017-10-18 18:38:54 UTC

vsivsi commented 6 years ago

Hi, as I'm sure you can appreciate, clock synchronization is very important to virtually all distributed applications. Given that your servers were more than 1 hour off from one another, I'm frankly surprised that far more serious things didn't break than what you've described in this issue.

Most distributed databases, etc. don't work at all if server clocks are more than a few seconds apart. In this case it seems that you had a single MongoDB instance, but multiple Meteor servers running against it. It may seem "fragile" but job-collection instances only communicate via the database they are attached to, and they may "come-and-go" freely, without any complicated synchronization beyond using NTP to keep the server clocks within some fraction of a second of one another. Given that running jobs on a schedule is an inherently time-based activity, having accurate clock settings seems like a very basic and obvious prerequisite to having such a system run reliably.

So, I'm not really sure what I can do to make this less fragile. I suppose a check could be run at startup (or periodically) to compare the Meteor server time to the MongoDB server time, and if they differ by more than some delta (say a few seconds), throw a warning on the Meteor side. Even that is a bit difficult to measure accurately given network latency, etc. (precisely the problem that protocols like NTP address).
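For what it's worth, a rough sketch of such a startup check might look like the following. It assumes Meteor 1.0.4+'s `rawDatabase()`, the `localTime` field that mongod returns from the `isMaster` command, and an arbitrary 5-second tolerance; `Jobs` stands in for the job collection instance:

    # Rough sketch only -- not something job-collection provides today.
    # Compares this Meteor server's clock against the MongoDB server clock
    # at startup and warns if they drift apart.
    Meteor.startup ->
      rawDb = Jobs.rawDatabase()   # `Jobs` is assumed to be the JobCollection
      rawDb.command { isMaster: 1 }, Meteor.bindEnvironment (err, res) ->
        if err
          console.warn "Clock check failed: #{err}"
          return
        skewMs = Math.abs(Date.now() - res.localTime.getTime())
        # 5 seconds is an arbitrary tolerance, not a recommended value
        if skewMs > 5000
          console.warn "Meteor and MongoDB clocks differ by #{skewMs} ms; check NTP"

Note that this only catches Meteor-vs-MongoDB skew; two Meteor servers that are both offset from Mongo in the same direction would still pass, so keeping every machine on NTP remains the real fix.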

Anyway, thanks for reporting this. I'll think about it. At a minimum there should probably be a note in the documentation about supporting multiple servers and the importance of clock synchronization.

satyavh commented 6 years ago

I understand 😄

> it seems that you had a single MongoDB instance, but multiple Meteor servers running against it

Yes, that's right. We're using an mLab replica set for Mongo hosting, with 2 Meteor servers behind nginx.

What I could imagine is that the server that picks up the job adds an id or something to the job, so that another worker never picks it up even if the server times are out of sync (I think SteveJobs works like this: https://github.com/msavin/SteveJobs-meteor-jobs-queue). But honestly I don't know exactly how your package works, so this is just a long shot 😄

But even if that worked, I still don't understand why it starts cloning jobs like crazy. Why would a job clone itself? That is the part that mystifies me and feels fragile:

If server 1 picks up the job 1 hour too early because its clock is out of sync, it could just run it 1 hour early and complete it. Then server 2 would never pick it up.

But what actually happens is that server 1 picks it up 1 hour too early, completes it, clones it hundreds of times, and completes all the clones. And then finally server 2 also picks it up, but probably does not clone it because server 2's time > server 1's time.

vsivsi commented 6 years ago

Hi, I think this behavior is pretty simple to explain in the context of how job collection works... Take this example of how a repeating job will run with a single server/worker:

  1. Submit a job to the server configured to run every day at 4:05:00 UTC.
  2. The job enters the database in the waiting state (because it is not yet 04:05).
  3. Some time shortly after 4:05, the job is "promoted" to the ready state.
  4. A worker requests work for that job type, gets the job, completes it, and tells the server it is done.
  5. The server marks that job as completed, and submits a new one to run at the next occurrence of 4:05 (tomorrow).
  6. Goto step 2.

Now let's add to this scenario a second server with a clock that is one hour off. We'll call these Server1 (earlier clock) and Server2 (later clock).

  1. Submit a job to either server configured to run every day at 4:05:00 UTC.
  2. The job enters the database in the waiting state (because it is not yet 04:05 on either server).
  3. Some time shortly after 4:05 on Server2, the job is "promoted" to the ready state in the database.
  4. Now assume that a worker for this type of job is connected to Server1. The job is marked as ready in the database, so the worker obtains and completes the job.
  5. Since the worker is connected to Server1, that server marks the job as completed, and submits a new one to run at the next occurrence of 4:05 (today, remembering that Server1 is an hour behind Server2).
  6. The job enters the database in the waiting state (because it is not yet 04:05 on Server1).
  7. Server2 quickly marks the new job as ready in the database because it is after 4:05 according to its clock.
  8. Goto step 4.

Voila. You will run as many jobs in that hour as the job runtime and server "promotion interval" allow. Server1 creates each new job to run at 4:05 today (as long as that is in its future) and Server2 quickly promotes those jobs to ready because 4:05 today is in its past.
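To put rough numbers on that: assuming job-collection's default promotion interval of 15 seconds (adjustable via `jc.promote()`) and a job that completes almost instantly, a one-hour clock offset leaves room for roughly 3600 / 15 = 240 extra submit, promote, run, complete cycles, which is the same order of magnitude as the hundreds of copies reported above.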

For jobs that are scheduled to "wait" some period of time, the code that handles this adds a few seconds of slop to ensure that a small clock offset will not cause the job to instantly repeat, but that doesn't help in the case of jobs scheduled to run at specific times when the server clocks are off by more than the time to complete one whole cycle of: submit, promote, run, complete, resubmit.

Hope this explanation helps...