iriscouch / follow

Very stable, very reliable, NodeJS CouchDB _changes follower
Apache License 2.0

Duplicate change events after confirm timeout and restart #53

Open mjq opened 10 years ago

mjq commented 10 years ago

Here's the relevant code block to follow along.

In Feed.prototype.confirm, a request is made to check if the DB is reachable, and a timeout is set to detect a slow response from Couch. If the timeout is hit, the Feed is killed (self.die is called). But, the request object isn't destroyed. That means that if Couch responds after the timeout, the happy path callback db_response still gets called.

Normally, this isn't noticeable, since the Feed object is dead and everything short-circuits. But if the user called restart on the feed in response to the error, dead will be false, and the Feed ends up getting set up twice (once in response to the timed-out request, and once due to restart()). This results in every change event being emitted twice.

The fix would seem to be adding destroy_req(req); here before dying. I haven't figured out how to write a test for this, though. Any ideas?
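The sequence described above can be modelled without a real CouchDB. The following is only a sketch under stated assumptions, not follow's actual code: makeFakeRequest, confirmSketch, simulate and the setups counter are all hypothetical names, the timeout and the late response are triggered by hand instead of by timers, and req.destroy() stands in for destroy_req(req).

```javascript
// Hypothetical stand-in for an HTTP request. respond() simulates Couch
// answering; once destroy() has been called the callback never fires.
function makeFakeRequest(onResponse) {
  var destroyed = false;
  return {
    destroy: function () { destroyed = true; },
    respond: function () { if (!destroyed) onResponse(); }
  };
}

// Simplified model of confirm(): returns handles so the timeout and the
// response can be triggered manually, in any order.
function confirmSketch(feed, destroyOnTimeout) {
  var req = makeFakeRequest(function dbResponse() {
    if (feed.dead) return;  // the short-circuit that restart() defeats
    feed.setups += 1;       // stands in for setting up the changes query
  });
  return {
    req: req,
    timeout: function () {
      if (destroyOnTimeout) req.destroy();  // the proposed fix
      feed.dead = true;                     // stands in for self.die()
    }
  };
}

// Replays the scenario: timeout, restart, then both requests answered late.
function simulate(destroyOnTimeout) {
  var feed = { dead: false, setups: 0 };
  var first = confirmSketch(feed, destroyOnTimeout);
  first.timeout();      // confirm timeout fires, the feed dies
  feed.dead = false;    // the user calls restart() from the error handler
  var second = confirmSketch(feed, destroyOnTimeout);
  first.req.respond();  // Couch finally answers the timed-out request
  second.req.respond(); // and answers the request made by restart()
  return feed.setups;
}

console.log('without destroy: %d setups', simulate(false));
console.log('with destroy: %d setups', simulate(true));
```

In this model the feed is set up twice when the timed-out request is left alive and once when it is destroyed in the timeout handler, which matches the doubled change events reported here. A real test would still need a server that answers later than the confirm timeout.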

jcrugzz commented 10 years ago

@mjq do you have any sample code that reproduces this? That's the best place to start for a test.

mjq commented 10 years ago

Sorry, sure. Simplified, it's:

var follow = require('follow');
var db = '...';

var feed = new follow.Feed({db: db, include_docs: true});

feed.on('change', function(change) {
  console.log('got change %d', change.seq);
});

feed.on('error', function(err) {
  console.log('got error %s, restarting in 5s', err.message);
  setTimeout(function() {
    console.log('restarting');
    feed.restart();
  }, 5000);
});

feed.start();

Normally, the logs would look like

got change 5
got change 6
got change 7

But, if the first attempt to reach the database times out but responds shortly after, you'll see

got error "Timeout confirming database: <db name>", restarting in 5s
restarting
got change 5
got change 5
got change 6
got change 6
got change 7
got change 7

jcrugzz commented 10 years ago

@mjq this is fascinating, I've never seen this happen. destroy_req should be called by the die function, but it seems like there is a race condition leaving two requests? I'll have to dig deeper on this when I have a minute.

mjq commented 10 years ago

@jcrugzz die destroys self.pending.request, but the request in confirm is a local variable, so if it isn't destroyed in confirm, nothing will (or so it seems to me).
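To make the scoping point concrete, here is a rough model of it. It is a sketch only, assuming die() destroys nothing except pending.request as described above; makeFeed, makeRequest, confirmThenDie and the track flag are hypothetical and not part of follow.

```javascript
// Hypothetical feed: die() only knows about pending.request.
function makeFeed() {
  return {
    dead: false,
    pending: { request: null },
    die: function () {
      if (this.pending.request) {
        this.pending.request.destroy();
        this.pending.request = null;
      }
      this.dead = true;
    }
  };
}

// Hypothetical request that just records whether it was destroyed.
function makeRequest() {
  return {
    destroyed: false,
    destroy: function () { this.destroyed = true; }
  };
}

// Creates a request the way confirm is described as doing (a local
// variable), optionally registers it on the feed, then kills the feed.
function confirmThenDie(track) {
  var feed = makeFeed();
  var req = makeRequest();                // local, invisible to die()
  if (track) feed.pending.request = req;  // assumption: one way to expose it
  feed.die();
  return req.destroyed;
}

console.log('local request destroyed by die():', confirmThenDie(false));
console.log('tracked request destroyed by die():', confirmThenDie(true));
```

The untracked request survives die(), which is why it has to be destroyed explicitly in confirm's timeout handler (or be made visible to die() in some other way).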

A simpler bug to test, repro and fix may just be:

Since db_response only applies to the success case, that alone is weird/wrong behaviour, and just by fixing that (by e.g. destroying the request in the timeout fn), it should prevent the double-listener stuff.

re: race conditions: We've got a single process simultaneously following an ever-changing set of a few thousand databases (with all those databases on the same CouchDB box). So, when requests to that box start stalling... well, if there's a race condition to be found, we'll find it, heh.

I'm giving this patch a trial by fire right now, but I don't know how long it will take for us to trigger the bug again.

jcrugzz commented 10 years ago

@mjq gotcha, this is before it is piped into the changes-stream. Let me know if you can reproduce that but that looks like a valid fix. Super edge case but I can see the potential for it happening.

arikon commented 9 years ago

@mjq @jcrugzz Are you going to fix this?

carrotalan commented 6 years ago

+1 - This is still an issue