stianeikeland / node-etcd

:satellite: Etcd client for nodejs

Timeouts cannot reconnect if event in requested index has been cleared #31

Open shannon opened 10 years ago

shannon commented 10 years ago

If there are no changes in the directory you are watching and the etcd index has changed more than 1000 times, the reconnection will fail continuously. This can happen quite often when using Fleet, since it writes heartbeats a lot.

{ [Error: Connection error, reconnecting.]
  error: 
   { [Error: The event in requested index is outdated and cleared]
     errorCode: 401,
     error: 
      { errorCode: 401,
        message: 'The event in requested index is outdated and cleared',
        cause: 'the requested history has been cleared [1300446/1297481]',
        index: 1301445 } },
  reconnectCount: 1 }

{ [Error: Connection error, reconnecting.]
  error: 
   { [Error: The event in requested index is outdated and cleared]
     errorCode: 401,
     error: 
      { errorCode: 401,
        message: 'The event in requested index is outdated and cleared',
        cause: 'the requested history has been cleared [1300466/1297481]',
        index: 1301465 } },
  reconnectCount: 2 }

...

{ [Error: Connection error, reconnecting.]
  error: 
   { [Error: The event in requested index is outdated and cleared]
     errorCode: 401,
     error: 
      { errorCode: 401,
        message: 'The event in requested index is outdated and cleared',
        cause: 'the requested history has been cleared [1316622/1297481]',
        index: 1317621 } },
  reconnectCount: 12 }

I think the reconnection should take the index returned in the 401 error.

stianeikeland commented 10 years ago

Oh, that's not good.

I wonder what the best way to solve this is. We could go directly for the index in the error, but then risk missing some updates.

It seems like fleet just reconnects with index + 1 until it resyncs. That could cause thousands of requests, though: https://github.com/bcwaldon/fleet/blob/8a269e880ee561ad1722e9de94d6ccb4e5656562/registry/event.go#L87-L91

I guess we could do (error.index - 999); that might work.
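For context, the 401 happens because etcd only retains roughly the last 1000 events in its watch history. A purely illustrative sketch of that idea (not the actual watcher code):

// Jump to just inside the ~1000-event window etcd keeps, using the
// current etcd index reported in the 401 error response.
function resyncIndex(etcdError) {
    return Math.max(etcdError.index - 999, 0);
}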

shannon commented 10 years ago

Sorry, I was away for the weekend.

Yeah, I wondered how they were dealing with it. I think your (-999) approach would work, but only if nothing happens before you get the next request out.

I almost feel as though there are two separate cases here: one where you don't care about the events in between and just need to get the watcher back in place, and one where you are trying to watch every event as it happens. The second case is sort of a lost cause if the watcher is already in this state. Maybe it's best to just make sure the connection is re-established and then emit a warning event of some kind?

tjanczuk commented 10 years ago

> We could go directly for the index in the error, but then risk missing some updates.

@stianeikeland Why would this approach miss any updates? If any changes had been made up to the index specified in the error, etcd would have responded with those changes. It should therefore be safe to restart polling from that index.

Below is a workaround for the issue that I applied in my application. Note that if I detect that etcd provided a new index in the error response, I issue a new poll immediately by substituting the _watch method for _retry, which uses a backoff timer. Let me know if you think this approach has a conceptual issue.

watch.on('reconnect', function (error) {
    // Work around a bug in node-etcd where the watcher etcd index 
    // becomes so old it is out of scope of the history etcd maintains.
    // Advance the watch index based on information supplied from etcd in the 
    // error response.
    var newIndex;
    if (typeof error.error === 'object' 
        && typeof error.error.error === 'object'
        && !isNaN(error.error.error.index))
        newIndex = error.error.error.index;
    else if (typeof error.headers === 'object'
        && !isNaN(error.headers['x-etcd-index']))
        newIndex = +error.headers['x-etcd-index'];

    if (newIndex !== undefined) {
        // Instead of _retry with a backoff timer, make the
        // watcher issue the request immediately with _watch.
        watch.index = newIndex;
        var oldRetry = watch._retry;
        watch._retry = function () {
            watch._retry = oldRetry;
            watch._watch();
        };
    }
});

stianeikeland commented 10 years ago

I've now added basic resync to trunk.

@tjanczuk I'm thinking about something like the following timeline:

client A is now resynced, but missed the key='value 2' update.

Now I just emit a 'resync' event from the watcher, so that you can check manually for missed values if needed.
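For reference, a minimal sketch of listening for that event from application code (assuming the standard node-etcd client; '/mykey' is just a placeholder key):

var Etcd = require('node-etcd');
var etcd = new Etcd();
var watcher = etcd.watcher('/mykey');

watcher.on('resync', function () {
    // History was cleared while reconnecting, so changes may have been missed:
    // re-read the current value and reconcile application state manually.
    etcd.get('/mykey', function (err, body) {
        if (!err) {
            console.log('current value after resync:', body.node.value);
        }
    });
});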

Maybe we could do the following on outdated and cleared index:

Or am I overthinking things? :)

stianeikeland commented 10 years ago

Published, could you try out version 2.1.5?

tjanczuk commented 10 years ago

I think you are right; what I proposed above would not address the case of a race condition between a broken network connection and updates on the server.

I would suggest a slight optimization over what you describe above:

  1. If the response from the server contains a new etcd index in any shape or form, the next long poll should start listening from that index, without delay. This will be the case when there were no updates to the value in etcd before the server decided to gracefully terminate the long poll.
  2. If the long poll results in an error in which the server does not send a new etcd index (e.g. a broken TCP connection), node-etcd should do a regular get on the value being listened to, compare it to the last value seen, issue a changed event if it differs, and then resume long polls from the etcd index in the get response (see the sketch after this list).
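A rough sketch of the fallback in point 2 (illustrative only: lastSeenValue, emitChanged and resumeWatch are not node-etcd API, and it assumes the get callback also exposes the response headers):

function recoverWithoutIndex(etcd, key, lastSeenValue, emitChanged, resumeWatch) {
    etcd.get(key, function (err, body, headers) {
        if (err) return;                      // etcd still unreachable; back off and retry
        var current = body.node.value;
        if (current !== lastSeenValue) {
            emitChanged(current);             // the value changed while we were disconnected
        }
        // resume long polls just past the index reported by the get response
        resumeWatch(+headers['x-etcd-index'] + 1, current);
    });
}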

As an aside, a useful feature in node-etcd would be a 'modified' event that only fires when the value has actually changed.

tjanczuk commented 10 years ago

Actually, I think I am overthinking it now. If there was an error due to network issues and the index has gone out of scope in the meantime, the next successful request will fail with a 401, and your handling here: https://github.com/stianeikeland/node-etcd/blob/master/src/watcher.coffee#L71-L72 should cover it.

So, looking at your code, it seems everything should work correctly. One slight optimization that could save a round trip to etcd is to also advance the index in the case of an empty long-poll response (https://github.com/stianeikeland/node-etcd/blob/master/src/watcher.coffee#L75-L76) - this would avoid an occasional 401 for infrequently changing values.
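A sketch of that optimization (illustrative, not the actual watcher.coffee code):

// When a long poll returns empty, advance the wait index from the
// X-Etcd-Index header so the next poll starts from a fresh index.
function onEmptyLongPoll(watcher, headers) {
    var etcdIndex = +headers['x-etcd-index'];
    if (!isNaN(etcdIndex) && etcdIndex >= watcher.index) {
        // no events occurred in [watcher.index, etcdIndex], so skipping
        // ahead misses nothing and avoids an occasional 401 resync later
        watcher.index = etcdIndex + 1;
    }
}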

tjanczuk commented 10 years ago

Version 2.1.5 appears to perform well for me in a long-running, infrequently changing scenario. Thanks for the quick turnaround!

stianeikeland commented 10 years ago

Thanks! I also agree with your optimizations.