shannon opened this issue 10 years ago
Oh, that's not good.
I wonder what the best way to solve this is. We could go directly for the index in the error, but then risk missing some updates.
It seems like fleet just reconnects with index + 1 until it resyncs. That could cause thousands of requests, though... https://github.com/bcwaldon/fleet/blob/8a269e880ee561ad1722e9de94d6ccb4e5656562/registry/event.go#L87-L91
I guess we could do (error.index - 999); that might work.
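Something along these lines is what I have in mind (just a sketch, untested; it assumes the watcher exposes its next poll index as `watcher.index` and that the 401 error body carries etcd's current index):

```js
// Sketch only: on a 401 "event index outdated and cleared" error, jump the
// watch index back to just inside etcd's ~1000-event history window instead
// of stepping forward one index at a time like fleet does.
watcher.on('reconnect', function (err) {
  // The shape of the error object is assumed here.
  var current = err && err.error && err.error.index;
  if (typeof current === 'number') {
    watcher.index = Math.max(current - 999, 1);
  }
});
```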
Sorry, I was away for the weekend.
Yeah, I wondered how they were dealing with it. I think your (-999) approach would work, but only if nothing happens before you get the next request out.
I almost feel as though there are two separate cases here: one where you don't care about the stuff in between and just need to get the watcher back in place, and one where you are trying to watch every event as it happens. The second case is sort of a lost cause if the watcher is in this state already. Maybe it's best to just make sure the connection is re-established and then emit a warning event of some kind?
> We could go directly for the index in the error, but then risk missing some updates.
@stianeikeland Why would this approach miss any updates? If any changes had been made up to the index specified in the error, etcd would respond with those changes. It should therefore be safe to restart polling from that index.
Below is a workaround for the issue that I applied in my application. Note that if I detect that etcd provided a new index in the error response, I issue a new poll immediately by substituting the `_watch` method for `_retry`, which uses a backoff timer. Let me know if you think this approach has a conceptual issue.
```js
watch.on('reconnect', function (error) {
  // Work around a bug in node-etcd where the watcher etcd index
  // becomes so old it is out of scope of the history etcd maintains.
  // Advance the watch index based on information supplied by etcd in the
  // error response.
  var newIndex;
  if (typeof error.error === 'object'
      && typeof error.error.error === 'object'
      && !isNaN(error.error.error.index)) {
    newIndex = error.error.error.index;
  } else if (typeof error.headers === 'object'
      && !isNaN(error.headers['x-etcd-index'])) {
    newIndex = +error.headers['x-etcd-index'];
  }
  if (newIndex !== undefined) {
    // Instead of _retry with a backoff timer, make the
    // watcher issue the request immediately with _watch.
    watch.index = newIndex;
    var oldRetry = watch._retry;
    watch._retry = function () {
      watch._retry = oldRetry;
      watch._watch();
    };
  }
});
```
I've now added basic resync to trunk.
@tjanczuk I'm thinking about something like the following timeline:
client A is now resynced, but missed the key='value 2' update.
Now I just emit a `resync` event from the watcher, so that you can check manually for missed values if needed.
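Consuming the event could look roughly like this (a sketch only; constructor arguments and key names are placeholders):

```js
var Etcd = require('node-etcd');

var etcd = new Etcd('127.0.0.1', '4001'); // host/port are placeholders
var watcher = etcd.watcher('/mykey');

watcher.on('change', function (res) {
  console.log('value changed:', res.node.value);
});

// 'resync' means the watch index was too old and the watcher skipped ahead,
// so updates may have been missed in between. Re-read the key to catch up.
watcher.on('resync', function () {
  etcd.get('/mykey', function (err, res) {
    if (!err) console.log('current value after resync:', res.node.value);
  });
});
```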
Maybe we could do the following on an outdated and cleared index:
Or am I overthinking things? :)
Published; could you try out version 2.1.5?
I think you are right; what I proposed above would not address the case of a race condition between a broken network connection and updates on the server.
I would suggest a slight optimization over what you describe above:
As an aside, a useful feature in node-etcd would be a `modified` event that only fires when the value has actually changed.
Actually I think I am overthinking it now. If there was an error due to network issues, and the index has become out of scope in the meantime, the next successful request will error with a 401 code, and your case here: https://github.com/stianeikeland/node-etcd/blob/master/src/watcher.coffee#L71-L72 should cover it.
So looking at your code, it seems everything should work correctly. The one slight optimization that could save a round trip to etcd is to also advance the index on an empty long-poll response (https://github.com/stianeikeland/node-etcd/blob/master/src/watcher.coffee#L75-L76); this avoids an occasional 401 for infrequently changing values.
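In rough terms the idea is just this (a sketch with illustrative names, not the actual watcher.coffee code):

```js
// Sketch of the optimization: when a long poll times out with no change,
// etcd still reports its current index in the X-Etcd-Index header. Advancing
// the watch index to that value keeps the watcher inside etcd's ~1000-event
// history, so the next poll doesn't come back as a 401.
function onEmptyPollResponse(headers) {
  var current = +headers['x-etcd-index'];
  if (!isNaN(current) && current > watcher.index) {
    watcher.index = current;
  }
}
```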
Version 2.1.5 appears to perform well for me in a long-running scenario with infrequent changes. Thanks for the quick turnaround!
Thanks! Also agree with your optimizations.
If there are no changes in the directory you are watching and the etcd index has advanced more than 1000 times, the reconnection will fail continuously. This can happen quite often when using fleet, since it writes heartbeats very frequently.
I think the reconnection should use the index returned in the 401 error.
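For reference, the 401 response body from etcd looks roughly like this (values are made up); the `index` field is the one the reconnect could resync to:

```js
// Approximate shape of etcd's 401 "index cleared" error (illustrative values).
var exampleErrorBody = {
  errorCode: 401,
  message: 'The event in requested index is outdated and cleared',
  cause: 'the requested history has been cleared [1000/8]',
  index: 2007
};
```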