getBulk() doesn't use FailureMode

GoogleCodeExporter commented 8 years ago

Setup:
-multiple memcached servers
-ConnectionFactory.setFailureMode(FailureMode.Cancel)

Conditions:
-one of the memcached servers goes down, or is restarted

Observations:
-single key operations are immediately canceled (throws a CancellationException)
-multi-key operations (asyncGetbulk()) do not get cancelled. Instead they will 
timeout on the inactive node.

The cause seems to be the code in MemcachedClient.asyncGetBulk(): currently 
this code doesn't check the FailureMode value, only the node's active status. 
If a node is inactive, it emulates the Redistribute failure mode.

The attached patch checks the ConnectionFactory's failure mode, and emulates 
the behavior of MemcachedConnection.addOperation:
-if the node is active or FailureMode is Retry, use the primary node
-if the node is inactive and FailureMode is Cancel, don't create an operation 
(so no value will be returned)
-otherwise, redistribute the key (current behavior)

This patch is not perfect:
-it uses the ConnectionFactory failure mode, not the node's connection's 
FailureMode value (not visible); I'm pretty sure the values will be the same 
though.
-it doesn't throw a CancellationException if the FailureMode is Cancel, and a 
node is inactive. This is a compromise. The code could throw a 
CancellationException when a node is down, but it seems very inefficient if a 
single key -out of many- is currently inaccessible. Instead, this change will 
cause in a "cache miss" for this key, but all the other values will be 
returned. If the code then tries to set the missing value.

This compromise is acceptable for us: we're looking for as little service 
impact as possible when one of our memcacehd servers goes down. The current 
behavior (timeout) causes a big pile-up and cascading timeouts.

Original issue reported on code.google.com by da...@feedly.com on 23 Jul 2015 at 8:09

GoogleCodeExporter commented 8 years ago

I just realized another potential bug fixed by this patch: if FailureMode is 
set to Retry, and a node becomes inactive, the getBulk() methods will not use 
the primary node. Instead, they will behave like the Redistribute mode, 
potentially fetching the value from the wrong node.
The patch should fix this issue, and retry on the primary node.

Original comment by da...@feedly.com on 23 Jul 2015 at 9:06

GoogleCodeExporter commented 8 years ago

Re-logged in Couchbase's issue tracking, since apparently this one is obsolete:
https://issues.couchbase.com/browse/SPY-188

Original comment by da...@feedly.com on 24 Jul 2015 at 9:07

GoogleCodeExporter commented 8 years ago

Original comment by da...@feedly.com on 24 Jul 2015 at 9:29

Attachments:

spymemcached.patch

jlorenzen / spymemcached

getBulk() doesn't use FailureMode #322