mediocregopher / radix.v2

Redis client for Go
http://godoc.org/github.com/mediocregopher/radix.v2
MIT License

sentinel does not sense switch-master #41

Closed kuangchanglang closed 8 years ago

kuangchanglang commented 8 years ago

Hi, I'm using radix v1, but I don't think this problem is related to the version.

The problem occurred when Redis switched master automatically. The sentinel logs show that all three of my sentinels received the switch-master signal:

Sep  7 22:10:51 redis-sentinel03-30015 redis-sentinel[1085]: +switch-master  
Sep  7 22:10:51 redis-sentinel02-30015 redis-sentinel[1382]: +switch-master  
Sep  7 22:10:53 redis-sentinel01-30015 redis-sentinel[1124]: +switch-master

However, my program did not notice the switch and kept using its connections to the old master. As a result, write commands returned a "READONLY" error.

I tried switching the master manually with the sentinel failover command, and also forced a switch by killing the master, but the problem did not show up again. So it may only happen under extreme conditions.

What could the reason be? And what can I do to make my program more robust?
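One defensive measure on the application side, independent of the sentinel subscription, is to treat a "READONLY" reply as a hint that the master has moved. The following is only a rough sketch: `respError`, `isReadOnlyErr`, `writeWithRefresh`, and the `refresh` callback are hypothetical names for illustration, not part of radix, and `refresh` stands in for whatever code asks sentinel for the current master and reconnects.

```go
package main

import (
	"fmt"
	"strings"
)

// respError stands in for an error reply returned by the Redis server.
// (Hypothetical type, used here so the sketch runs without a server.)
type respError string

func (e respError) Error() string { return string(e) }

// isReadOnlyErr reports whether err looks like the READONLY reply that a
// replica sends when it receives a write command.
func isReadOnlyErr(err error) bool {
	return err != nil && strings.HasPrefix(err.Error(), "READONLY")
}

// writeWithRefresh runs write once. If it fails with READONLY, it calls
// refresh (assumed to re-resolve the master and reconnect) and retries
// the write a single time.
func writeWithRefresh(write, refresh func() error) error {
	err := write()
	if !isReadOnlyErr(err) {
		return err
	}
	if rerr := refresh(); rerr != nil {
		return fmt.Errorf("refreshing master after READONLY: %w", rerr)
	}
	return write()
}

func main() {
	// Simulate a client that still points at the demoted master.
	current := "old-master"
	write := func() error {
		if current == "old-master" {
			return respError("READONLY You can't write against a read only replica.")
		}
		return nil
	}
	refresh := func() error {
		current = "new-master"
		return nil
	}
	fmt.Println("write error:", writeWithRefresh(write, refresh))
}
```

Retrying only once keeps the wrapper from looping forever if sentinel itself still reports the old master.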

mediocregopher commented 8 years ago

Hey there! Could this problem be related to how long the connection to the sentinel instance has been held open? In other words, if you leave the system sitting there for a very long time and then try a manual failover, does it ignore it then?

It looks like the sentinel code isn't doing any pinging with the sentinel instance, so that could very well be the problem.

kuangchanglang commented 8 years ago

The connection had been held open for about 7 days. I tried another system that had also been running for around 7 days, and it received switch-master immediately.

The connection to port 26379 stays ESTABLISHED, so how can a message sent over this connection be lost?

mediocregopher commented 8 years ago

I've made some much-needed updates to the pubsub package, which allowed me to have the sentinel client ping the sentinel instance periodically. If the issue you were having was indeed that the connection was timing out, then this should at least give you an error in that case, and not just silently fail.
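To illustrate the general idea of an application-layer ping (this is not the actual radix change), here is a minimal sketch over a plain `net.Conn`. The names `pingConn` and `checkPeer` are made up for this example, and the peer is simulated in memory with `net.Pipe`. The sketch treats any reply line arriving before the deadline as proof of liveness; a real pubsub client would have to route that reply through its normal message reader instead of discarding it.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"net"
	"time"
)

// pingConn sends an inline PING and treats any reply line that arrives
// before the deadline as proof that the peer is still reachable.
func pingConn(conn net.Conn, r *bufio.Reader, timeout time.Duration) error {
	if err := conn.SetDeadline(time.Now().Add(timeout)); err != nil {
		return err
	}
	defer conn.SetDeadline(time.Time{}) // clear the deadline again
	if _, err := conn.Write([]byte("PING\r\n")); err != nil {
		return fmt.Errorf("ping write: %w", err)
	}
	if _, err := r.ReadString('\n'); err != nil {
		return fmt.Errorf("ping read: %w", err)
	}
	return nil
}

// checkPeer runs pingConn against an in-memory peer that either answers
// or silently swallows traffic, similar to a connection cut by a firewall.
func checkPeer(answers bool) error {
	client, server := net.Pipe()
	defer client.Close()
	defer server.Close()
	go func() {
		if !answers {
			io.Copy(io.Discard, server) // read everything, never reply
			return
		}
		if _, err := bufio.NewReader(server).ReadString('\n'); err == nil {
			server.Write([]byte("+PONG\r\n"))
		}
	}()
	return pingConn(client, bufio.NewReader(client), 100*time.Millisecond)
}

func main() {
	fmt.Println("responsive peer:", checkPeer(true))
	fmt.Println("silent peer:", checkPeer(false))
}
```

The read deadline is what turns a silently dead connection into a visible error, so the caller can reconnect and re-query the master.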

"How can a message sent over this connection be lost?"

Unless both the server and client turn on TCP keepalive (which the client has to do explicitly when creating the connection, and the server has to enable using sysctl), it's very difficult for Linux to detect that an idle connection has been killed in some kind of network outage. Unfortunately, an application-layer ping (like the one I just added) is pretty much the only way to solve this.
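For the client half of the keepalive option described above, Go's standard library exposes a `KeepAlive` field on `net.Dialer`. The sketch below is not how radix dials its connections. `dialKeepAlive` and `dialLocal` are illustrative names, and the 30-second period and 5-second dial timeout are arbitrary assumptions. It dials a throwaway loopback listener so it can run without a sentinel.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// dialKeepAlive opens a TCP connection with keepalive probes enabled,
// using period as the keepalive interval.
func dialKeepAlive(addr string, period time.Duration) (net.Conn, error) {
	d := net.Dialer{
		Timeout:   5 * time.Second, // assumed connect timeout
		KeepAlive: period,
	}
	return d.Dial("tcp", addr)
}

// dialLocal exercises dialKeepAlive against a temporary loopback listener.
func dialLocal() error {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return err
	}
	defer ln.Close()

	conn, err := dialKeepAlive(ln.Addr().String(), 30*time.Second)
	if err != nil {
		return err
	}
	defer conn.Close()

	if _, ok := conn.(*net.TCPConn); !ok {
		return fmt.Errorf("unexpected connection type %T", conn)
	}
	return nil
}

func main() {
	// In the setup from this thread the address would be a sentinel,
	// for example a host on port 26379.
	fmt.Println("dial error:", dialLocal())
}
```

Keepalive only reports a dead peer after the kernel's probes fail, so it complements an application-layer ping and does not replace it.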

mediocregopher commented 8 years ago

Forgot to add: I've left my changes in that branch for now. Would it be possible for you to build off of that branch and let me know if that fixes your problems?

kuangchanglang commented 8 years ago

Hi @mediocregopher, I blocked the connection between sentinel and my app with iptables, and the problem reproduced exactly as before. So the switch-master signal can be lost because of any kind of network problem.

I didn't try your branch since my code is incompatible with radix.v2, but your idea looks good for this problem. I'll try it later. Thanks!