microsoftarchive / redis

Redis is an in-memory database that persists on disk. The data model is key-value, but many different kinds of values are supported: Strings, Lists, Sets, Sorted Sets, Hashes
http://redis.io

Slave sending IP address of unbound NIC to master #340

Open joecoolish opened 8 years ago

joecoolish commented 8 years ago

I have 4 servers (virtual machines hosted in Hyper-V), each with 2 NICs: one NIC is for Windows Network Load Balancing (WNLB), the other is for regular network connectivity.

They are:

M1: 192.168.1.44 / 192.168.1.43
M2: 192.168.1.55 / 192.168.1.54
M3: 192.168.1.66 / 192.168.1.65
M4: 192.168.1.77 / 192.168.1.76

Where the first IP is the WNLB IP address, and the second is the regular network IP.

M1 is the Master and I want to make the other 3 Slaves. All of their redis.config files look like:

port 6379
tcp-backlog 511

bind 192.168.1.43 #54,65,76

timeout 0
tcp-keepalive 0
...

When I make M2 and M3 slaves of M1 and run info on M1, I get the following text in the # Replication section:

# Replication
role:master
connected_slaves:2
slave0:ip=192.168.1.54,port=6379,state=online,offset=123456,lag=0
slave1:ip=192.168.1.65,port=6379,state=online,offset=123456,lag=0
...

This is correct: the IP addresses that M2 and M3 are configured to bind to show up as slaves of M1. Now, if I run slaveof 192.168.1.43 6379 on M4 and then run info on M1, the # Replication section reads:

# Replication
role:master
connected_slaves:3
slave0:ip=192.168.1.54,port=6379,state=online,offset=123456,lag=0
slave1:ip=192.168.1.65,port=6379,state=online,offset=123456,lag=0
slave1:ip=192.168.1.77,port=6379,state=online,offset=123456,lag=0
...

192.168.1.77 doesn't have a redis server bound to it, so if I attach a sentinel to that server, I immediately get:

[DATE] # +sdown slave 192.168.1.77:6379 192.168.1.77 6379 @ webredis 192.168.1.43 6379

The correct IP address should be 192.168.1.76, which is what is configured in the redis.config file for M4.
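To spot this kind of mismatch quickly, the # Replication output can be compared against the addresses each slave is supposed to bind to. The following is a minimal sketch, assuming the info text has already been fetched from M1 by some other means; the function names and the EXPECTED_BIND_IPS set are hypothetical helpers for illustration, not part of redis.

```python
import re

def parse_slaves(info_text):
    """Return a list of (ip, port) tuples for every slaveN line in INFO output."""
    slaves = []
    for line in info_text.splitlines():
        match = re.match(r"slave\d+:(.*)", line.strip())
        if not match:
            continue
        # Each slave line is a comma-separated list of key=value pairs.
        fields = dict(kv.split("=", 1) for kv in match.group(1).split(","))
        slaves.append((fields["ip"], int(fields["port"])))
    return slaves

def unexpected_slaves(info_text, expected_ips):
    """Return the slaves whose reported IP is not one of the expected bind IPs."""
    return [(ip, port) for ip, port in parse_slaves(info_text)
            if ip not in expected_ips]

# Sample text copied from the report above.
SAMPLE = """# Replication
role:master
connected_slaves:3
slave0:ip=192.168.1.54,port=6379,state=online,offset=123456,lag=0
slave1:ip=192.168.1.65,port=6379,state=online,offset=123456,lag=0
slave1:ip=192.168.1.77,port=6379,state=online,offset=123456,lag=0
"""

# Assumption: these are the "regular network" IPs that M2, M3 and M4 bind to.
EXPECTED_BIND_IPS = {"192.168.1.54", "192.168.1.65", "192.168.1.76"}

print(unexpected_slaves(SAMPLE, EXPECTED_BIND_IPS))
```

With the sample above, the only entry flagged is the 192.168.1.77 address reported for M4.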

I've reset every redis server in the cluster several times, and it always comes back that M4 sends the wrong IP address. Is there any reason why this would happen?

rpannell commented 8 years ago

@joecoolish I am seeing the same issue on our end. I have bind and announce-ip configured on the sentinels. Check your sentinel config and you will probably see multiple slaves with different IP addresses. BIND only controls the address that the redis server listens on; it doesn't have any effect (that I have seen) on which IP address the server announces when attaching to a master. Would love to see an announce-ip setting for redis as well.

But even with those odd IP addresses, everything works (like a champ), so I just ignore it.

joecoolish commented 8 years ago

I'm glad I'm not the only one that is seeing this!

I'm having an issue that I thought was explained by the incorrect slave IP addresses. Before I can move forward with redis in production, I need to be able to show Disaster Recovery and High Availability.

The test I need to successfully complete is to run a program that increments a number in redis by 1 every 100ms and show how the sentinels can help reestablish the connection with the master database in case of catastrophic failure (hard shutdown). I'm taking M1-4 above, hard shutting down whichever VM happens to be the Master, and showing that the Sentinels will recover the connection.

Since all of the VMs are behind a load balancer, I can connect to port 26379 on the Cluster IP and (in theory) always get a sentinel. Then I just ask where the master is and then subscribe to the +switch-master message on the sentinel. When I receive this message, I perform my failover logic and start all over again.
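The failover step described here depends on reading the new master address out of the +switch-master message. As a minimal sketch, assuming the payload arrives as the space-separated string that Sentinel publishes on that channel (master name, old IP, old port, new IP, new port), the parsing could look like the following. The MasterSwitch type and parse_switch_master function are hypothetical names, and the network side (connecting to the Cluster IP on port 26379 and subscribing) is left out.

```python
from typing import NamedTuple, Tuple

class MasterSwitch(NamedTuple):
    """Parsed form of a +switch-master payload."""
    name: str
    old_addr: Tuple[str, int]
    new_addr: Tuple[str, int]

def parse_switch_master(payload: str) -> MasterSwitch:
    """Split '<name> <old-ip> <old-port> <new-ip> <new-port>' into typed fields."""
    name, old_ip, old_port, new_ip, new_port = payload.split()
    return MasterSwitch(name, (old_ip, int(old_port)), (new_ip, int(new_port)))

# Hypothetical payload: webredis fails over from M1 to M2.
event = parse_switch_master("webredis 192.168.1.43 6379 192.168.1.54 6379")
print(event.new_addr)
```

A client would reconnect to event.new_addr after receiving the message. If the sentinels only know a slave by an address it is not listening on, the address delivered here would be unreachable as well.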

That all works until I get to only M3 and M4 being up. When I hard shutdown M3, the whole service goes down because the sentinels want to attach to M4 on 192.168.1.77. Once I figure out how to fix this, I think I can move ahead with actually leveraging redis in my prod apps. So close!

rpannell commented 8 years ago

First, really double check the config files; it kinda sounds like one of the binds (or announce settings) is off. Also, how many sentinels do you have? I have read that you should have one more sentinel than you have redis servers (so 2 redis servers should have 3 sentinels, 4 should have 5, etc.) and set the quorum to a majority.

I saw some odd issues with 2 sentinels with the quorum as 1. I believe it was something similar to what you are seeing. Currently we only have 2 redis servers (M1, S1) with 3 sentinels. One sentinel on each redis server box and another off on another random web server box.

joecoolish commented 8 years ago

I have 4 servers in dev and 8 in prod, so how would you recommend I configure the slaves and masters? I would like to have a sentinel on all of the servers so I can point all my clients to the CIP.

I double checked, and all the sentinels are announcing the correct IPs (no duplicates!). The binds are all pointing to the correct IPs as well.

I'm changing my test to just stop the redis service and keep the sentinel service running, but I'm seeing the same result. I'm getting the following sentinel messages when I take M3 down:

[DATE] # +vote-for-leader [GUID]
[DATE] # +odown master webredis 192.168.1.65 6379 #quorum 3/3
[DATE] # Next failover delay: I will not start a failover before [DATE]

And this is with M4 still running. M4 should be elected as the next master, but it's not visible because the only record that the sentinels have is for 192.168.1.77 and not 192.168.1.76.

I guess I could write in the governance doc that there should always be 3+ servers running and that DR/HA is compromised when the server count gets to 2, but that would make for a tougher sell.

Any suggestions?

rpannell commented 8 years ago

I remember that error! And I am not 100% sure what cleared it!

Is your quorum 3? So 3 of the 4?

Also, check the bottom of your sentinel config file and look for the "sentinel known-sentinel" lines, which should have the IP addresses of the other sentinels. Make sure you don't have duplicates there left over from the earlier announce-ip issues. I think that was my problem. I had to shut down each sentinel, clear those entries out (just deleted them outright), and start them back up. That should clear out any duplicate "known-sentinel" entries. Start with 3 and 4 first.
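The duplicate check can be scripted rather than done by eye. This is a rough sketch, assuming the lines Sentinel rewrites at the bottom of its config have the form "sentinel known-sentinel <master-name> <ip> <port> <run-id>"; that layout is an assumption and should be verified against your own file. The function name and the run IDs in the sample are made up for illustration.

```python
from collections import defaultdict

def sentinels_with_multiple_addresses(conf_text):
    """Map run ID -> list of (ip, port) for run IDs recorded at more than one address."""
    by_run_id = defaultdict(list)
    for line in conf_text.splitlines():
        parts = line.split()
        # Assumed layout: sentinel known-sentinel <master> <ip> <port> <run-id>
        if len(parts) == 6 and parts[:2] == ["sentinel", "known-sentinel"]:
            _, _, _master, ip, port, run_id = parts
            by_run_id[run_id].append((ip, int(port)))
    return {rid: addrs for rid, addrs in by_run_id.items() if len(addrs) > 1}

# Hypothetical config tail; "aaaa" and "bbbb" stand in for real run IDs.
SAMPLE_CONF = """sentinel monitor webredis 192.168.1.43 6379 3
sentinel known-slave webredis 192.168.1.77 6379
sentinel known-sentinel webredis 192.168.1.76 26379 aaaa
sentinel known-sentinel webredis 192.168.1.77 26379 aaaa
sentinel known-sentinel webredis 192.168.1.65 26379 bbbb
"""

print(sentinels_with_multiple_addresses(SAMPLE_CONF))
```

Any run ID that appears here is a candidate for the cleanup described above: stop the sentinel, delete the stale lines, and start it again.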

joecoolish commented 8 years ago

I think the 3/3 is my quorum number. I have 4 sentinels configured to have a 3 sentinel quorum.
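For the hard-shutdown version of the test, the vote counting matters as well. The Sentinel documentation describes two thresholds: the configured quorum, needed to flag the master as objectively down, and a majority of all known sentinels, needed to authorize a failover. The sketch below only models that arithmetic, under the assumption that every reachable sentinel agrees; it says nothing about the address mismatch discussed earlier, and the function names are hypothetical.

```python
def majority(total_sentinels: int) -> int:
    """Smallest number of sentinels that is more than half of the total."""
    return total_sentinels // 2 + 1

def can_authorize_failover(total_sentinels: int, reachable: int, quorum: int) -> bool:
    """True if enough sentinels are reachable to reach quorum and a majority."""
    return reachable >= quorum and reachable >= majority(total_sentinels)

# 4 sentinels with a quorum of 3, as in the setup described above.
for reachable in (4, 3, 2):
    print(reachable, can_authorize_failover(4, reachable, quorum=3))
```

Under these assumptions, a 4-sentinel deployment can still fail over with 3 sentinels reachable but not with 2, which would apply whenever the sentinels go down together with their VMs.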

I'm going to work on hardening DR/HA with 1 server going down and up and see if that is enough to get me into prod.

I do agree that it would be great to have an announce-ip function for redis server. That would solve my issue.

rpannell commented 8 years ago

Do check your sentinel configs as well. I remember needing to clean up the bottom section of the config file after I pushed out the sentinels with the announce-ip address; I was getting some odd issues in test until I cleaned those entries up.
