Initial failover no longer happening with docker 18.06.1-ce and 18.06.0-ce

ypereirareis commented 6 years ago

Hi,

After upgrading to docker 18.06.1-ce, the initial failover no longer happens when redis-zero is removed:

Sentinel logs:

>docker logs -f cache_redis-sentinel.kv6tvkjf1pe4wti7iq1h909jo.2l557usypfqotzon8afqx1tfa
17:X 14 Sep 08:38:51.430 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
17:X 14 Sep 08:38:51.430 # Redis version=4.0.9, bits=64, commit=00000000, modified=0, pid=17, just started
17:X 14 Sep 08:38:51.430 # Configuration loaded
17:X 14 Sep 08:38:51.432 * Running mode=sentinel, port=26379.
17:X 14 Sep 08:38:51.432 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
17:X 14 Sep 08:38:51.442 # Sentinel ID is 499f4b1712ba02ac782d364eea425c755eafbe9c
17:X 14 Sep 08:38:51.443 # +monitor master redismaster 10.0.0.38 6379 quorum 2
17:X 14 Sep 08:38:53.778 * +sentinel sentinel 516b21ecc1df8f716ca97f45a00d91808edd05f8 10.0.0.43 26379 @ redismaster 10.0.0.38 6379
17:X 14 Sep 08:38:56.016 * +sentinel sentinel 45ab2152f72756d4808d2719b7efcca382dc549f 10.0.0.41 26379 @ redismaster 10.0.0.38 6379
17:X 14 Sep 08:39:01.527 * +slave slave 10.0.0.3:6379 10.0.0.3 6379 @ redismaster 10.0.0.38 6379
17:X 14 Sep 08:39:01.535 * +slave slave 10.0.0.4:6379 10.0.0.4 6379 @ redismaster 10.0.0.38 6379
17:X 14 Sep 08:39:01.537 * +slave slave 10.0.0.2:6379 10.0.0.2 6379 @ redismaster 10.0.0.38 6379
17:X 14 Sep 08:39:02.545 # +sdown slave 10.0.0.4:6379 10.0.0.4 6379 @ redismaster 10.0.0.38 6379
17:X 14 Sep 08:39:02.545 # +sdown slave 10.0.0.2:6379 10.0.0.2 6379 @ redismaster 10.0.0.38 6379
17:X 14 Sep 08:39:02.545 # +sdown slave 10.0.0.3:6379 10.0.0.3 6379 @ redismaster 10.0.0.38 6379

Redis logs:

>docker logs -f cache_redis.kv6tvkjf1pe4wti7iq1h909jo.97f3k98b4hiijljxx6gqfne7y
15:C 14 Sep 08:38:51.556 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
15:C 14 Sep 08:38:51.556 # Redis version=4.0.9, bits=64, commit=00000000, modified=0, pid=15, just started
15:C 14 Sep 08:38:51.556 # Configuration loaded
15:S 14 Sep 08:38:51.558 * Running mode=standalone, port=6379.
15:S 14 Sep 08:38:51.558 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
15:S 14 Sep 08:38:51.558 # Server initialized
15:S 14 Sep 08:38:51.558 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
15:S 14 Sep 08:38:51.558 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled.
15:S 14 Sep 08:38:51.558 * Ready to accept connections
15:S 14 Sep 08:38:51.558 * Connecting to MASTER 10.0.0.38:6379
15:S 14 Sep 08:38:51.558 * MASTER <-> SLAVE sync started
15:S 14 Sep 08:38:51.558 * Non blocking connect for SYNC fired the event.
15:S 14 Sep 08:38:51.559 * Master replied to PING, replication can continue...
15:S 14 Sep 08:38:51.559 * Partial resynchronization not possible (no cached master)
15:S 14 Sep 08:38:51.560 * Full resync from master: 5b0f1449d4de973cc59ee8fe4b299ca678fb947b:0
15:S 14 Sep 08:38:51.619 * MASTER <-> SLAVE sync: receiving 175 bytes from master
15:S 14 Sep 08:38:51.619 * MASTER <-> SLAVE sync: Flushing old data
15:S 14 Sep 08:38:51.619 * MASTER <-> SLAVE sync: Loading DB in memory
15:S 14 Sep 08:38:51.619 * MASTER <-> SLAVE sync: Finished with success

### REMOVING redis-zero HERE

15:S 14 Sep 08:39:40.293 # Connection with master lost.
15:S 14 Sep 08:39:40.298 * Caching the disconnected master state.
15:S 14 Sep 08:39:40.693 * Connecting to MASTER 10.0.0.38:6379
15:S 14 Sep 08:39:40.693 * MASTER <-> SLAVE sync started
15:S 14 Sep 08:39:51.020 # Error condition on socket for SYNC: Host is unreachable
15:S 14 Sep 08:39:51.727 * Connecting to MASTER 10.0.0.38:6379
15:S 14 Sep 08:39:51.728 * MASTER <-> SLAVE sync started
15:S 14 Sep 08:39:54.092 # Error condition on socket for SYNC: Host is unreachable
15:S 14 Sep 08:39:54.737 * Connecting to MASTER 10.0.0.38:6379
15:S 14 Sep 08:39:54.738 * MASTER <-> SLAVE sync started
15:S 14 Sep 08:39:57.164 # Error condition on socket for SYNC: Host is unreachable

This may be a docker bug, or maybe a backward-compatibility (BC) break in the latest docker releases. Do you have more info about this?

After downgrading to docker-ce=18.03.1~ce-0~debian, everything works fine.

Thanks for this project! :smiley:

thomasjpfan commented 6 years ago

The code was updated and the integration tests now use docker 18.06.1. One of the first things the integration tests do is remove the redis-zero service. Please try the updated release tagged 1.0.2-redis-4.0.11.

ypereirareis commented 6 years ago

Hi,

Thanks for the fix! What exactly was the problem with the slave IPs?

thomasjpfan commented 6 years ago

The slave IPs that the initial redis master, redis-zero, saw were all the same. When creating the 3 redis replicas, docker creates one IP to do round robin between the 3 instances. redis-zero was using this one IP to locate the slave instances on the overlay network. My update forces the slaves to announce their distinct IP addresses, so the master can distinguish between them.
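
As a rough illustration of the idea (a sketch only, not the project's actual entrypoint code), a replica can be started with Redis's `slave-announce-ip` option so that it reports its own address instead of the shared service IP. The helper names `replica_command` and `container_ip` are hypothetical, and resolving the container's hostname is just one assumed way to find its address; it may pick the wrong interface when the container is attached to several networks.

```python
import socket


def replica_command(master_host, announce_ip, master_port=6379):
    """Build a redis-server argv for a replica that announces its own IP.

    Hypothetical helper for illustration. `slaveof` and `slave-announce-ip`
    are standard Redis 4.0 configuration directives passed as CLI flags.
    """
    return [
        "redis-server",
        "--slaveof", master_host, str(master_port),
        "--slave-announce-ip", announce_ip,
    ]


def container_ip():
    """Best-effort lookup of this container's address via its hostname.

    Assumption: the hostname resolves to the overlay-network address.
    """
    return socket.gethostbyname(socket.gethostname())
```

For example, `replica_command("redis-zero", container_ip())` yields an argument list that could be handed to `subprocess.run` or `os.execvp` from a container entrypoint, so each of the 3 replicas reports a different address to the master and to Sentinel.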

thomasjpfan / redis-cluster-docker-swarm #4