
redis 集群 Replication is down, 但是 cluster 正常 #120

Open shenkonghui opened 3 years ago

shenkonghui commented 3 years ago

Running info replication shows the node pointing at the wrong master, and the link status is down:

$ redis-cli -a xxx -h 10.244.214.112 info replication
Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
# Replication
role:slave
master_host:10.244.61.221
master_port:6379
master_link_status:down
master_last_io_seconds_ago:-1
master_sync_in_progress:0
slave_repl_offset:448
master_link_down_since_seconds:1629687024
slave_priority:100
slave_read_only:1
connected_slaves:0
master_replid:6e6b9eb0ce197c06d0a665952eae7350b42a81fb
master_replid2:0000000000000000000000000000000000000000
master_repl_offset:448
second_repl_offset:-1
repl_backlog_active:0
repl_backlog_size:268435456
repl_backlog_first_byte_offset:0
repl_backlog_histlen:0
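
As a quick check, the same three fields can be pulled from every replica to see which links are broken. A minimal sketch, reusing the node addresses and the xxx password placeholder from this issue (the loop is just for convenience):

$ for host in 10.244.214.112 10.244.61.245 10.244.66.232; do echo "== $host =="; redis-cli -a xxx -h "$host" info replication | grep -E '^(role|master_host|master_link_status)'; done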

But running cluster nodes shows all nodes as connected and healthy:

$ redis-cli -a xxx -h 10.244.214.112 cluster nodes
717f4a3766f0670838a62191fa40d21836f30921 10.244.214.112:6379@16379 myself,slave 040876a705fcb904c7ffc942b10a6dd1fd47ebd6 0 1629687038000 0 connected
91d11fd6ccd654bea9f7bd5c42a911c4fb1d1187 10.244.61.245:6379@16379 slave 5c328030476fc5b02358a6e7f3b5f1acefba5777 0 1629687039000 1 connected
040876a705fcb904c7ffc942b10a6dd1fd47ebd6 10.244.61.232:6379@16379 master - 0 1629687040122 2 connected 5461-10922
5c328030476fc5b02358a6e7f3b5f1acefba5777 10.244.214.65:6379@16379 master - 0 1629687039118 1 connected 0-5460
b40479a5da458f1ee73aabe9a65e1068e448505b 10.244.66.232:6379@16379 slave 3d7f6f18c9ab4807b836ff654be17175331bdd00 0 1629687037110 4 connected
3d7f6f18c9ab4807b836ff654be17175331bdd00 10.244.214.68:6379@16379 master - 0 1629687038000 4 connected 10923-16383
$ redis-cli -a xxx --cluster check 10.244.214.112 6379
Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
10.244.61.232:6379 (040876a7...) -> 0 keys | 5462 slots | 1 slaves.
10.244.214.65:6379 (5c328030...) -> 0 keys | 5461 slots | 1 slaves.
10.244.214.68:6379 (3d7f6f18...) -> 0 keys | 5461 slots | 1 slaves.
[OK] 0 keys in 3 masters.
0.00 keys per slot on average.
>>> Performing Cluster Check (using node 10.244.214.112:6379)
S: 717f4a3766f0670838a62191fa40d21836f30921 10.244.214.112:6379
   slots: (0 slots) slave
   replicates 040876a705fcb904c7ffc942b10a6dd1fd47ebd6
S: 91d11fd6ccd654bea9f7bd5c42a911c4fb1d1187 10.244.61.245:6379
   slots: (0 slots) slave
   replicates 5c328030476fc5b02358a6e7f3b5f1acefba5777
M: 040876a705fcb904c7ffc942b10a6dd1fd47ebd6 10.244.61.232:6379
   slots:[5461-10922] (5462 slots) master
   1 additional replica(s)
M: 5c328030476fc5b02358a6e7f3b5f1acefba5777 10.244.214.65:6379
   slots:[0-5460] (5461 slots) master
   1 additional replica(s)
S: b40479a5da458f1ee73aabe9a65e1068e448505b 10.244.66.232:6379
   slots: (0 slots) slave
   replicates 3d7f6f18c9ab4807b836ff654be17175331bdd00
M: 3d7f6f18c9ab4807b836ff654be17175331bdd00 10.244.214.68:6379
   slots:[10923-16383] (5461 slots) master
   1 additional replica(s)
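
Note that the two views disagree on the master's address: cluster nodes maps the master this replica follows (040876a7...) to 10.244.61.232, while info replication is still trying to reach 10.244.61.221, which suggests the replication layer kept a stale pod IP. A quick way to compare the two (nothing assumed beyond the xxx password placeholder):

$ redis-cli -a xxx -h 10.244.214.112 cluster nodes | grep 040876a7
$ redis-cli -a xxx -h 10.244.214.112 info replication | grep -E 'master_host|master_link_status'
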
shenkonghui commented 3 years ago

Tested the failover feature: it works and the node recovers, but some data appears to have been lost (Redis does not guarantee strong consistency; when a replica lags behind and a failover happens, the unreplicated writes are lost). Testing confirms it is the replication function that is broken.
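
For reference, a manual failover is driven from the replica side. A sketch using the same node and password placeholder as above; since the link to the master is down here, plain CLUSTER FAILOVER (which negotiates with the master) would not complete, so the FORCE variant would be needed:

$ redis-cli -a xxx -h 10.244.214.112 cluster failover force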

shenkonghui commented 3 years ago

After restarting that pod, the node recovers. Logs below:

1:M 24 Aug 2021 10:46:48.838 * DB loaded from disk: 0.000 seconds
1:M 24 Aug 2021 10:46:48.838 * Before turning into a replica, using my master parameters to synthesize a cached master: I may be able to synchronize with the new master with just a partial transfer.
1:M 24 Aug 2021 10:46:48.838 * Ready to accept connections
1:S 24 Aug 2021 10:46:48.839 * Discarding previously cached master state.
1:S 24 Aug 2021 10:46:48.839 * Before turning into a replica, using my master parameters to synthesize a cached master: I may be able to synchronize with the new master with just a partial transfer.
1:S 24 Aug 2021 10:46:48.839 # Cluster state changed: ok
1:S 24 Aug 2021 10:46:49.843 * Connecting to MASTER 10.244.102.165:6379
1:S 24 Aug 2021 10:46:49.844 * MASTER <-> REPLICA sync started
1:S 24 Aug 2021 10:46:49.844 * Non blocking connect for SYNC fired the event.
1:S 24 Aug 2021 10:46:49.845 * Master replied to PING, replication can continue...
1:S 24 Aug 2021 10:46:49.846 * Trying a partial resynchronization (request cf007b56204d7eb04bd5a7a00b27c395b25f5235:267).
1:S 24 Aug 2021 10:46:49.847 * Full resync from master: 68f72905a9d0105cb9fa1edf001c97e9bce64bcb:0
1:S 24 Aug 2021 10:46:49.847 * Discarding previously cached master state.
1:S 24 Aug 2021 10:46:49.917 * MASTER <-> REPLICA sync: receiving 175 bytes from master
1:S 24 Aug 2021 10:46:49.917 * MASTER <-> REPLICA sync: Flushing old data
1:S 24 Aug 2021 10:46:49.917 * MASTER <-> REPLICA sync: Loading DB in memory
1:S 24 Aug 2021 10:46:49.917 * MASTER <-> REPLICA sync: Finished with success
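
In the log the replica asks for a partial resynchronization but gets a full resync from offset 0 under a new replication ID, which fits the master having restarted as well (see the next comment). The restart here was done at the pod level; a sketch of the equivalent kubectl step, with hypothetical namespace and pod names since neither appears in this issue:

$ kubectl -n <namespace> delete pod <redis-replica-pod>   # hypothetical names, not taken from this issue
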
shenkonghui commented 3 years ago

The root cause is the same as #80: the master and the slave were restarted at the same time. Running the CLUSTER MEET command brings the cluster nodes state back to normal, but some nodes still end up with the abnormal info replication state described above.
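
For completeness, the meet step looks like this. A sketch only: the address is the master's current one from the cluster nodes output above, and which node actually needs the meet depends on which views have diverged:

$ redis-cli -a xxx -h 10.244.214.112 cluster meet 10.244.61.232 6379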