Closed wangrzneu closed 4 years ago
@pintsized Please help to review this PR
Hi. This breaks the tests it seems - specifically if you remove a slave then it is now still returned from get_slaves
.
Can you explain a bit more about the case you're trying to fix? So the master goes down, and until a slave is promoted you can't enumerate the slaves? Are you sure that's a bug? I'm happy to be convinced, but we need working tests.
Hi @pintsized - I recently ran into a similar scenario as the one outlined in this PR. Specifically, the client kept on returning a slave which no longer existed. I retrieved replica information by performing a sentinel replica
action on the master and have the following results returned:
1) 1) "name"
2) "100.123.184.69:6379"
3) "ip"
4) "100.123.184.69"
5) "port"
6) "6379"
7) "runid"
8) "9ee6ce3f8d83dbf351265867337979ca395b6267"
9) "flags"
10) "slave"
11) "link-pending-commands"
12) "0"
13) "link-refcount"
14) "1"
15) "last-ping-sent"
16) "0"
17) "last-ok-ping-reply"
18) "5"
19) "last-ping-reply"
20) "5"
21) "down-after-milliseconds"
22) "3000"
23) "info-refresh"
24) "8850"
25) "role-reported"
26) "slave"
27) "role-reported-time"
28) "1684062408"
29) "master-link-down-time"
30) "0"
31) "master-link-status"
32) "ok"
33) "master-host"
34) "100.126.237.157"
35) "master-port"
36) "6379"
37) "slave-priority"
38) "100"
39) "slave-repl-offset"
40) "2371510296"
2) 1) "name"
2) "100.96.34.111:6379"
3) "ip"
4) "100.96.34.111"
5) "port"
6) "6379"
7) "runid"
8) ""
9) "flags"
10) "s_down,slave,disconnected"
11) "link-pending-commands"
12) "3"
13) "link-refcount"
14) "1"
15) "last-ping-sent"
16) "3011830468"
17) "last-ok-ping-reply"
18) "3011830468"
19) "last-ping-reply"
20) "3011830468"
21) "s-down-time"
22) "3011827434"
23) "down-after-milliseconds"
24) "3000"
25) "info-refresh"
26) "1598467752378"
27) "role-reported"
28) "slave"
29) "role-reported-time"
30) "3011830468"
31) "master-link-down-time"
32) "0"
33) "master-link-status"
34) "err"
35) "master-host"
36) "?"
37) "master-port"
38) "0"
39) "slave-priority"
40) "100"
41) "slave-repl-offset"
42) "0"
3) 1) "name"
2) "100.122.217.79:6379"
3) "ip"
4) "100.122.217.79"
5) "port"
6) "6379"
7) "runid"
8) "edc3ba59a74b53c5428520d0217e78d08ae6be39"
9) "flags"
10) "s_down,slave"
11) "link-pending-commands"
12) "101"
13) "link-refcount"
14) "1"
15) "last-ping-sent"
16) "1815258279"
17) "last-ok-ping-reply"
18) "1815259288"
19) "last-ping-reply"
20) "1815259288"
21) "s-down-time"
22) "1815255275"
23) "down-after-milliseconds"
24) "3000"
25) "info-refresh"
26) "1815265665"
27) "role-reported"
28) "slave"
29) "role-reported-time"
30) "3011796858"
31) "master-link-down-time"
32) "0"
33) "master-link-status"
34) "ok"
35) "master-host"
36) "100.126.237.157"
37) "master-port"
38) "6379"
39) "slave-priority"
40) "100"
41) "slave-repl-offset"
42) "1814925860"
4) 1) "name"
2) "100.112.170.166:6379"
3) "ip"
4) "100.112.170.166"
5) "port"
6) "6379"
7) "runid"
8) "d1319a4fa56cd72a4099385d9186c74205f5e1cb"
9) "flags"
10) "s_down,slave"
11) "link-pending-commands"
12) "101"
13) "link-refcount"
14) "1"
15) "last-ping-sent"
16) "1684104760"
17) "last-ok-ping-reply"
18) "1684105595"
19) "last-ping-reply"
20) "1684105595"
21) "s-down-time"
22) "1684101749"
23) "down-after-milliseconds"
24) "3000"
25) "info-refresh"
26) "1684108317"
27) "role-reported"
28) "slave"
29) "role-reported-time"
30) "1815069451"
31) "master-link-down-time"
32) "0"
33) "master-link-status"
34) "ok"
35) "master-host"
36) "100.126.237.157"
37) "master-port"
38) "6379"
39) "slave-priority"
40) "100"
41) "slave-repl-offset"
42) "1855293823"
5) 1) "name"
2) "100.122.217.124:6379"
3) "ip"
4) "100.122.217.124"
5) "port"
6) "6379"
7) "runid"
8) ""
9) "flags"
10) "s_down,slave"
11) "link-pending-commands"
12) "93"
13) "link-refcount"
14) "1"
15) "last-ping-sent"
16) "3011830468"
17) "last-ok-ping-reply"
18) "3011830468"
19) "last-ping-reply"
20) "3011830468"
21) "s-down-time"
22) "3011827434"
23) "down-after-milliseconds"
24) "3000"
25) "info-refresh"
26) "1598467752378"
27) "role-reported"
28) "slave"
29) "role-reported-time"
30) "3011830468"
31) "master-link-down-time"
32) "0"
33) "master-link-status"
34) "err"
35) "master-host"
36) "?"
37) "master-port"
38) "0"
39) "slave-priority"
40) "100"
41) "slave-repl-offset"
42) "0"
6) 1) "name"
2) "100.124.93.111:6379"
3) "ip"
4) "100.124.93.111"
5) "port"
6) "6379"
7) "runid"
8) "5edcbaeebe449307788a75956b2d1723601938c5"
9) "flags"
10) "s_down,slave"
11) "link-pending-commands"
12) "101"
13) "link-refcount"
14) "1"
15) "last-ping-sent"
16) "1832777937"
17) "last-ok-ping-reply"
18) "1832778953"
19) "last-ping-reply"
20) "1832778953"
21) "s-down-time"
22) "1832774901"
23) "down-after-milliseconds"
24) "3000"
25) "info-refresh"
26) "1832780780"
27) "role-reported"
28) "slave"
29) "role-reported-time"
30) "3011830468"
31) "master-link-down-time"
32) "0"
33) "master-link-status"
34) "ok"
35) "master-host"
36) "100.126.237.157"
37) "master-port"
38) "6379"
39) "slave-priority"
40) "100"
41) "slave-repl-offset"
42) "1809529834"
7) 1) "name"
2) "100.114.69.29:6379"
3) "ip"
4) "100.114.69.29"
5) "port"
6) "6379"
7) "runid"
8) "586419cadae608acefd0a6a4346689c1dd8c1d43"
9) "flags"
10) "slave"
11) "link-pending-commands"
12) "0"
13) "link-refcount"
14) "1"
15) "last-ping-sent"
16) "0"
17) "last-ok-ping-reply"
18) "5"
19) "last-ping-reply"
20) "5"
21) "down-after-milliseconds"
22) "3000"
23) "info-refresh"
24) "5421"
25) "role-reported"
26) "slave"
27) "role-reported-time"
28) "1832556335"
29) "master-link-down-time"
30) "0"
31) "master-link-status"
32) "ok"
33) "master-host"
34) "100.126.237.157"
35) "master-port"
36) "6379"
37) "slave-priority"
38) "100"
39) "slave-repl-offset"
40) "2371511544"
8) 1) "name"
2) "100.126.237.171:6379"
3) "ip"
4) "100.126.237.171"
5) "port"
6) "6379"
7) "runid"
8) "e8d13289f797e9e2d086d24a97fd0d4c90ff7482"
9) "flags"
10) "slave"
11) "link-pending-commands"
12) "0"
13) "link-refcount"
14) "1"
15) "last-ping-sent"
16) "0"
17) "last-ok-ping-reply"
18) "6"
19) "last-ping-reply"
20) "6"
21) "down-after-milliseconds"
22) "3000"
23) "info-refresh"
24) "9068"
25) "role-reported"
26) "slave"
27) "role-reported-time"
28) "3011826927"
29) "master-link-down-time"
30) "0"
31) "master-link-status"
32) "ok"
33) "master-host"
34) "100.126.237.157"
35) "master-port"
36) "6379"
37) "slave-priority"
38) "100"
39) "slave-repl-offset"
40) "2371510296"
These results are interesting because at the time the action was performed there was 1 master and 3 healthy/running slaves. However, looking at these results above you will notice that 5 of the slaves are marked as s_down
, but 3 of those have master-link-status
set as ok
. I was able to verify that all of the slaves which have a flag of s_down
no longer exist/are actually down. Therefore, it seems it is a valid scenario to be in where a slave is no longer available, yet their master-link-status
is still set to ok
. Given this, would you be open to me re-submitting a pull request where we check for the s_down
flag in addition to the master-link-status
of each slave?
For what its worth, other client libraries seem to have faced similar problems and come to the same conclusion: https://github.com/redis/redis-rb/issues/827
This was fixed by #36
"flags" should be used to check whether the slave can be visited instead of master-link-status. Whe the master is down, master-link-status of all slaves is "down".