ledgetech / lua-resty-redis-connector

Connection utilities for lua-resty-redis
234 stars 71 forks source link

fix bugs in get_slaves #32

Closed wangrzneu closed 4 years ago

wangrzneu commented 4 years ago

"flags" should be used to check whether the slave can be visited instead of master-link-status. Whe the master is down, master-link-status of all slaves is "down".

wangrzneu commented 4 years ago

@pintsized Please help to review this PR

pintsized commented 4 years ago

Hi. This breaks the tests it seems - specifically if you remove a slave then it is now still returned from get_slaves.

Can you explain a bit more about the case you're trying to fix? So the master goes down, and until a slave is promoted you can't enumerate the slaves? Are you sure that's a bug? I'm happy to be convinced, but we need working tests.

ryaneorth commented 4 years ago

Hi @pintsized - I recently ran into a similar scenario as the one outlined in this PR. Specifically, the client kept on returning a slave which no longer existed. I retrieved replica information by performing a sentinel replica action on the master and have the following results returned:

1)  1) "name"
    2) "100.123.184.69:6379"
    3) "ip"
    4) "100.123.184.69"
    5) "port"
    6) "6379"
    7) "runid"
    8) "9ee6ce3f8d83dbf351265867337979ca395b6267"
    9) "flags"
   10) "slave"
   11) "link-pending-commands"
   12) "0"
   13) "link-refcount"
   14) "1"
   15) "last-ping-sent"
   16) "0"
   17) "last-ok-ping-reply"
   18) "5"
   19) "last-ping-reply"
   20) "5"
   21) "down-after-milliseconds"
   22) "3000"
   23) "info-refresh"
   24) "8850"
   25) "role-reported"
   26) "slave"
   27) "role-reported-time"
   28) "1684062408"
   29) "master-link-down-time"
   30) "0"
   31) "master-link-status"
   32) "ok"
   33) "master-host"
   34) "100.126.237.157"
   35) "master-port"
   36) "6379"
   37) "slave-priority"
   38) "100"
   39) "slave-repl-offset"
   40) "2371510296"
2)  1) "name"
    2) "100.96.34.111:6379"
    3) "ip"
    4) "100.96.34.111"
    5) "port"
    6) "6379"
    7) "runid"
    8) ""
    9) "flags"
   10) "s_down,slave,disconnected"
   11) "link-pending-commands"
   12) "3"
   13) "link-refcount"
   14) "1"
   15) "last-ping-sent"
   16) "3011830468"
   17) "last-ok-ping-reply"
   18) "3011830468"
   19) "last-ping-reply"
   20) "3011830468"
   21) "s-down-time"
   22) "3011827434"
   23) "down-after-milliseconds"
   24) "3000"
   25) "info-refresh"
   26) "1598467752378"
   27) "role-reported"
   28) "slave"
   29) "role-reported-time"
   30) "3011830468"
   31) "master-link-down-time"
   32) "0"
   33) "master-link-status"
   34) "err"
   35) "master-host"
   36) "?"
   37) "master-port"
   38) "0"
   39) "slave-priority"
   40) "100"
   41) "slave-repl-offset"
   42) "0"
3)  1) "name"
    2) "100.122.217.79:6379"
    3) "ip"
    4) "100.122.217.79"
    5) "port"
    6) "6379"
    7) "runid"
    8) "edc3ba59a74b53c5428520d0217e78d08ae6be39"
    9) "flags"
   10) "s_down,slave"
   11) "link-pending-commands"
   12) "101"
   13) "link-refcount"
   14) "1"
   15) "last-ping-sent"
   16) "1815258279"
   17) "last-ok-ping-reply"
   18) "1815259288"
   19) "last-ping-reply"
   20) "1815259288"
   21) "s-down-time"
   22) "1815255275"
   23) "down-after-milliseconds"
   24) "3000"
   25) "info-refresh"
   26) "1815265665"
   27) "role-reported"
   28) "slave"
   29) "role-reported-time"
   30) "3011796858"
   31) "master-link-down-time"
   32) "0"
   33) "master-link-status"
   34) "ok"
   35) "master-host"
   36) "100.126.237.157"
   37) "master-port"
   38) "6379"
   39) "slave-priority"
   40) "100"
   41) "slave-repl-offset"
   42) "1814925860"
4)  1) "name"
    2) "100.112.170.166:6379"
    3) "ip"
    4) "100.112.170.166"
    5) "port"
    6) "6379"
    7) "runid"
    8) "d1319a4fa56cd72a4099385d9186c74205f5e1cb"
    9) "flags"
   10) "s_down,slave"
   11) "link-pending-commands"
   12) "101"
   13) "link-refcount"
   14) "1"
   15) "last-ping-sent"
   16) "1684104760"
   17) "last-ok-ping-reply"
   18) "1684105595"
   19) "last-ping-reply"
   20) "1684105595"
   21) "s-down-time"
   22) "1684101749"
   23) "down-after-milliseconds"
   24) "3000"
   25) "info-refresh"
   26) "1684108317"
   27) "role-reported"
   28) "slave"
   29) "role-reported-time"
   30) "1815069451"
   31) "master-link-down-time"
   32) "0"
   33) "master-link-status"
   34) "ok"
   35) "master-host"
   36) "100.126.237.157"
   37) "master-port"
   38) "6379"
   39) "slave-priority"
   40) "100"
   41) "slave-repl-offset"
   42) "1855293823"
5)  1) "name"
    2) "100.122.217.124:6379"
    3) "ip"
    4) "100.122.217.124"
    5) "port"
    6) "6379"
    7) "runid"
    8) ""
    9) "flags"
   10) "s_down,slave"
   11) "link-pending-commands"
   12) "93"
   13) "link-refcount"
   14) "1"
   15) "last-ping-sent"
   16) "3011830468"
   17) "last-ok-ping-reply"
   18) "3011830468"
   19) "last-ping-reply"
   20) "3011830468"
   21) "s-down-time"
   22) "3011827434"
   23) "down-after-milliseconds"
   24) "3000"
   25) "info-refresh"
   26) "1598467752378"
   27) "role-reported"
   28) "slave"
   29) "role-reported-time"
   30) "3011830468"
   31) "master-link-down-time"
   32) "0"
   33) "master-link-status"
   34) "err"
   35) "master-host"
   36) "?"
   37) "master-port"
   38) "0"
   39) "slave-priority"
   40) "100"
   41) "slave-repl-offset"
   42) "0"
6)  1) "name"
    2) "100.124.93.111:6379"
    3) "ip"
    4) "100.124.93.111"
    5) "port"
    6) "6379"
    7) "runid"
    8) "5edcbaeebe449307788a75956b2d1723601938c5"
    9) "flags"
   10) "s_down,slave"
   11) "link-pending-commands"
   12) "101"
   13) "link-refcount"
   14) "1"
   15) "last-ping-sent"
   16) "1832777937"
   17) "last-ok-ping-reply"
   18) "1832778953"
   19) "last-ping-reply"
   20) "1832778953"
   21) "s-down-time"
   22) "1832774901"
   23) "down-after-milliseconds"
   24) "3000"
   25) "info-refresh"
   26) "1832780780"
   27) "role-reported"
   28) "slave"
   29) "role-reported-time"
   30) "3011830468"
   31) "master-link-down-time"
   32) "0"
   33) "master-link-status"
   34) "ok"
   35) "master-host"
   36) "100.126.237.157"
   37) "master-port"
   38) "6379"
   39) "slave-priority"
   40) "100"
   41) "slave-repl-offset"
   42) "1809529834"
7)  1) "name"
    2) "100.114.69.29:6379"
    3) "ip"
    4) "100.114.69.29"
    5) "port"
    6) "6379"
    7) "runid"
    8) "586419cadae608acefd0a6a4346689c1dd8c1d43"
    9) "flags"
   10) "slave"
   11) "link-pending-commands"
   12) "0"
   13) "link-refcount"
   14) "1"
   15) "last-ping-sent"
   16) "0"
   17) "last-ok-ping-reply"
   18) "5"
   19) "last-ping-reply"
   20) "5"
   21) "down-after-milliseconds"
   22) "3000"
   23) "info-refresh"
   24) "5421"
   25) "role-reported"
   26) "slave"
   27) "role-reported-time"
   28) "1832556335"
   29) "master-link-down-time"
   30) "0"
   31) "master-link-status"
   32) "ok"
   33) "master-host"
   34) "100.126.237.157"
   35) "master-port"
   36) "6379"
   37) "slave-priority"
   38) "100"
   39) "slave-repl-offset"
   40) "2371511544"
8)  1) "name"
    2) "100.126.237.171:6379"
    3) "ip"
    4) "100.126.237.171"
    5) "port"
    6) "6379"
    7) "runid"
    8) "e8d13289f797e9e2d086d24a97fd0d4c90ff7482"
    9) "flags"
   10) "slave"
   11) "link-pending-commands"
   12) "0"
   13) "link-refcount"
   14) "1"
   15) "last-ping-sent"
   16) "0"
   17) "last-ok-ping-reply"
   18) "6"
   19) "last-ping-reply"
   20) "6"
   21) "down-after-milliseconds"
   22) "3000"
   23) "info-refresh"
   24) "9068"
   25) "role-reported"
   26) "slave"
   27) "role-reported-time"
   28) "3011826927"
   29) "master-link-down-time"
   30) "0"
   31) "master-link-status"
   32) "ok"
   33) "master-host"
   34) "100.126.237.157"
   35) "master-port"
   36) "6379"
   37) "slave-priority"
   38) "100"
   39) "slave-repl-offset"
   40) "2371510296"

These results are interesting because at the time the action was performed there was 1 master and 3 healthy/running slaves. However, looking at these results above you will notice that 5 of the slaves are marked as s_down, but 3 of those have master-link-status set as ok. I was able to verify that all of the slaves which have a flag of s_down no longer exist/are actually down. Therefore, it seems it is a valid scenario to be in where a slave is no longer available, yet their master-link-status is still set to ok. Given this, would you be open to me re-submitting a pull request where we check for the s_down flag in addition to the master-link-status of each slave?

For what its worth, other client libraries seem to have faced similar problems and come to the same conclusion: https://github.com/redis/redis-rb/issues/827

pintsized commented 4 years ago

This was fixed by #36