master connect timeout needed for replication SLAVEOF command to prevent DoS

GoogleCodeExporter commented 8 years ago

What steps will reproduce the problem?
1. start redis
2. initiate replication to a firewalled port on a remote host
3. wait

What is the expected output? What do you see instead?

If you have two hosts running redis, A and B, separated by a firewall, and 
you tell B to replicate from A but get the port wrong, redis-server on B 
becomes unresponsive.

Example:

---snip---

$ telnet localhost 63790
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
slaveof 10.134.26.1 63790
+OK
quit
Connection closed by foreign host.

---snip---

In that example, localhost is the slave (node B).  There was a multi-minute 
delay between the "+OK" response and the "quit" being handled.  Port 63790 
is firewalled between the two, so B will never see a connection refused from 
A and has to wait for the connect attempt to time out.

In another window, I do this on node B (the slave):

---snip---

$ telnet localhost 63790
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
info
$464
redis_version:1.2.5
arch_bits:64
multiplexing_api:epoll
uptime_in_seconds:59274
uptime_in_days:0
connected_clients:1
connected_slaves:0
used_memory:625905
used_memory_human:611.24K
changes_since_last_save:0
bgsave_in_progress:0
last_save_time:1268436817
bgrewriteaof_in_progress:0
total_connections_received:2
total_commands_processed:1
role:slave
master_host:10.134.26.1
master_port:63790
master_link_status:down
master_last_io_seconds_ago:-1

---snip---

In this case, the "info" response was delayed by several minutes.  In other 
words, the entire redis-server appears to be unresponsive while there's a 
pending "slaveof" command trying to run.

What version of the product are you using? On what operating system?

1.2.5 (patched for the tmpfile replication issue) on 64-bit openSUSE.

Please provide any additional information below.

It is my belief that "SLAVEOF" is implemented synchronously, so the connect 
to the master blocks the whole server.  That's OK given what it does, but it 
also means the initial connect timeout and the number of retries should both 
be (a) low, and (b) configurable by the administrator.

Otherwise, the SLAVEOF command becomes an easy DoS vector.

Original issue reported on code.google.com by jzaw...@gmail.com on 13 Mar 2010 at 4:07

GoogleCodeExporter commented 8 years ago

Adding to this, we were bitten again by this issue recently.

Ideally, I'd like two config file vars to help.  One would set the connect 
timeout between a slave and its master (letting the OS decide waits far too 
long!).  The other would set the number of retries the slave will attempt, 
which would help to mitigate this as well.  In our environment, I want the 
slave to have a 3 second connect timeout to the master and to try at most 3 
times before giving up.  Then our monitoring system can catch it and get a 
human involved.

Otherwise, the slave is mostly unresponsive (long pauses in responses to other 
commands) while it's waiting for the timeout to fire during socket connection.
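
To make the request concrete, here is a minimal Python sketch of the behaviour 
I'm asking for: a bounded connect timeout and a bounded number of attempts. 
It is only an illustration, not how redis does or should implement it; the 
function name connect_to_master and its parameters are made up for this 
example.

```python
import socket
import time


def connect_to_master(host, port, timeout=3.0, retries=3, pause=0.0):
    """Try to open a TCP connection, giving up after `retries` attempts.

    Returns the connected socket, or None if every attempt failed.
    All names here are hypothetical; this is not redis code.
    """
    for attempt in range(1, retries + 1):
        try:
            # create_connection applies `timeout` to the connect itself, so a
            # silently dropped SYN costs at most `timeout` seconds per attempt.
            return socket.create_connection((host, port), timeout=timeout)
        except OSError:
            # Covers timeouts, refusals and unreachable networks alike.
            if attempt < retries and pause > 0:
                time.sleep(pause)
    return None
```

With timeout=3 and retries=3, a firewalled port costs roughly 9 seconds in 
the worst case instead of several minutes, after which the caller gets None 
and monitoring can take over.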

Original comment by jzaw...@gmail.com on 30 Jul 2010 at 3:36

GoogleCodeExporter commented 8 years ago

FWIW, I'd call these master_connect_timeout and master_connect_retries (or 
something similar).

Original comment by jzaw...@gmail.com on 30 Jul 2010 at 3:40

GoogleCodeExporter commented 8 years ago

Issue accepted, this is a very bad thing... either the reconnection should be 
made async via the event loop (a bit more complex code-wise but probably the 
very best approach after all) or it should have a sane timeout.
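
A rough sketch of the async option, in Python rather than redis's C and 
assuming a POSIX system: the socket is put in non-blocking mode, the connect 
is started, and the event loop finishes or expires it on later ticks while 
other clients keep being served. The names start_connect, poll_once and 
on_done are invented for the example, and the selector is assumed to hold 
only pending connects.

```python
import errno
import selectors
import socket
import time


def start_connect(sel, host, port, timeout, on_done):
    """Begin a non-blocking connect and register it with the event loop."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setblocking(False)
    err = sock.connect_ex((host, port))
    if err not in (0, errno.EINPROGRESS, errno.EWOULDBLOCK):
        # Immediate failure (for example, no route to host).
        sock.close()
        on_done(None, err)
        return
    deadline = time.monotonic() + timeout
    # The socket becomes writable once the connect succeeds or fails.
    sel.register(sock, selectors.EVENT_WRITE, (deadline, on_done))


def poll_once(sel, wait=0.1):
    """One event-loop tick: finish ready connects, expire overdue ones."""
    for key, _ in sel.select(wait):
        sock = key.fileobj
        _, on_done = key.data
        sel.unregister(sock)
        # SO_ERROR reports the outcome of the pending connect.
        err = sock.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)
        if err == 0:
            on_done(sock, 0)
        else:
            sock.close()
            on_done(None, err)
    now = time.monotonic()
    # Sketch assumption: every registered socket is a pending connect.
    for key in list(sel.get_map().values()):
        deadline, on_done = key.data
        if now >= deadline:
            sel.unregister(key.fileobj)
            key.fileobj.close()
            on_done(None, errno.ETIMEDOUT)
```

The point of the design is that no single call waits longer than the `wait` 
passed to poll_once, so a firewalled master delays only the replication link 
and not every other command.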

@jzawodn: about the max number of attempts, the problem is that a slave that 
lost the connection will, after the N attempts, become a pretty "strange" node 
in the network. It should at least refuse client commands when the master 
link is no longer active... (only allowing the SLAVEOF and INFO commands to be 
issued).
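
As a sketch of that gating idea only (Python, hypothetical names, and assuming 
the replication command meant is SLAVEOF), the check could look like this:

```python
# Commands still accepted while the master link is down. Which commands belong
# here is an assumption taken from the comment above, not existing behaviour.
ALLOWED_WHEN_DETACHED = {"SLAVEOF", "INFO"}


def should_accept(command, role, master_link_up):
    """Return True if a node in this state should run `command`."""
    if role != "slave" or master_link_up:
        # Masters and healthy slaves are not restricted by this check.
        return True
    return command.upper() in ALLOWED_WHEN_DETACHED
```

A slave that gave up reconnecting would then still answer INFO for monitoring 
and accept a new SLAVEOF from an operator, while refusing everything else.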

Definitely something to fix, but I'm still unsure about the right thing to do...

Original comment by anti...@gmail.com on 27 Aug 2010 at 10:52

Changed state: Accepted

swatantra / redis

master connect timeout needed for replication SLAVEOF command to prevent DoS #181