Temporary packet loss causes permanent node hang

ha / doozerd

A consistent distributed data store.

MIT License


With the firedrill 3-node setup, dropping packets for more than 5 seconds causes one or more of the nodes to get kicked out of the cluster. The victim doesn't realize this happened and just hangs, even after network connectivity is restored. To reproduce:

sudo iptables -I INPUT --proto udp --dport 8047 -j DROP; sleep 7; sudo iptables -D INPUT --proto udp --dport 8047 -j DROP

Whether I block packets in one direction or both doesn't seem to affect the behavior.

Interestingly, temporarily blocking the node on port 8047 often causes a different node to get kicked. My latest run actually kicked the nodes on ports 8046 and 8048, thus turning a temporary single-node outage into a cluster failure (as the mailing list has told me, doozer doesn't recover from loss of quorum).
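To make the quorum arithmetic concrete, here is a minimal sketch. It assumes the cluster needs a strict majority of its members to make progress, which is the usual rule for Paxos-style consensus but has not been verified against the doozerd source. The function names `majority` and `has_quorum` are hypothetical and exist only for this illustration.

```python
def majority(cluster_size: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return cluster_size // 2 + 1


def has_quorum(cluster_size: int, live_nodes: int) -> bool:
    """True if the live nodes can still form a majority (assumed rule)."""
    return live_nodes >= majority(cluster_size)


if __name__ == "__main__":
    # firedrill setup: 3 nodes, so a majority is 2 under the assumed rule.
    print("majority of 3:", majority(3))
    # One node kicked: the remaining two can still form a majority.
    print("2 of 3 live:", has_quorum(3, 2))
    # Two nodes kicked (8046 and 8048 in the run above): no majority left.
    print("1 of 3 live:", has_quorum(3, 1))
```

Under that assumption, a 3-node cluster tolerates exactly one lost node, so kicking two nodes in response to a single blocked port is enough to stall the whole cluster.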

The kicked node never recovers unless the whole process is restarted, though this might be related to #44.

DOOZER 2013/06/05 15:29:50.607045 p.seqn=473 m.next=181
DOOZER 2013/06/05 15:29:50.617452 p.seqn=473 m.next=181
DOOZER 2013/06/05 15:29:50.627544 p.seqn=473 m.next=181
DOOZER 2013/06/05 15:29:50.637069 p.seqn=473 m.next=181
DOOZER 2013/06/05 15:29:50.647252 p.seqn=473 m.next=181
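The log excerpt shows the same `p.seqn` / `m.next` pair repeating roughly every 10 ms. Below is a rough sketch of how that pattern could be spotted automatically when watching a node's log. It does not interpret what the two fields mean; it only flags a run of identical pairs. The names `parse_pairs` and `looks_stuck` and the `window` parameter are hypothetical, and the regular expression assumes the log format shown above.

```python
import re
from typing import Iterable, List, Tuple

# Matches the "p.seqn=<int> m.next=<int>" fragment seen in the log above.
PAIR = re.compile(r"p\.seqn=(\d+)\s+m\.next=(\d+)")


def parse_pairs(lines: Iterable[str]) -> List[Tuple[int, int]]:
    """Extract every (p.seqn, m.next) pair, in order of appearance."""
    pairs = []
    for line in lines:
        # findall also copes with several entries flattened onto one line.
        for seqn, nxt in PAIR.findall(line):
            pairs.append((int(seqn), int(nxt)))
    return pairs


def looks_stuck(pairs: List[Tuple[int, int]], window: int = 5) -> bool:
    """True if the last `window` pairs are all identical."""
    if len(pairs) < window:
        return False
    recent = pairs[-window:]
    return all(p == recent[0] for p in recent)


if __name__ == "__main__":
    sample = [
        "DOOZER 2013/06/05 15:29:50.607045 p.seqn=473 m.next=181",
        "DOOZER 2013/06/05 15:29:50.617452 p.seqn=473 m.next=181",
        "DOOZER 2013/06/05 15:29:50.627544 p.seqn=473 m.next=181",
        "DOOZER 2013/06/05 15:29:50.637069 p.seqn=473 m.next=181",
        "DOOZER 2013/06/05 15:29:50.647252 p.seqn=473 m.next=181",
    ]
    print("stuck:", looks_stuck(parse_pairs(sample)))
```

A window of five entries covers only about 50 ms at the rate seen here, so a real check would probably want a much larger window or a time-based threshold before declaring a node hung.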

Temporary packet loss causes permanent node hang #59