prashant-r / Scalaris

the ping messages to the crashed nodes won't stop #73

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
What steps will reproduce the problem?
1. Configuration: scalaris.cfg is left unchanged. scalaris.local.cfg contains:

{mgmt_server, {{127,0,0,1}, 14195, mgmt_server}}.
{known_hosts, [{{127,0,0,1}, 14195, service_per_vm}]}.

2. Start four nodes on a single PC (this also happens when the nodes are 
distributed across several PCs):
first_node.sh
join_node.sh
join_node.sh 2
join_node.sh 3

What is the expected output? What do you see instead?
When node 2 crashes, the other three nodes print the following messages 
repeatedly and never stop:

[warn] "to > 127.0.0.1:14197" Connection failed, drop message 
{ping,{{{127,0,0,1},14195,<0.84.0>},c,{{127,0,0,1},14197,<10301.115.0>}}}
[warn] "to > 127.0.0.1:14197" Connection failed, drop message 
{ping,{{{127,0,0,1},14195,<0.106.0>},
       c,
       {node,{{127,0,0,1},14197,<10301.115.0>},
             40695820333747461610959776268268129841,0}}}

When node 2 rejoins (bin/join_node.sh 2), the other nodes stop printing the 
above messages, but node 2 keeps printing the following message:

[error] Discarding message 
{ping,{{{127,0,0,1},14195,<10043.106.0>},c,{node,{{127,0,0,1},14197,<0.115.0>},
40695820333747461610959776268268129841,0}}} from <0.119.0> to <0.115.0> in an 
old incarnation (3) of this node (2)

It seems the other nodes are still trying to ping node 2.

I wonder whether a cluster would be flooded with such messages once it is 
large enough.

What version of the product are you using? On what operating system?

version: scalaris-0.3.0(tar.gz)
os: ubuntu10.10
erlang-version: R14B02

Original issue reported on code.google.com by suleed....@gmail.com on 9 Sep 2011 at 10:15

GoogleCodeExporter commented 8 years ago
The ping messages come from the dead node cache, which (after a node has been 
branded "crashed") tries to reach that node again, since the failure could be 
temporary.
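To illustrate the idea only, here is a minimal sketch of such a cache. It is not the Scalaris implementation: the module name `dn_cache_sketch`, its functions, and the bounded-size eviction are assumptions made for this example.

```erlang
-module(dn_cache_sketch).
-export([new/1, add_crashed/2, nodes_to_ping/1, pong/2]).

%% Hypothetical sketch of a dead node cache; NOT the Scalaris module.
%% State is {MaxSize, Nodes} with the most recently crashed node first.

%% Create an empty cache that remembers at most MaxSize crashed nodes.
new(MaxSize) when is_integer(MaxSize), MaxSize > 0 ->
    {MaxSize, []}.

%% Record a node that was reported as crashed.
%% A node already in the cache moves to the front. If the cache is full,
%% the oldest entry is dropped and is therefore no longer pinged
%% (the size bound is an assumption of this sketch).
add_crashed(Node, {MaxSize, Nodes}) ->
    Deduped = [Node | lists:delete(Node, Nodes)],
    {MaxSize, lists:sublist(Deduped, MaxSize)}.

%% Nodes that should receive a ping on the next timer tick.
nodes_to_ping({_MaxSize, Nodes}) ->
    Nodes.

%% A cached node answered a ping, so it is alive again: stop pinging it.
pong(Node, {MaxSize, Nodes}) ->
    {MaxSize, lists:delete(Node, Nodes)}.
```

With a cache like this, a crashed node keeps receiving pings until it either answers or is evicted, which matches the repeated messages seen above while node 2 is down.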

These send failures have been silenced in svn trunk a while ago along with 
other non-essential send operations, e.g. in the gossip and cyclon modules.

The messages that occur on a re-joined node ("Discarding message [...] in an 
old incarnation [...] of this node") have been fixed in rev 2208.

Original comment by nico.kru...@googlemail.com on 10 Sep 2011 at 9:24