seyyed / scalaris

Automatically exported from code.google.com/p/scalaris
Apache License 2.0

The database can become unavailable even if only one node fails #51

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
> What steps will reproduce the problem?
1. Start 4 or more nodes such that two consecutive nodes together own more than 25% of the key space, while the first of them owns less than 25%. We need to ensure that two replicas of most keys land on those two nodes.
2. Initialize the database with some data.
3. Stop the first of these two nodes.

> What is the expected output? What do you see instead?
After that, most read/write operations fail (even though the stopped node 
holds no more than one replica of any key).

I investigated the problem and I think I found its root cause. The transaction 
starts an insufficient number of TPs, because the second node never receives 
the message telling it to initialize a TP. The lookup:unreliable_lookup/2 
function is used to deliver that message, but the routing logic needs the 
previous (stopped) node to deliver it, so the second node is unreachable for 
transactions too...
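
The failure mode described above can be illustrated with a minimal Python sketch. This is not Scalaris code. It assumes 4 replicas per key placed at equal 25% offsets in the key space, that each node owns the keys in (predecessor, id], and that a transaction needs a majority of replicas (3 of 4). The key space size of 100 and all function names are hypothetical, chosen for readability.

```python
import bisect

REPLICAS = 4  # assumption: replication factor of 4


def replica_keys(key, key_space=100):
    """Return the keys of all replicas, spaced key_space / REPLICAS apart."""
    step = key_space // REPLICAS
    return [(key + i * step) % key_space for i in range(REPLICAS)]


def owner(ring, key):
    """Return the node responsible for key; a node owns (predecessor, id]."""
    nodes = sorted(ring)
    index = bisect.bisect_left(nodes, key)
    return nodes[index % len(nodes)]  # wrap around past the last node


def reachable_replicas(ring, key, unreachable, key_space=100):
    """Count the replicas of key whose owner can still be contacted."""
    return sum(
        1
        for k in replica_keys(key, key_space)
        if owner(ring, k) not in unreachable
    )


def has_majority(ring, key, unreachable, key_space=100):
    """Assumed commit rule: more than half of the replicas must respond."""
    needed = REPLICAS // 2 + 1
    return reachable_replicas(ring, key, unreachable, key_space) >= needed
```

With the hypothetical ring [10, 20, 60, 90], node 20 owns 10% of the key space and node 60 owns 40%. Stopping node 20 alone leaves 3 of the 4 replicas of key 15 reachable, which is still a majority. If node 60 also cannot be reached because messages to it must pass through node 20, only 2 replicas remain and the transaction cannot proceed.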

>What version of the product are you using? On what operating system?

svn 937

Original issue reported on code.google.com by serge.po...@gmail.com on 28 Jul 2010 at 3:02

GoogleCodeExporter commented 8 years ago
Currently, when one node of the ring is stopped, the routing tables of most of 
the remaining nodes become empty and are not updated... Routing then goes over 
successors instead of the routing tables, and if a successor is dead, routing 
stops.
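
A small Python sketch of this behaviour, under stated assumptions: it is not the Scalaris routing code, the node ids and the key space size of 100 are made up, and `route` is a hypothetical helper. Each hop prefers the farthest live routing table entry that does not overshoot the target and falls back to the direct successor when the table is empty.

```python
def clockwise_distance(a, b, key_space=100):
    """Distance from a to b when walking the ring clockwise."""
    return (b - a) % key_space


def route(ring, start, target, dead, tables=None, key_space=100):
    """Return the hop path from start to target, or None if routing stops.

    ring:   ids of all nodes, including dead ones (start and target are live)
    tables: optional mapping of node id to its routing table entries
    """
    nodes = sorted(ring)
    tables = tables or {}
    current, path = start, [start]
    while current != target:
        remaining = clockwise_distance(current, target, key_space)
        # Live routing table entries that make progress without overshooting.
        candidates = [
            n
            for n in tables.get(current, [])
            if n not in dead
            and 0 < clockwise_distance(current, n, key_space) <= remaining
        ]
        successor = nodes[(nodes.index(current) + 1) % len(nodes)]
        if candidates:
            nxt = max(
                candidates,
                key=lambda n: clockwise_distance(current, n, key_space),
            )
        elif successor not in dead:
            nxt = successor  # empty table: fall back to the successor
        else:
            return None  # empty table and dead successor: routing stops
        path.append(nxt)
        current = nxt
    return path
```

For the hypothetical ring [10, 20, 60, 90] with node 20 dead, `route(ring, 10, 60, {20})` returns None because node 10 has an empty table and its successor is dead. With a repaired table such as `{10: [60]}` the message skips the dead node and the path is [10, 60].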

Original comment by serge.po...@gmail.com on 28 Jul 2010 at 3:56

GoogleCodeExporter commented 8 years ago
I managed to reproduce the problem that you reported in issue #50 (see 
churn_SUITE), and I also found the code that caused it. In some cases, the 
routing tables were not repaired correctly, so transaction requests failed 
because only a few replicas could be contacted. I believe I fixed the problem 
as of r939. It should also fix the problem you describe here. Could you please 
check again with r939?

Original comment by schu...@gmail.com on 28 Jul 2010 at 4:03

GoogleCodeExporter commented 8 years ago
Sorry, I meant the problem you described on the mailing list.

Original comment by schu...@gmail.com on 28 Jul 2010 at 4:04

GoogleCodeExporter commented 8 years ago
Look at the table:
"serge.localdomain" [-1,"dead node?",-2,-3,-4]  0: 13949634270282579791514433256695289139  [1,2,3,"dead node?",4]  0   2028
"serge.localdomain" [-1,-2,"dead node?",-3,-4]  1: 79609880164404348420880338882525604031  [1,2,"dead node?",3,4]  0   8884
"serge.localdomain" [-1,-2,-3,"dead node?",-4]  2: 102253000696385556613599801511918946244 [1,"dead node?",2,3,4]  0   3110
"serge.localdomain" [-1,-2,-3,-4,"dead node?"]  3: 192399439837642237228630612343177178734 ["dead node?",1,2,3,4]  2   12230
"serge.localdomain" ["dead node?",-1,-2,-3,-4]  4: 339259843428845931632116029144159062869 [1,2,3,4,"dead node?"]  0   16933

The node with id 213771175941570743687568926069348066045 is stopped. 
The routing table of the preceding node consists of two elements:
{2,
                  {102253000696385556613599801511918946244,
                   {node,{{127,0,0,1},14200,<9255.103.0>},
                         102253000696385556613599801511918946244,0},
                   nil,
                   {213771175941570743687568926069348066045,
                    {node,{{127,0,0,1},14196,<9096.103.0>},
                          213771175941570743687568926069348066045,0},
                    nil,nil}}}
One of them is the dead node...

Original comment by serge.po...@gmail.com on 28 Jul 2010 at 4:07

GoogleCodeExporter commented 8 years ago
On svn 940 the problem is not fixed...

"serge.localdomain" [-1,"dead node?",-2,-3,-4]  0: 87252012913850628004647166267571547849  [1,2,3,"dead node?",4]  0   17890
"serge.localdomain" [-1,-2,"dead node?",-3,-4]  1: 117556503988519988366314956663609976271 [1,2,"dead node?",3,4]  0   4131
"serge.localdomain" [-1,-2,-3,"dead node?",-4]  2: 120048988123494560763763655231343568851 [1,"dead node?",2,3,4]  0   330
"serge.localdomain" [-1,-2,-3,-4,"dead node?"]  3: 151791367182612900689667052284998608104 ["dead node?",1,2,3,4]  2   4236
"serge.localdomain" ["dead node?",-1,-2,-3,-4]  4: 295330337121951573407675286713157606348 [1,2,3,4,"dead node?"]  2   8868

The routing tables of most nodes are empty (see RtSize). Transactions fail.

Original comment by serge.po...@gmail.com on 28 Jul 2010 at 4:27

GoogleCodeExporter commented 8 years ago
This should be fixed as of r943.

Original comment by nico.kru...@googlemail.com on 29 Jul 2010 at 4:05