seyyed / scalaris

Automatically exported from code.google.com/p/scalaris
Apache License 2.0

Timeout not handled during write #48

Closed by GoogleCodeExporter 8 years ago

GoogleCodeExporter commented 8 years ago
What steps will reproduce the problem?
1. Start some nodes
2. Wait a few seconds
3. On one node, enter: cs_api_v2:write("foo", bar).

What is the expected output? What do you see instead?
"ok" is expected. Instead, the output is:

2010-06-01 11:14:25 [error] Error: exception {{case_clause,timeout},
    [{rdht_tx_write,my_make_tlog_result_entry,2},
     {rdht_tx_write,on,2},
     {gen_component,loop,4},
     {gen_component,start,4}]} during handling of 
       {rdht_tx_read_reply,
         {{rdht_req_id,{1,<0.43.0>}},<0.43.0>,v1},
         {rdht_tx_read,"test",timeout,0,-1},
         {rdht_tx_read,"test",{fail,timeout}}}
     in module rdht_tx_write in ({4,3})

And the node completely stops working.

What version of the product are you using? On what operating system?
Latest SVN.

Please provide any additional information below.
This is hard to reproduce, as it occurs rather seldom. I'm also not sure how a timeout can occur with no workload and all nodes fully connected. But if it does, the case {fail,timeout} is not handled in /src/transactions/rdht_tx_write:my_make_tlog_result_entry/2.
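A minimal sketch of the kind of fix meant here. Only the module and function name come from the report; the surrounding clause names ({value, _} and {fail, not_found}) are assumptions about what the case expression already matches:

```erlang
%% Sketch only, not the actual Scalaris code: the real
%% my_make_tlog_result_entry/2 carries more context. The point is that
%% the case expression needs a branch for {fail, timeout}; without it,
%% Erlang exits with {case_clause, timeout} as seen in the log above.
my_make_tlog_result_entry(ReadResult, TLogEntry) ->
    case ReadResult of
        {value, _Val} ->
            {TLogEntry, ok};
        {fail, not_found} ->
            {TLogEntry, ok};
        {fail, timeout} ->
            %% previously unhandled -> crashed the gen_component loop
            {TLogEntry, {fail, timeout}}
    end.
```

With the extra clause the timeout is propagated as a failed write instead of crashing the node's gen_component loop.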

Original issue reported on code.google.com by Uwe.Daue...@gmail.com on 1 Jun 2010 at 9:36

GoogleCodeExporter commented 8 years ago
cs_api_v2 is still experimental. I'll take a look at this next week.
I suspect it is also related to issue 40, which is not yet completely solved either.

Original comment by schin...@gmail.com on 4 Jun 2010 at 9:50

GoogleCodeExporter commented 8 years ago
Ok, thanks. I also paid more attention to the conditions under which this occurs, and it seems to happen pretty much only right after starting up the nodes. I never saw it more than about a minute after starting everything up.

Original comment by Uwe.Daue...@gmail.com on 4 Jun 2010 at 10:24

GoogleCodeExporter commented 8 years ago
Then, could you please check whether the ring is already built by calling

admin:number_of_nodes() and admin:check_ring()

before you start your operation? Does it also occur when all the nodes you wanted are there and the ring is fine?
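The check described above would look like this in the Erlang shell of a Scalaris node (a hypothetical session; return values are omitted where not given by the thread):

```erlang
%% Hypothetical shell session on one Scalaris node.
%% Check that all started nodes have joined the ring:
admin:number_of_nodes().      % should equal the number of nodes started
%% Check that the ring structure is consistent:
admin:check_ring().           % should report the ring as intact
%% Only then issue the write:
cs_api_v2:write("foo", bar).  % expected to return ok
```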

Original comment by schin...@gmail.com on 4 Jun 2010 at 11:02

GoogleCodeExporter commented 8 years ago
No, sadly the number is correct and the ring size too. Attached is a more detailed output of what happened, from the perspective of one node (out of four):

I start up 4 nodes in 4 Erlang VM nodes, 1 Scalaris node each.
Then, four times, I check the parameters and write a value:

1. write: ok
2. write: failed
3. write: failed
4. write: ok

After the last fd_pinger message is received, no subsequent write fails. So I still suspect it has to do with the bootstrapping time.

Original comment by Uwe.Daue...@gmail.com on 4 Jun 2010 at 1:01

Attachments:

GoogleCodeExporter commented 8 years ago
- The exception is fixed in r828, but it is still unclear why the timeout is triggered.

Original comment by schin...@gmail.com on 9 Jun 2010 at 7:58

GoogleCodeExporter commented 8 years ago
Fixed, according to private conversation with Uwe.

Original comment by schu...@gmail.com on 10 Jun 2010 at 12:36