seyyed / scalaris

Automatically exported from code.google.com/p/scalaris
Apache License 2.0

Hang in increment benchmark #40

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
What steps will reproduce the problem?
1. ./bench_master.sh 4 1 80 8
2. ./bench_slave.sh 4

What is the expected output? What do you see instead?

This should be fairly fast, but most of the time it hangs. Sometimes it is quite slow [*], and sometimes it works pretty fast. Bench master running on one server, slave on another.

The hang/slowness is always after the first client finishes (the first BS is always printed). For example:

BS: 76013
BS: 67720004

What version of the product are you using? On what operating system?

Latest. Linux.

Please provide any additional information below.

Relevant config options:

{replication_factor, 2}.
{quorum_factor, 2}.
{boot_host, {{192,168,1,2},14195,boot}}.
{known_hosts, [{{192,168,1,2},14195,service_per_vm}]}.
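
With replication_factor 2 and quorum_factor 2, every quorum consists of all replicas, so a single slow or unreachable replica stalls each operation — one plausible contributor to the hangs. The sketch below checks the standard quorum-intersection conditions; this is general quorum-system theory, not Scalaris code, and the function name is made up:

```python
def quorums_ok(n, r, w):
    """Check the classic quorum-intersection conditions for n replicas
    with read quorum r and write quorum w:
      r + w > n  -> every read quorum overlaps every write quorum
      2 * w > n  -> no two write quorums are disjoint
    """
    return r + w > n and 2 * w > n

# replication_factor 2, quorum_factor 2: quorums intersect, but every
# operation needs *all* replicas, so one slow node stalls everything.
print(quorums_ok(2, 2, 2))  # True

# replication_factor 3, quorum_factor 2: a valid majority quorum with slack.
print(quorums_ok(3, 2, 2))  # True
```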

Original issue reported on code.google.com by natalija...@gmail.com on 26 Apr 2010 at 1:34

GoogleCodeExporter commented 8 years ago
Perhaps it is relevant. I also get some failure messages:

{failure,not_found}

Original comment by natalija...@gmail.com on 26 Apr 2010 at 1:55

GoogleCodeExporter commented 8 years ago
If the client concurrency is increased:

No result for commit received!

Original comment by natalija...@gmail.com on 26 Apr 2010 at 3:06

GoogleCodeExporter commented 8 years ago

You seem to be using cs_api_v2: we know about these issues and are working on them. It is still under development. I have already fixed some of these 'hold' issues locally, but unfortunately not all yet. I won't be able to fix it this week, maybe next.

For the time being, please use the cs_api.

Original comment by schin...@gmail.com on 26 Apr 2010 at 7:04

GoogleCodeExporter commented 8 years ago
Happens with cs_api, cs_api_v2, and the transaction api alike. Anything but a small number of clients per VM triggers problems with either the increment or the read benchmark.

Original comment by natalija...@gmail.com on 26 Apr 2010 at 10:36

GoogleCodeExporter commented 8 years ago
Could you provide us with your tests, so we can try to reproduce this on our systems?

Original comment by schin...@gmail.com on 26 Apr 2010 at 11:54

GoogleCodeExporter commented 8 years ago
I started with a clean checkout. Steps performed:

01. Added a "-boot_cs first true" to bench_master.sh
02. Fixed config options in scalaris.cfg (given above).
03. Ran bench as above -- hangs.
04. Modified src/bench_master.erl to call bench_server:run_increment_v2.
05. Ran bench as above -- slow/hangs.
06. Modified src/bench_master.erl to remove the call to the increment benchmark.
07. Ran bench as above -- works fine.
08. Increased number of iterations, nodes, etc.:
09. ./bench_master.sh 42 12 5000 128 -- slow/hangs.
10. Modified code to call the _v2 methods for the read benchmark.
11. Ran benchmark as in step 09 -- slow/hangs.

There are also various spurious errors about conditions not being handled -- in the transaction code, in the benchmark code, etc. -- in which case I restart the benchmark.

Erlang R13B04
Linux 2.6.33, x86_64.
Machines with a single processor.

Original comment by natalija...@gmail.com on 26 Apr 2010 at 3:58

GoogleCodeExporter commented 8 years ago
BTW, at low concurrency it might require a few tries. With a higher number of clients per VM (12), it is always slow or hangs.

Original comment by natalija...@gmail.com on 26 Apr 2010 at 4:05

GoogleCodeExporter commented 8 years ago
FWIW, I tried to reproduce it on another (multi-proc) cluster and wasn't able to. I'll try to investigate in detail once I have some time.

Original comment by natalija...@gmail.com on 26 Apr 2010 at 9:54

GoogleCodeExporter commented 8 years ago
The only thing I'm certain of so far is that bench_master fails to kill the bench_slaves when using TCP communication, while most of the time it works okay with native communication.

Original comment by natalija...@gmail.com on 26 Apr 2010 at 9:59

GoogleCodeExporter commented 8 years ago
Oops, spoke too soon. ./bench_master.sh 32 12 1000 16 hangs after a few not_found in the increment benchmark. Stock scalaris, no code changes. Definitely sporadic. Also, every once in a while there are some spurious errors at startup (boot node), like:

[error] [ PD ] lookup_process failed in Pid <0.206.0>: InstanceID:  
"dht_node_4037574900"  For: 
paxos_proposer StacK: []

[error] [ Node | <0.206.0> ] process paxos_proposer not found: []

or:

[error] Error: exception {function_clause,[{node,id,[unknown]},
                                   {rm_tman,'-merge/3-lc$^0/1-0-',2},
                                   {rm_tman,'-merge/3-lc$^0/1-0-',2},
                                   {rm_tman,merge,3},
                                   {rm_tman,update_view,7},
                                   {rm_tman,on,2},
                                   {gen_component,loop,4},
                                   {gen_component,start,4}]} during handling of {cy_cache,
                                                                                 [{node,
                                                                                   {{192,
                                                                                     168,
                                                                                     17,
                                                                                     213},
                                                                                    14195, ...

Original comment by natalija...@gmail.com on 26 Apr 2010 at 10:22

GoogleCodeExporter commented 8 years ago
The first error (process paxos_proposer not found) can occur during bootup when
processes happen to start in the wrong order. That is easy to fix but I would 
like to
check with Florian first.

The second error could potentially be more serious but I am still looking into 
it.

Original comment by schu...@gmail.com on 27 Apr 2010 at 1:43

GoogleCodeExporter commented 8 years ago
Thanks. It's just a snippet of the error; I can provide the full log if necessary. FWIW, I can reproduce it somewhat reliably by starting a batch of nodes/servers in a row. Currently I have to put a 4s sleep between each batch of 64 node creations to let them start.

As for the {failure, not_found}: I think I narrowed it down to meddling with two variables. Setting replication_factor to 3 and quorum_factor to 2 seems to trigger the {failure, not_found} errors quite fast. Perhaps this is not supported; I don't know, I haven't looked at the source code.

Otherwise, I managed to get some benchmarking going. I even got an unusually high (for me) ~4000 trans/s in the increment benchmark for a configuration with 13 VMs, 10 clients each, and 770 nodes.

Are there any guidelines on which numbers I should expect? Options I could play with?

Original comment by natalija...@gmail.com on 27 Apr 2010 at 11:38
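
As a rough back-of-the-envelope check on the ~4000 trans/s figure above (illustrative arithmetic only, using the VM and client counts from the comment; the real per-client latency depends on Scalaris internals):

```python
# Turn the reported aggregate throughput into a per-client rate.
vms, clients_per_vm = 13, 10
total_tps = 4000

clients = vms * clients_per_vm          # 130 concurrent clients
tps_per_client = total_tps / clients    # ~30.8 increments/s per client
latency_ms = 1000 / tps_per_client      # ~32.5 ms per transaction

print(f"{clients} clients, {tps_per_client:.1f} tx/s each, ~{latency_ms:.1f} ms/tx")
```

This only converts the aggregate into a per-client rate; whether ~32 ms per increment is good depends on the network and the replication settings in use.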

GoogleCodeExporter commented 8 years ago
BTW, during one of the benchmarks a physical server crashed. Some strange 
messages ensued:

=ERROR REPORT==== 27-Apr-2010::18:29:44 ===
** Node 'nodemj067193@mj06.m64cluster.kyivstar' not responding **
** Removing (timedout) connection **
[error] unknown message: {update,rm_tman,stabilizationInterval,<0.112.0>,887} 
in Module: self_man and 
handler on
 in State {{0,nil},{1272,403469,538569}}

[.. variations repeated quite a few times ..]

Original comment by natalija...@gmail.com on 28 Apr 2010 at 12:02

GoogleCodeExporter commented 8 years ago
Things are much more stable now. Thanks! However, a new one popped up:

=ERROR REPORT==== 28-Apr-2010::10:02:29 ===
Supervisor received unexpected message: {lookup_aux,
                                         178784895955716881487194901643901941928,
                                         1,
                                         {rt_get_node,<8207.209.0>,{1,62}}}
[error] unknown message: {lookup_aux,237466544288906589689643825219385656245,1,
                             {rt_get_node,<8207.97.0>,{1,62}}} in Module: rdht_tx_write and handler on
 in State {4,3} 

[error] unknown message: {lookup_aux,97456377977471065738055105113693951744,1,
                             {rt_get_node,<8207.264.0>,{1,62}}} in Module: rdht_tx_read and handler on
 in State {{'$_no_curr_entry',unknown,unknown,0,0,{0,-1},false,false},
           4,3,45100,'"dht_node_1843460355"_rdht_tx_read'} 

Original comment by natalija...@gmail.com on 29 Apr 2010 at 1:46

GoogleCodeExporter commented 8 years ago
Could you please check whether this is still an issue in r917?

Original comment by schin...@gmail.com on 27 Jul 2010 at 8:24

GoogleCodeExporter commented 8 years ago
I can't check, as bench_master.sh is now broken: it invokes bin/scalarisctl with bench_master, but there is no such option there.

Original comment by natalija...@gmail.com on 28 Aug 2010 at 5:48

GoogleCodeExporter commented 8 years ago

Original comment by schin...@gmail.com on 1 Sep 2010 at 6:45

GoogleCodeExporter commented 8 years ago
Fixed in r1793. High load does not lead to transaction timeouts anymore.

Original comment by schin...@gmail.com on 17 Jun 2011 at 11:24