Closed. GoogleCodeExporter closed this issue 8 years ago.
Perhaps it is relevant. I also get some failure messages:
{failure,not_found}
Original comment by natalija...@gmail.com
on 26 Apr 2010 at 1:55
If the client concurrency is increased:
No result for commit received!
Original comment by natalija...@gmail.com
on 26 Apr 2010 at 3:06
You seem to be using cs_api_v2: we know about these issues and are working on them. It is still under development. I have already fixed part of these 'hold' issues locally, but unfortunately not all of them yet. I won't be able to fix it this week, maybe next. For the time being, please use cs_api.
Original comment by schin...@gmail.com
on 26 Apr 2010 at 7:04
This happens with cs_api, cs_api_v2, and the transaction API alike. Anything but a small number of clients per VM triggers problems with either the increment or the read benchmark.
Original comment by natalija...@gmail.com
on 26 Apr 2010 at 10:36
Could you provide us with your tests, so we can try to reproduce this on our systems?
Original comment by schin...@gmail.com
on 26 Apr 2010 at 11:54
I started with a clean checkout. Steps performed:
01. Added "-boot_cs first true" to bench_master.sh.
02. Fixed the config options in scalaris.cfg (given above).
03. Ran the benchmark as above -- hangs.
04. Modified src/bench_master.erl to call bench_server:run_increment_v2.
05. Ran the benchmark as above -- slow/hangs.
06. Modified src/bench_master.erl to remove the call to the increment benchmark.
07. Ran the benchmark as above -- works fine.
08. Increased the number of iterations, nodes, etc.:
09. ./bench_master.sh 42 12 5000 128 -- slow/hangs.
10. Modified the code to call the _v2 methods for the read benchmark.
11. Ran the benchmark as in step 09. -- slow/hangs.
There are also various spurious errors about conditions not being handled -- in the transaction code, in the benchmark code, etc. When that happens, I restart the benchmark.
Erlang R13B04
Linux 2.6.33, x86_64.
Machines with a single processor.
Original comment by natalija...@gmail.com
on 26 Apr 2010 at 3:58
BTW, at low concurrency it might require a few tries. With a higher number of clients per VM (12), it always becomes slow or hangs.
Original comment by natalija...@gmail.com
on 26 Apr 2010 at 4:05
FWIW, I tried to reproduce it on another (multi-processor) cluster and wasn't able to. I'll try to investigate in detail once I have some time.
Original comment by natalija...@gmail.com
on 26 Apr 2010 at 9:54
The only thing I'm certain of so far is that bench_master fails to kill the bench_slaves when using TCP communication, while it works okay most of the time with native communication.
Original comment by natalija...@gmail.com
on 26 Apr 2010 at 9:59
Oops, spoke too soon. ./bench_master.sh 32 12 1000 16 hangs after a few not_found errors in the increment benchmark. Stock Scalaris, no code changes. Definitely sporadic. Also, every once in a while there are some spurious errors at startup (boot node), like:
[error] [ PD ] lookup_process failed in Pid <0.206.0>: InstanceID: "dht_node_4037574900" For: paxos_proposer StacK: []
[error] [ Node | <0.206.0> ] process paxos_proposer not found: []
or:
[error] Error: exception {function_clause,[{node,id,[unknown]},
                                           {rm_tman,'-merge/3-lc$^0/1-0-',2},
                                           {rm_tman,'-merge/3-lc$^0/1-0-',2},
                                           {rm_tman,merge,3},
                                           {rm_tman,update_view,7},
                                           {rm_tman,on,2},
                                           {gen_component,loop,4},
                                           {gen_component,start,4}]}
during handling of {cy_cache,[{node,{{192,168,17,213},14195, ...
Original comment by natalija...@gmail.com
on 26 Apr 2010 at 10:22
The first error (process paxos_proposer not found) can occur during bootup when processes happen to start in the wrong order. That is easy to fix, but I would like to check with Florian first.
The second error could potentially be more serious, but I am still looking into it.
Original comment by schu...@gmail.com
on 27 Apr 2010 at 1:43
Thanks. It's just a snippet of the error; I can provide the full log if necessary. FWIW, I can reproduce it somewhat reliably by starting a batch of nodes/servers in a row. Currently I have to put a sleep of 4 s between each creation of 64 nodes to let them start.
As for the {failure, not_found} errors, I think I narrowed them down to meddling with two variables: setting replication_factor to 3 and quorum_factor to 2 seems to trigger the {failure, not_found} errors quite fast. Perhaps this is not supported; I don't know, I haven't looked at the source code.
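As a sanity check of those two settings, here is a minimal sketch (plain POSIX shell, not Scalaris code) of the standard quorum-intersection conditions from weighted voting: r + w > n for read/write overlap and 2w > n for write/write overlap. It assumes, without having checked the Scalaris source, that replication_factor plays the role of n and that quorum_factor is used as both the read quorum r and the write quorum w.

```shell
#!/bin/sh
# Standard quorum-intersection sanity check (weighted-voting style).
# Assumption (not verified against the Scalaris source): replication_factor
# is n, and quorum_factor is used as both the read quorum r and write quorum w.
n=3; r=2; w=2

if [ $((r + w)) -gt "$n" ] && [ $((2 * w)) -gt "$n" ]; then
    echo "n=$n r=$r w=$w: quorums intersect"
else
    echo "n=$n r=$r w=$w: quorums do NOT intersect"
fi
```

With n = 3 and r = w = 2 both conditions hold, so if that assumption about the config options is right, the arithmetic itself is sound and the not_found failures would point at a bug rather than an inherently unsupported configuration.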
Otherwise, I managed to get some benchmarking going. I even got an unusually high (for me) ~4000 trans/s in the increment benchmark for a configuration with 13 VMs, 10 clients each, and 770 nodes.
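A back-of-envelope reading of those numbers (a sketch only; the latency figure assumes the clients issue increments back-to-back in a closed loop):

```shell
#!/bin/sh
# Back-of-envelope check of the reported benchmark figures:
# 13 VMs x 10 clients at ~4000 trans/s aggregate.
vms=13; clients_per_vm=10; throughput=4000

total=$((vms * clients_per_vm))            # 130 concurrent clients
per_client=$((throughput / total))         # ~30 trans/s per client (integer div)
latency_ms=$((1000 * total / throughput))  # ~32 ms per increment round trip

echo "$total clients, ~$per_client trans/s each, ~${latency_ms} ms/txn"
```

So ~4000 trans/s over 130 closed-loop clients works out to roughly 32 ms per transaction, which gives a concrete latency number to compare against expectations.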
Are there any guidelines on which numbers I should expect? Options I could play with?
Original comment by natalija...@gmail.com
on 27 Apr 2010 at 11:38
BTW, during one of the benchmarks a physical server crashed. Some strange
messages ensued:
=ERROR REPORT==== 27-Apr-2010::18:29:44 ===
** Node 'nodemj067193@mj06.m64cluster.kyivstar' not responding **
** Removing (timedout) connection **
[error] unknown message: {update,rm_tman,stabilizationInterval,<0.112.0>,887} in Module: self_man and handler on
in State {{0,nil},{1272,403469,538569}}
[.. variations repeated quite a few times ..]
Original comment by natalija...@gmail.com
on 28 Apr 2010 at 12:02
Things are much more stable now. Thanks! However, a new one popped up:
=ERROR REPORT==== 28-Apr-2010::10:02:29 ===
Supervisor received unexpected message: {lookup_aux,178784895955716881487194901643901941928,1,{rt_get_node,<8207.209.0>,{1,62}}}
[error] unknown message: {lookup_aux,237466544288906589689643825219385656245,1,{rt_get_node,<8207.97.0>,{1,62}}} in Module: rdht_tx_write and handler on
in State {4,3}
[error] unknown message: {lookup_aux,97456377977471065738055105113693951744,1,{rt_get_node,<8207.264.0>,{1,62}}} in Module: rdht_tx_read and handler on
in State {{'$_no_curr_entry',unknown,unknown,0,0,{0,-1},false,false},4,3,45100,'"dht_node_1843460355"_rdht_tx_read'}
Original comment by natalija...@gmail.com
on 29 Apr 2010 at 1:46
Could you please check whether this is still an issue in r917?
Original comment by schin...@gmail.com
on 27 Jul 2010 at 8:24
I can't check, as bench_master.sh is now broken: it invokes bin/scalarisctl with bench_master, but there is no such option there.
Original comment by natalija...@gmail.com
on 28 Aug 2010 at 5:48
Original comment by schin...@gmail.com
on 1 Sep 2010 at 6:45
Fixed in r1793. High load no longer leads to transaction timeouts.
Original comment by schin...@gmail.com
on 17 Jun 2011 at 11:24
Original issue reported on code.google.com by
natalija...@gmail.com
on 26 Apr 2010 at 1:34