satori-com / mzbench

MZ Benchmarking
BSD 3-Clause "New" or "Revised" License

Benchmark result: EXCEPTION director_connection_down #143

Open vedavidhbudimuri opened 6 years ago

vedavidhbudimuri commented 6 years ago

I'm using mzbench to stress test an MQTT broker. It worked well in the beginning, but suddenly I'm getting the error below. I also tried running older benchmarks that worked well earlier, and even those now show the same error.

System config: 8 cores, 32 GB RAM, Ubuntu

Error:

04:46:52.530 [error] [Undefined] <0.218.0> gen_server mzb_time terminated with reason: {timeout,{gen_server,call,[mzb_interconnect,get_director]}} in gen_server:call/2 line 206
04:47:03.892 [error] [Undefined] emulator Error in process <0.20270.0> on node 'mzb_director94_0@127.0.0.1' with exit value: {{badmatch,{error,timeout}},[{cpu_sup,measurement_server_init,0,[{file,"cpu_sup.erl"},{line,498}]}]}
04:47:13.342 [error] [Undefined] <0.218.0> CRASH REPORT Process mzb_time with 0 neighbours exited with reason: {timeout,{gen_server,call,[mzb_interconnect,get_director]}} in gen_server:call/2 line 206
04:47:14.884 [error] [Undefined] <0.154.0> Supervisor mzb_sup had child time_service started with mzb_time:start_link() at <0.218.0> exit with reason {timeout,{gen_server,call,[mzb_interconnect,get_director]}} in context child_terminated

parsifal-47 commented 6 years ago

Hello, this can happen when the node fails to start. If you use a "static" host list, you can check this by logging in to the worker node and looking for a running "beam.smp" process.

If an older node failed to stop, it can affect all subsequent benchmarks. This sometimes happens under very high load.

Please let me know if killing "beam.smp" solves the problem.

vedavidhbudimuri commented 6 years ago

Yes, I found more than one beam.smp process, so I killed all of them and restarted the mzbench server, but I'm still facing the same issue.

parsifal-47 commented 6 years ago

Do you run the workers and the API server on the same host (that is, do you use one host for everything)? Also, "beam.smp" can be restarted automatically by "heart", so you need to make sure it dies completely, or consider killing "heart" as well.
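A minimal sketch of the process check described above, assuming a Linux host where `ps` is available. The helper names `list_processes` and `find_erlang_processes` are hypothetical and not part of mzbench; the script only lists leftover "beam.smp" and "heart" processes so they can be inspected before anything is killed.

```python
import subprocess

# Process names mentioned in this thread: the Erlang VM and its watchdog.
TARGETS = {"beam.smp", "heart"}


def list_processes():
    """Return `ps` output with one 'PID COMMAND' pair per line (no header)."""
    result = subprocess.run(
        ["ps", "-e", "-o", "pid=", "-o", "comm="],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def find_erlang_processes(ps_output):
    """Return (pid, name) tuples for beam.smp and heart entries."""
    found = []
    for line in ps_output.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        pid, name = parts[0], parts[1].strip()
        if pid.isdigit() and name in TARGETS:
            found.append((int(pid), name))
    return found
```

Calling `find_erlang_processes(list_processes())` on the worker host returns the PIDs to look at. If "heart" appears in the list, it needs to be stopped first, otherwise it may restart "beam.smp".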

vedavidhbudimuri commented 6 years ago

@parsifal-47 Sorry, I noticed the part about heart only later. I rebooted the system to kill heart completely and continued my load testing. It worked for a while, but after some time I'm getting this issue again.

Yes, both my API server and the worker host are on the same system.

parsifal-47 commented 6 years ago

It looks like it fails to stop the node for some reason. Killing "beam.smp" and "heart" should not be necessary in normal operation. I'll think about other suggestions.

parsifal-47 commented 6 years ago

As far as I understand the situation, after some number of benchmarks it fails to stop the node, and you cannot start any more benchmarks.

Is overall system utilization high? I can suggest one experiment: try running a number of benchmarks with less load. If the problem does not reproduce, then high load is the cause.

Please let me know the result in any case. Thanks!

vedavidhbudimuri commented 6 years ago

Hey @parsifal-47, I have checked CPU utilization, which is definitely not at 100%, and memory usage is below 30%.

Is there anything else we can check to reach a conclusion?

parsifal-47 commented 6 years ago

Oh, I just realized that you developed your own plugin. I'll check the code. Is it here: https://github.com/vedavidhbudimuri/emq_custom_plugin ?

vedavidhbudimuri commented 6 years ago

@parsifal-47 Yes, I tried a custom plugin, but it is for the MQTT broker, and in any case it is disabled. For mzbench I'm using this plugin: https://github.com/erlio/vmq_mzbench/

parsifal-47 commented 6 years ago

Great! This one is well known and shouldn't cause a problem per se. Let me know if the scenario is also available online, and I'll try to reproduce your issue.

I have another idea: the server could be crashing before it stops the node. I have no idea why, but it is possible. To check, please look at your server logs at /opt/mzbench_api/log/error.log or <your_mzbench_dir>/server/log/error.log; there may be some crash info there.

Also, on the dashboard you should see disconnect (red) and connect (green) messages, because the websocket is closed and reopened in this case.
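The server log check suggested above can be sketched as a small filter. This assumes the server's error.log uses the same line layout as the log excerpt at the top of this issue (an `[error]` tag plus phrases such as "CRASH REPORT"); that layout and the helper names `crash_lines` and `scan_log` are assumptions for illustration, not part of mzbench.

```python
from pathlib import Path

# Phrases taken from the error excerpt quoted earlier in this issue.
MARKERS = ("CRASH REPORT", "Supervisor", "terminated with reason")


def crash_lines(text):
    """Return the [error] lines that contain one of the crash markers."""
    return [
        line
        for line in text.splitlines()
        if "[error]" in line and any(marker in line for marker in MARKERS)
    ]


def scan_log(path):
    """Return crash-related lines from a log file, or [] if it is missing."""
    log_file = Path(path)
    if not log_file.is_file():
        return []
    return crash_lines(log_file.read_text(errors="replace"))
```

For example, `scan_log("/opt/mzbench_api/log/error.log")` returns the matching lines from the first location mentioned above; the second location depends on where mzbench was installed.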