vernemq / vmq_mzbench

An MQTT loadtest and usage scenario tool for VerneMQ and other MQTT systems.
Apache License 2.0

gen_server mzb_time terminated with reason: {timeout,{gen_server,call,[mzb_interconnect,get_director]}} #11

Closed KrlosWd closed 4 years ago

KrlosWd commented 7 years ago

Hello,

I'm trying to do some benchmarking with multiple servers running mzbench + vmq_mzbench. However, whenever I go above 35k publishers (distributed across 14 nodes), I start getting timeout-related errors like the following:

12:42:29.004 [error] <0.196.0> gen_server mzb_time terminated with reason: {timeout,{gen_server,call,[mzb_interconnect,get_director]}} in gen_server:call/2 line 204
12:42:31.532 [error] <0.196.0> CRASH REPORT Process mzb_time with 0 neighbours exited with reason: {timeout,{gen_server,call,[mzb_interconnect,get_director]}} in gen_server:terminate/7 line 826
12:42:31.540 [error] <0.132.0> Supervisor mzb_sup had child time_service started with mzb_time:start_link() at <0.196.0> exit with reason {timeout,{gen_server,call,[mzb_interconnect,get_director]}} in context child_terminated

Does anyone have any idea what could be causing this?

This is the scenario I'm trying to run:

#!benchDL

make_install(git = "https://github.com/erlio/vmq_mzbench.git",
             branch = "master")

defaults("topic" = "topic1", 
            "subtopic" = "topic1",
            "sub_host" = "192.168.144.11", 
            "pub_host" = "192.168.144.11",
            "pubs"     = 1000, 
            "subname"  = "subscriber1",
            "poolname" = "pool1")

pool(size = 1,
     worker_type = mqtt_worker):

            connect([t(host, var("sub_host")),
                    t(port,1883),
                    t(client,var("subname")),
                    t(clean_session,true),
                    t(keepalive_interval,60),
                    t(proto_version,4), t(reconnect_timeout,4)
                    ])

            wait(1 sec)
            subscribe(var("subtopic"), 0)

pool(size = numvar("pubs"),
     worker_type = mqtt_worker,
     worker_start = linear(40 rps)):

            connect([t(host, var("pub_host")),
                    t(port,1883),
                    t(client,fixed_client_id(var("poolname"), worker_id())),
                    t(clean_session,true),
                    t(keepalive_interval,60),
                    t(proto_version,4), t(reconnect_timeout,4)
                    ])

            wait(15 sec)
            set_signal("connect1",1)
            wait_signal("connect1", numvar("pubs"))
            loop(time = 10 min, rate = 1 rps):
                publish(var("topic"), random_binary(150), 0)
            disconnect()

This scenario is executed independently by each node (14 nodes in total). Each node has its own topic, and they all publish/subscribe to the same server. The error occurs in at least one node after reaching 35k publishers in total, i.e. 2.5k publishers per node.
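For reference, here is a rough sketch of the steady-state load these numbers imply. It is only back-of-the-envelope arithmetic on the values quoted above (14 nodes, 2.5k publishers per node, 1 publish per second per worker, 150-byte payloads). The function name `steady_state_load` is made up for illustration, and the estimate assumes every publisher is in its publish loop at the same time and ignores MQTT header and TCP overhead.

```python
# Back-of-the-envelope load estimate using only numbers quoted in this thread.
# Assumptions (not measured): all publishers publish concurrently, and only
# the random_binary(150) payload is counted, not MQTT/TCP framing.

NODES = 14                    # benchmark nodes running the scenario
PUBLISHERS_PER_NODE = 2_500   # "pubs" value at which the error appears
PUBLISH_RATE_PER_WORKER = 1   # loop(..., rate = 1 rps)
PAYLOAD_BYTES = 150           # random_binary(150)


def steady_state_load(nodes, pubs_per_node, rate_per_worker, payload_bytes):
    """Return (total publishers, messages/s, payload bytes/s)."""
    total_publishers = nodes * pubs_per_node
    msgs_per_sec = total_publishers * rate_per_worker
    payload_bytes_per_sec = msgs_per_sec * payload_bytes
    return total_publishers, msgs_per_sec, payload_bytes_per_sec


if __name__ == "__main__":
    pubs, msgs, bps = steady_state_load(
        NODES, PUBLISHERS_PER_NODE, PUBLISH_RATE_PER_WORKER, PAYLOAD_BYTES
    )
    print(f"{pubs} publishers, {msgs} msg/s, {bps / 1e6:.2f} MB/s of payload")
```

Under these assumptions the broker sees roughly 35,000 messages per second in total, split across 14 topics with one subscriber each.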

Thanks in advance for your help. Best,

Carlos

ioolkos commented 7 years ago

Hi @KrlosWd, thanks for asking. I have seen those kinds of errors, but I can't give you a single reason why.

A couple of observations:

KrlosWd commented 7 years ago

Hi @ioolkos, thanks for your quick answer. As for your observations, it is worth mentioning that I'm benchmarking the open-source MQTT broker mosquitto, since I implemented some changes to it for experiments I'm conducting for a research project. The problem with mosquitto is that it is single-threaded. So, with that in mind:

14:15:06.228 [error] emulator Error in process <0.10270.0> on node 'mzb_director1_0@127.0.0.1' with exit value:
{{badmatch,{error,timeout}},[{cpu_sup,measurement_server_init,0,[{file,"cpu_sup.erl"},{line,497}]}]}
14:15:51.412 [error] emulator Error in process <0.10277.0> on node 'mzb_director1_0@127.0.0.1' with exit value:
{{badmatch,{error,timeout}},[{cpu_sup,measurement_server_init,0,[{file,"cpu_sup.erl"},{line,497}]}]}
14:16:18.462 [error] emulator Error in process <0.10280.0> on node 'mzb_director1_0@127.0.0.1' with exit value:
{{badmatch,{error,timeout}},[{cpu_sup,measurement_server_init,0,[{file,"cpu_sup.erl"},{line,497}]}]}
14:17:09.136 [error] emulator Error in process <0.10283.0> on node 'mzb_director1_0@127.0.0.1' with exit value:
{{badmatch,{error,timeout}},[{cpu_sup,measurement_server_init,0,[{file,"cpu_sup.erl"},{line,497}]}]}

Since mosquitto is single-threaded, the number of connections it can accept per second is pretty limited. With 40 rps per node I have a total of 560 rps. I think I could go higher than that, but I'm using this number as a safe rate in the meantime.
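To make the arithmetic explicit, here is a small sketch of the aggregate connection rate and the per-node ramp-up time. It assumes that `worker_start = linear(40 rps)` starts workers at a constant 40 per second on each node, and that all 14 nodes start at the same moment. The helper name `ramp_up` is hypothetical.

```python
# Aggregate connect rate and ramp-up time for the scenario in this thread.
# Assumption: linear(40 rps) means a constant 40 new workers per second per
# node, and all nodes begin their ramp-up simultaneously.


def ramp_up(nodes, start_rate_per_node, pubs_per_node):
    """Return (aggregate connects/s at the broker, ramp-up seconds per node)."""
    aggregate_rate = nodes * start_rate_per_node
    ramp_seconds = pubs_per_node / start_rate_per_node
    return aggregate_rate, ramp_seconds


if __name__ == "__main__":
    rate, seconds = ramp_up(nodes=14, start_rate_per_node=40, pubs_per_node=2_500)
    print(f"{rate} connects/s at the broker, {seconds} s to start all workers")
```

With 2.5k publishers per node, each node would need about a minute to start all of its workers at this rate.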

Each node uses a different topic, so I actually have 14 queues but I'm open to suggestions :D

ioolkos commented 7 years ago

Thanks for the details, @KrlosWd! Keep us posted on your testing progress and any results with Mosquitto (which of course is incredibly powerful on 1 core).