Dynamic deadlock error with linear worker_start method

westcode commented 7 years ago

When we run the example scenarios (e.g. fan_in) and set the worker start to the linear method (instead of poisson), we receive the following error:

09:51:58.603 [error] [ API ] <0.4582.0> Benchmark result: FAILED
Dynamic deadlock detected

09:51:58.618 [error] [ API ] <0.4548.0> Stage 'pipeline - running': failed
Benchmark has failed on running with reason:
{benchmark_failed,{asserts_failed,1}}

Stacktrace: [{mzb_pipeline,error,2,
                           [{file,"/home/test/mzbench/server/_build/default/deps/mzbench_api/src/mzb_pipeline.erl"},
                            {line,90}]},
             {mzb_pipeline,'-handle_cast/2-fun-0-',6,
                           [{file,"/home/test/mzbench/server/_build/default/deps/mzbench_api/src/mzb_pipeline.erl"},
                            {line,172}]}]

With the poisson method the script does execute, but it hammers the server during startup. Do you have an idea why the above problem occurs and how we can fix it?

Thanks in advance.

ioolkos commented 7 years ago

Thanks for reporting this, @westcode! Interesting. MZBench seems to have added "dynamic deadlock" detection a couple of weeks ago: https://github.com/machinezone/mzbench/commit/be636c916f06ce630520d8feac4908f53a3dd654

I'll have to look into this and understand it. I'll be able to give a first look in about 8 hours...

westcode commented 7 years ago

Hi @ioolkos, thanks for your quick response! Did you find something? Do you know of an older release of mzbench that should work (e.g. 0.3.6)?

ioolkos commented 7 years ago

Hi @westcode, yep, I have looked into it and I think the most probable reason is that your script waits for a signal that is never reached. Have a look at the sizes of your pools and at how you set set_signal and wait_signal accordingly.

Let me know if this doesn't help. Thanks for pointing out the typo! :+1:
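To illustrate the suspected cause, here is a minimal sketch in Python rather than benchDL. It assumes a simplified model of MZBench signals in which each worker's set_signal(name, n) adds n to a shared counter and wait_signal(name, target) blocks until the counter reaches target. The function name and parameters are hypothetical and not part of MZBench.

```python
def signal_reachable(pool_size: int, per_worker_count: int, wait_target: int) -> bool:
    """Return True if the pool can ever emit the signal wait_target times.

    Simplified model (an assumption, not MZBench's implementation): every
    worker in the pool emits the signal per_worker_count times in total.
    """
    return pool_size * per_worker_count >= wait_target


# A pool of 1000 workers, each emitting once, can satisfy a wait for 1000.
print(signal_reachable(1000, 1, 1000))

# The same pool can never satisfy a wait for 5000, so every worker would
# block forever, which is the kind of situation a deadlock detector flags.
print(signal_reachable(1000, 1, 5000))
```

Under this model, a mismatch between the pool size and the wait_signal count is the first thing to rule out.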

westcode commented 7 years ago

Thanks a lot for your help. It indeed seems to have something to do with the wait times in combination with the pool size in the scenario. I'm trying to ramp up connections to the broker over time (to test the connection limits), for which we have changed fan_in.bdl to:

#!benchDL

#######
# Scenario:
# A single subscriber reading from "prefix/clients/#" topic filter
# 1k publisher publishing to exclusive topic "prefix/clients/{client_id}"
# Overall Msg rate: 1k msg/s
# Message Size: 150 random bytes
# Runtime: 5 min
#######

make_install(git = "https://github.com/erlio/vmq_mzbench.git",
             branch = "master")

defaults("pool_size" = 1000)

pool(size = 1,
     worker_type = mqtt_worker):

            connect([t(host,"192.168.1.100"),
                    t(port,1883),
                    t(username,"User"),
                    t(password,"Password"),
                    t(client,"subscriber1"),
                    t(clean_session,true),
                    t(keepalive_interval,60),
                    t(proto_version,4), t(reconnect_timeout,4)
                    ])

            wait(1 sec)
            subscribe("test/clients/#", 0)

pool(size = numvar("pool_size"),
     worker_type = mqtt_worker,
     worker_start = linear(250 rps)):
            connect([t(host,"192.168.1.100"),
                    t(port,1883),
                    t(username,"User"),
                    t(password,"Password"),
                    t(client,fixed_client_id("pool1", worker_id())),
                    t(clean_session,true),
                    t(keepalive_interval,60),
                    t(proto_version,4), t(reconnect_timeout,4)
                    ])

            set_signal(connect1, 1)
            wait_signal(connect1, numvar("pool_size"))
            wait(5 sec)
            loop(time = 5 min, rate = 1 rpm):
                publish_to_self("test/clients/", random_binary(150), 0)
            disconnect()

The script runs properly (with a pool size of 1000), but the results do not match our expectations, and we cannot increase the pool size any further:

The connection graph shows the full pool connected after about 18 seconds, regardless of the ramp rate (we tried 250 rps and 125 rps). Additionally, no ramp is visible in the graphs.
When we increase the wait(5 sec) to e.g. 45 seconds and try a pool size of 5000, we get the dynamic deadlock again.

Do you have an idea why this happens? Or can you maybe suggest a better scenario for testing connection limits?

Many thanks in advance.
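For reference, here is a rough sketch of the ramp duration one might expect. It assumes that linear(N rps) starts workers at a constant rate of N per second, and the helper name is hypothetical. The ~18 seconds observed for both rates does not match this simple model.

```python
def expected_ramp_seconds(pool_size: int, start_rate_rps: float) -> float:
    """Time needed to start pool_size workers at a constant start rate.

    Assumes a purely linear start (an assumption about worker_start = linear).
    """
    if start_rate_rps <= 0:
        raise ValueError("start rate must be positive")
    return pool_size / start_rate_rps


# Pool of 1000 workers at the two rates tried in this thread.
for rate in (250, 125):
    print(rate, "rps ->", expected_ramp_seconds(1000, rate), "seconds")
```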

ioolkos commented 7 years ago

Hi, what are you actually testing/optimizing for? number of concurrent connections, or connection SETUP rate? If it's the latter, do you have a specific number you want to reach?

westcode commented 7 years ago

Number of concurrent connections at first (scaling using a load balancer)

ioolkos commented 7 years ago

Alright, I'm going to try to find out why the signals deadlock here. In the meantime, you could delete the set_signal and wait_signal calls in your script and replace them with a manual wait(25 sec) or something similarly appropriate. This should work in any case.

ioolkos commented 7 years ago

Hi @westcode, I found the following workaround: I can use signals without problems when I wait before sending the signals. The wait time depends on metric_update_interval_ms, which is an MZBench server config value normally set to 10 seconds.

            wait(10 sec)
            set_signal("connect1",1)
            wait_signal("connect1", 5000)

If I set this wait time, I don't see dynamic deadlocking anymore.

westcode commented 7 years ago

Hi @ioolkos, Thanks for this 🥇, we've successfully added this back into the fan_in script and it is working as expected with the signals.

One more question regarding the test results :) We run the fan_in test with a pool size of 25000 on 2 mzbench clients connecting to 1 broker. This amounts to 50k connections and the test succeeds. When we raise the pool size to 30k per mzbench client, the connections stall at around 28.2k on each client (28222/28213). The file limits have been raised to 65535 on both clients. Do you have an idea why we cannot go past this limit?

Thanks again 👍

ioolkos commented 7 years ago

Quick shot in the dark: the Linux default ephemeral port range looks suspiciously close to your values: 61000 - 32768 = 28232. (You can configure that, of course.)
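A small sketch of that arithmetic, plus a way to read the range currently configured on a Linux client. The helper names are made up for illustration, and the /proc path is assumed to exist only on Linux.

```python
from pathlib import Path

# Linux exposes the ephemeral port range here as two whitespace-separated numbers.
RANGE_FILE = Path("/proc/sys/net/ipv4/ip_local_port_range")


def port_range_width(low: int, high: int) -> int:
    """Difference between the bounds, as computed above.

    If both bounds are usable, the range holds one more port than this.
    """
    return high - low


def read_local_port_range(path: Path = RANGE_FILE):
    """Return (low, high) from the kernel setting, or None if unavailable."""
    if not path.exists():
        return None
    low, high = path.read_text().split()
    return int(low), int(high)


print(port_range_width(32768, 61000))
print(read_local_port_range())
```

Each outgoing connection from one client IP to the same broker address and port needs its own local port, so the width of this range caps the connections per client.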

westcode commented 7 years ago

@ioolkos thanks for the tip, this was indeed the problem :). To conclude, we would like to test with both server and client certificates (self-signed). Is this possible using vmq_mzbench?

ioolkos commented 7 years ago

Good question. The problem is how to generate client certificates. What I did a while ago was to use the same client cert for all the workers.

There is no obvious way to load the certs dynamically into the client workers. I did it with the include_resource statement, as shown here; the resources can then be loaded with the resource statement while setting up the transport.

include_resource(cacertsfile1, "erlang_client_ca.erl", erlang)
include_resource(certfile1, "erlang_client_dercert.erl", erlang)
include_resource(keyfile1, "erlang_client_derkey.erl", erlang)

pool(size = 50000,
     worker_type = mqtt_worker,
     worker_start = poisson(1000 rps)):

            connect([t(host, "127.0.0.1"),
                    t(port,1883),
                    t(client,fixed_client_id("pool1", worker_id())),
                    t(clean_session,true),
                    t(keepalive_interval,60),
                    t(proto_version,4), t(reconnect_timeout,15),
                    t(transport, t(ssl, [t(reuse_sessions, false),
                                         t(cacerts, resource(cacertsfile1)),
                                         t(cert, resource(certfile1)),
                                         t(key, resource(keyfile1))]))
                    ])
            wait(20 sec)
            subscribe("prefix/clients/fix", 0)
            set_signal(connect1, 1)
            wait_signal(connect1, 50000)

Now there is one big problem here: I had to convert the cert file to some erlang format, so that the worker can load them. It didn't work by loading just the binaries. And I have to remember how I did that now.... :)

I'm not sure you want to try that yourself. But I'm going to have a look if there is some better possibility to do MZBench benches with certs.

westcode commented 7 years ago

Hi @ioolkos! Any new thoughts on this? 👍

ioolkos commented 7 years ago

Apologies @westcode for not following quickly through with this.

Currently, the only method is still to manually transform the certs into Erlang-loadable files. @gdhgdhgdh has successfully verified that this also works with one's own certs, I think.

If you want I can send you our test certs transformed in this way and guide you through the process.

Other than that I plan to see if I can implement this as MZBench statement functions that could load the certs. As soon as I find time, let's say.

westcode commented 7 years ago

Sounds good! 👍 Thanks a lot.

whyameye commented 7 years ago

I would also appreciate a look-see of the test certs and a guide for how to transform my own self-signed certs to be erlang-loadable.

ioolkos commented 7 years ago

OK, @whyameye thanks for the reminder & I'll try to pack this up tomorrow and send it to you.

ioolkos commented 7 years ago

OK, I might as well document this here.

To use a client cert in a vmq_mzbench scenario, you currently have to transform the certs to Erlang terms. Fire up an Erlang shell by typing erl and do the following procedure for each of your certs (the ca, cert and key files):

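%% Read the PEM file into a binary.
{ok, File} = file:read_file('/home/afa/BENCHMARKS/client.crt').
%% Decode the PEM entries into a list of {Type, Der, CipherInfo} tuples.
CC = public_key:pem_decode(File).
%% Print the whole term without truncation so it can be copied.
rp(CC).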

From the console output, copy the needed parts. Compare with my 3 example files in the gist below to see how they should look. https://gist.github.com/ioolkos/1e6e0107b961caf910a0deb61a7e4a23

Save your 3 generated files in the same directory as your MZBench test script. Make sure each of those files ends with a period (.).

Run your MZBench test script with the command: mzbench run your_script.bdl

To use the certs in your script, do something like the following:

include_resource(cacertsfile, "erlang_ca.erl", erlang)
include_resource(certfile, "erlang_client_cert.erl", erlang)
include_resource(keyfile, "erlang_client_key.erl", erlang)

pool(size = 50000,
     worker_type = mqtt_worker,
     worker_start = poisson(1000 rps)):

            connect([t(host, "127.0.0.1"),
                    t(port,1883),
                    t(client,fixed_client_id("pool1", worker_id())),
                    t(clean_session,true),
                    t(keepalive_interval,60),
                    t(proto_version,4), t(reconnect_timeout,15),
                    t(transport, t(ssl, [t(reuse_sessions, false),
                                         t(cacerts, resource(cacertsfile)),
                                         t(cert, resource(certfile)),
                                         t(key, resource(keyfile))]))
                    ])
            wait(20 sec)
            subscribe("prefix/clients/fix", 0)
            set_signal(connect1, 1)
            wait_signal(connect1, 50000)

Note: This will use the same Client cert for all your MZBench client connections! The purpose of this is just to have an approximation to a real world case, to get an idea on performance etc.

whyameye commented 7 years ago

Thank you for this. Do you know if there is a way I can set insecure mode (no domain name checking) and still use certs the way you describe?

ioolkos commented 7 years ago

Oh man, realising TLS is not exactly my forte :) You mean the clients not checking server hostname, or the other way around? I'm gonna have to look what kind of verify option or verify fun this could be in Erlang TLS...
