satori-com / mzbench

MZ Benchmarking

mzb_staticcloud_plugin doesn't work properly with more than one machine #129

Open loguntsov opened 6 years ago

loguntsov commented 6 years ago

When I had only one machine for tests in my config file:

{cloud_plugins,[
    {static,#{
        hosts => [
            "root@172.19.214.212"
        ],
        module => mzb_staticcloud_plugin
    }}
 ]},

my tests worked fine on one machine.

When I added one more:

{cloud_plugins,[
    {static,#{
        hosts => [
            "root@172.19.214.212", 
            "root@172.19.214.211" 
        ],
        module => mzb_staticcloud_plugin
    }}
 ]},

my test failed with this reason:

6:04:15.737 [info] [ API ] <0.426.0> Allocating 3 hosts in static cloud...
16:05:15.828 [error] [ API ] <0.418.0> Stage 'pipeline - allocating_hosts': failed
Benchmark has failed on allocating_hosts with reason:
hosts_are_busy

Stacktrace: [{mzb_staticcloud_plugin,create_cluster,3,
                 [{file,
                      "/home/begemot/erl/oncam/load_test/mzbench_oncam/lib/mzbench/server/_build/default/deps/mzbench_api/src/mzb_staticcloud_plugin.erl"},
                  {line,31}]},
             {mzb_api_cloud,'-handle_call/3-fun-0-',7,
                 [{file,
                      "/home/begemot/erl/oncam/load_test/mzbench_oncam/lib/mzbench/server/_build/default/deps/mzbench_api/src/mzb_api_cloud.erl"},
                  {line,87}]},
             {mzb_api_bench,allocate_hosts,2,
                 [{file,
                      "/home/begemot/erl/oncam/load_test/mzbench_oncam/lib/mzbench/server/_build/default/deps/mzbench_api/src/mzb_api_bench.erl"},
                  {line,738}]},
             {mzb_api_bench,handle_stage,3,
                 [{file,
                      "/home/begemot/erl/oncam/load_test/mzbench_oncam/lib/mzbench/server/_build/default/deps/mzbench_api/src/mzb_api_bench.erl"},
                  {line,224}]},
             {mzb_pipeline,'-handle_cast/2-fun-0-',6,
                 [{file,
                      "/home/begemot/erl/oncam/load_test/mzbench_oncam/lib/mzbench/server/_build/default/deps/mzbench_api/src/mzb_pipeline.erl"},
                  {line,172}]}]

I found this line:

https://github.com/satori-com/mzbench/blob/6fc463df6aa87117c1f223f67ae7ec66ddd4802f/server/src/mzb_staticcloud_plugin.erl#L46

So if this plugin has only one machine in its list, allocation always succeeds, even when N (the number of requested machines) is greater than 1.

When the list has more than one machine, the next clause applies: https://github.com/satori-com/mzbench/blob/6fc463df6aa87117c1f223f67ae7ec66ddd4802f/server/src/mzb_staticcloud_plugin.erl#L49

which returns a result only once N is less than or equal to the number of servers.

But I noticed this line in the log:

6:04:15.737 [info] [ API ] <0.426.0> Allocating 3 hosts in static cloud...

Why does it need to allocate 3 hosts from 2 machines? How is that possible?

So I found these lines: https://github.com/satori-com/mzbench/blob/6fc463df6aa87117c1f223f67ae7ec66ddd4802f/server/src/mzb_api_bench.erl#L737-L738

Why is it N+1? I don't understand it.
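A minimal sketch of the two allocation cases described above, to make the mismatch concrete. It is a hypothetical model, not the code of mzb_staticcloud_plugin: the module name `static_alloc_sketch` and the function `lock_hosts/2` are made up for illustration, and the sketch answers immediately instead of waiting for hosts to become free.

```erlang
-module(static_alloc_sketch).
-export([lock_hosts/2]).

%% Hypothetical model of the behaviour described in this issue.
%% Not the plugin's real code; waiting and timeouts are left out.

%% One configured host: any request succeeds and gets that same host.
lock_hosts(_N, [Host]) ->
    {ok, [Host]};
%% Several configured hosts: succeed only if N of them exist.
lock_hosts(N, Hosts) when is_integer(N), N =< length(Hosts) ->
    {ok, lists:sublist(Hosts, N)};
%% More hosts requested than configured: the request cannot be satisfied.
lock_hosts(_N, _Hosts) ->
    {error, hosts_are_busy}.
```

With this model, `static_alloc_sketch:lock_hosts(3, Hosts)` succeeds when `Hosts` has one entry and fails with `hosts_are_busy` when it has two, which matches what the log shows.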

So I applied this patch:

--- a/server/src/mzb_api_bench.erl
+++ b/server/src/mzb_api_bench.erl
@@ -734,8 +734,8 @@ allocate_hosts(#{nodes_arg:= N, cloud:= Cloud} = Config, Logger) when is_integer
         description => Description
     },
     % Allocate one supplementary node for the director
-    Logger(info, "Allocating ~p hosts in ~p cloud...", [N + 1, Cloud]),
-    {ok, ClusterId, UserName, Hosts} = mzb_api_cloud:create_cluster(BenchId, Cloud, N + 1, ClusterConfig),
+    Logger(info, "Allocating ~p hosts in ~p cloud...", [N, Cloud]),
+    {ok, ClusterId, UserName, Hosts} = mzb_api_cloud:create_cluster(BenchId, Cloud, N, ClusterConfig),
     Deallocator =
         fun () ->
             mzb_api_cloud:destroy_cluster(ClusterId)

and it works now. But one machine is at 100% load while the second is at 0% load, even though I can see an mzbench instance running on it.

Something is wrong here. Could you look into it on your side and confirm what is wrong? Thanks.

parsifal-47 commented 6 years ago

Hi, sorry for the slow response. It allocates N+1 machines so that the "director" node runs separately. We have one node doing only aggregation, because when aggregation is done by an ordinary "worker" node it is affected by high CPU load and you may see distorted data.

When you have only one host, we use a trick and present it as any number of hosts, because otherwise you wouldn't be able to run anything without a cluster. So this "1 = any" mode is meant for running locally.
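The arithmetic behind this can be sketched as follows. This is an illustration under the assumptions stated in this comment (N workers plus one director, a single host counting as any number of hosts); the module `nodes_arg_sketch` and both function names are hypothetical and do not exist in MZBench.

```erlang
-module(nodes_arg_sketch).
-export([hosts_requested/1, max_nodes_arg/1]).

%% Hosts requested for a given --nodes value: N workers plus one director.
hosts_requested(N) when is_integer(N), N > 0 ->
    N + 1.

%% Largest --nodes value a static host list can satisfy.
%% A single host is treated as "any number of hosts", so there is no limit.
max_nodes_arg([_SingleHost]) ->
    unlimited;
%% With several hosts, one of them is reserved for the director.
max_nodes_arg(Hosts) when length(Hosts) > 1 ->
    length(Hosts) - 1.
```

Under this model, a static list of two hosts supports at most `--nodes=1`, and `--nodes=2` asks for three hosts.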

Hope this answers your questions, feel free to ask more.

loguntsov commented 6 years ago

Hi. As I understand it, this depends on the --nodes=.. command-line parameter? What is the default value for this parameter? I would hope it is the number of machines described in the config minus 1? Where can I find this expression?

I think there is a misunderstanding here. The end user thinks in terms of the number of machines, but you are talking about the number of nodes. The end user doesn't know about directors and workers; he knows about the machines allocated for tests. So it would be easier to pass the number of machines in the --nodes parameter than the number of nodes (number of machines - 1), because the latter is not obvious. Also, if you have only one machine for tests, you can pass any value to --nodes, which can cause confusion. MZBench should raise an error in this case, or a warning such as "worker and director on same node (machine)".

parsifal-47 commented 6 years ago

The default is "1", but it means different things depending on whether you have one host or multiple hosts available. With a single host, the director and the worker are put on the same node; with multiple hosts, it will try to separate them.

The default is here: https://github.com/satori-com/mzbench/blob/master/server/src/mzb_api_endpoints.erl#L513
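The placement described in this comment could be modelled roughly like this. It is a sketch only: the module `role_sketch` and the function `assign_roles/1` are hypothetical, and picking the first allocated host as the director is an assumption, since the thread does not say which host is chosen.

```erlang
-module(role_sketch).
-export([assign_roles/1]).

%% Hypothetical model; not MZBench code.
%% Assumption: the first allocated host becomes the director.

%% A single allocated host runs both the director and the worker.
assign_roles([Host]) ->
    #{director => Host, workers => [Host]};
%% With several hosts, one is reserved for the director.
assign_roles([Director | Workers]) ->
    #{director => Director, workers => Workers}.
```

In this model, only the single-host case ever places a worker on the director's machine.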

The client sometimes has no idea what is happening on the server, and there could also be multiple clients. In our case it was usually an AWS EC2 installation with an almost unlimited number of available nodes, so having "1" as the default was the most logical choice.

I understand that it could be clearer, please feel free to comment!

loguntsov commented 6 years ago

Look.

You have only 1 machine described in the config (via the static plugin). You pass --nodes=5, and MZBench works with only 1 machine, without any warning or notification.

Another case.

You have only 2 machines described in the config. You pass --nodes=5, and this ends in a timeout, because MZBench can't allocate 5 nodes.

This is logically inconsistent. The behaviour should be the same every time.

parsifal-47 commented 6 years ago

I agree that we need to describe it more explicitly and also add some warnings on allocation.
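One possible shape for such a check, as a sketch only. Nothing like this is confirmed to exist in MZBench: the module `alloc_check_sketch`, the function `check/2` and the `not_enough_hosts` error term are hypothetical, and the warning text is the one suggested earlier in this thread. The idea is to compare the --nodes value against the static host list before allocation, so the single-host case produces a warning and an impossible request fails at once rather than after a timeout.

```erlang
-module(alloc_check_sketch).
-export([check/2]).

%% Hypothetical pre-allocation check; not part of MZBench.
%% N is the --nodes value, Hosts is the static host list.

%% Single host: allocation would succeed, but roles share one machine.
check(N, [_Host]) when is_integer(N), N >= 1 ->
    {warning, "worker and director on same node (machine)"};
%% Enough hosts for N workers plus one director.
check(N, Hosts) when is_integer(N), N >= 1, N + 1 =< length(Hosts) ->
    ok;
%% Too few hosts: report it immediately instead of timing out.
check(N, Hosts) when is_integer(N), N >= 1 ->
    {error, {not_enough_hosts, [{requested, N + 1},
                                {available, length(Hosts)}]}}.
```

For the two cases discussed above, `check(5, Hosts)` returns the warning when `Hosts` has one entry and the `not_enough_hosts` error when it has two.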