BraulioV commented 3 years ago

This scenario tests how the cluster performs when a large number of agents are connected to it. The objective is to measure synchronization times, discover bottlenecks, and establish a baseline for testing the performance of the distributed API and its endpoints.

Objectives

[x] Deploy a wazuh-cluster of n nodes automatically.
[x] Deploy a load balancer for this cluster.
[x] Extract hardware metrics for the cluster and API.
[x] Run the test with the following numbers of agents:
- [x] 1000.
- [x] 25000.
- [x] 50000.

jmv74211 commented 3 years ago

In this second iteration, we are going to improve the test cluster pipeline so that it handles the deployment, configuration, etc. of a cluster of N nodes instead of a single-manager scenario.

To do this, I estimate the following tasks:

[x] 1: Update Jenkins UI parameters: number of manager workers, local internal options configuration, etc.
[x] 2: Parallel deployment of N additional manager worker instances.
[x] 3: Provisioning of all cluster nodes. Master and workers.
[x] 4: Cluster configuration.
[x] 5: Collect cluster log and store as artifact.
[x] 6: Configure the cluster nodes with custom local internal options.
[x] 7: Add TCP-UDP configuration to the cluster of wazuh-managers.
[x] 8: Update the agent simulator to allow registration and connection on different nodes. (Registration in master and connection with the worker).
[x] 9: Add new directory structure to classify plots and stats data.
[x] 10: Collect data, graphs, etc. from all managers, classify them by instance, and store them as artifacts.
[x] 11: Extra: Fix cluster configuration problem in ossec.conf.
[x] 12: Investigate the AWS Load Balancing service and use it for load balancing between simulated agents and workers.
[ ] 13: Investigate with the framework team how we can obtain global cluster metrics => This will be done in the next iteration.
- [ ] 13.1: Parse of /var/ossec/bin/cluster_control -i more ??
- [ ] 13.2: Parse of cluster logs with debug ??
[x] 14: Add a new parameter to select the type of instance to be deployed

jmv74211 commented 3 years ago

Task 1

I have added the following parameters to the Jenkins UI:

MANAGER_WORKERS

It is used to specify the number of worker nodes to be deployed.

LOCAL_INTERNAL_OPTIONS_CONFIG

Used to specify the custom settings to be applied in the local_internal_options.conf file. This will be applied to all cluster nodes.

Here you can see the changes involved in the pipeline for this task https://github.com/wazuh/wazuh-jenkins/commit/5a5233044728066f169a7297ab074462e0d20054

jmv74211 commented 3 years ago

Task 2

The total number of managers will be 1 + manager_workers_num.

Since I had already prepared the parallel deployment of N instances for the managers, the only change required for this task was as follows https://github.com/wazuh/wazuh-jenkins/commit/c68c1e4b1465bf0eb8e36974aefcdb487e8ab4b4

jmv74211 commented 3 years ago

Task 3

Provisioning is already done in parallel because all instances belong to the same host group in the Ansible inventory. By default the following is done for all nodes:

Package download according to the parameters entered in the UI (RPM only).
Start and enable the wazuh-manager service.
Cloning of the QA repository specified in the UI parameters.
Installation of QA dependencies.

jmv74211 commented 3 years ago

Task 4

The cluster configuration has been done by adding a new <ossec_config> block at the end of the ossec.conf file with the specific configuration of each cluster node (https://github.com/wazuh/wazuh-jenkins/blob/87efef121fc9adc64ee4d3de7437081b6609e46a/jenkins-files%2Ftests%2Fperformance%2Ftest_cluster.groovy#L496-L510).

In this environment, there is one master node and the rest are worker nodes.

The changes made are as follows: https://github.com/wazuh/wazuh-jenkins/commit/87efef121fc9adc64ee4d3de7437081b6609e46a

For example, the configuration of a master with two workers would be as follows:

NAME                                            TYPE    VERSION  ADDRESS        
master                                          master  4.2.0    172.31.14.13   
Test_cluster_performance_sprint2_B10_manager_1  worker  4.2.0    172.31.3.193   
Test_cluster_performance_sprint2_B10_manager_2  worker  4.2.0    172.31.15.234
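
As an illustration of the per-node block described above, here is a minimal Python sketch that renders an `<ossec_config>` block with a `<cluster>` section. It is not the pipeline's Groovy code. The element names follow the Wazuh cluster configuration reference; the function name, the cluster name, the port default and the key are assumptions made for this example.

```python
from xml.sax.saxutils import escape


def cluster_block(node_name, node_type, master_ip, key,
                  cluster_name="wazuh", port=1516):
    """Render the <ossec_config> block that would be appended for one node.

    All argument values are hypothetical; the real pipeline takes them
    from the Jenkins parameters and the deployed instances.
    """
    if node_type not in ("master", "worker"):
        raise ValueError(f"unsupported node type: {node_type}")
    lines = [
        "<ossec_config>",
        "  <cluster>",
        f"    <name>{escape(cluster_name)}</name>",
        f"    <node_name>{escape(node_name)}</node_name>",
        f"    <node_type>{node_type}</node_type>",
        f"    <key>{escape(key)}</key>",
        f"    <port>{port}</port>",
        "    <bind_addr>0.0.0.0</bind_addr>",
        "    <nodes>",
        # Workers reach the cluster through the master address.
        f"      <node>{escape(master_ip)}</node>",
        "    </nodes>",
        "    <hidden>no</hidden>",
        "    <disabled>no</disabled>",
        "  </cluster>",
        "</ossec_config>",
    ]
    return "\n".join(lines) + "\n"


# Placeholder 32-character key, for illustration only.
EXAMPLE_KEY = "0123456789abcdef0123456789abcdef"
worker_conf = cluster_block("manager_1", "worker", "172.31.14.13", EXAMPLE_KEY)
```

Every node gets the same cluster name, key and master address; only `node_name` and `node_type` change between the master and the workers.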

jmv74211 commented 3 years ago

Task 5 and 10

Because task 5 depends on task 10, the two have been done together.

The directory and file structure for each of the instances (master and worker nodes) has already been organized.

At the end you get a tar.gz file that contains one directory per instance, each holding that instance's data, logs and graphs.

The changes made are as follows: https://github.com/wazuh/wazuh-jenkins/commit/85c7ab88568b7d84ea1807df3cddcfa54cb25a65
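
The packaging step can be pictured with the following Python sketch, which assumes that the per-instance directories already sit under a common root. The function name and the directory names are hypothetical; the linked commit is the reference for what the pipeline really does.

```python
import tarfile
from pathlib import Path


def pack_artifacts(root: Path, archive: Path) -> list:
    """Pack each instance directory under `root` into a gzip-compressed tar.

    Returns the instance directory names that were added, in sorted order.
    The archive should live outside `root` so it does not include itself.
    """
    instance_dirs = sorted(p for p in root.iterdir() if p.is_dir())
    with tarfile.open(archive, "w:gz") as tar:
        for instance in instance_dirs:
            # arcname keeps paths relative, e.g. "master/logs/cluster.log".
            tar.add(instance, arcname=instance.name)
    return [p.name for p in instance_dirs]
```

A call such as `pack_artifacts(Path("artifacts"), Path("artifacts.tar.gz"))` would then produce a single file that Jenkins can store as a build artifact.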

jmv74211 commented 3 years ago

Task 6

The content of the custom local_internal_options file entered from the Jenkins UI has been added to the cluster provisioning, so each node will have the specified configuration.

The changes made are as follows: https://github.com/wazuh/wazuh-jenkins/commit/23cb4968744eb2478fa49f531812844126cbc905

jmv74211 commented 3 years ago

Task 7

It is now possible to select the type of protocol to be used for communication with the cluster. This configuration is performed during the provisioning and configuration of all nodes.

The changes made are as follows: https://github.com/wazuh/wazuh-jenkins/commit/45850570c1c14a3ff696d24bea44ff8e1fb66d4b

jmv74211 commented 3 years ago

Task 8

While testing the agent simulator changes that let an agent register on the master node and then connect to a worker, we discovered that the cluster configuration parser does not work as expected. All the details are in this issue https://github.com/wazuh/wazuh/issues/8229.

These agent simulator changes have been merged into the wazuh-qa repository with the following PR https://github.com/wazuh/wazuh-qa/pull/1233

jmv74211 commented 3 years ago

Task 9

The statistics and their graphs have been grouped by daemon. This makes it much easier to find the information you want.

The structure is as follows:

├── data
│   ├── binaries
│   │   ├── ...
│   └── stats
│       ├── analysisd
│       │   └── wazuh-analysisd_stats.csv
│       ├── logcollectord
│       │   ├── CSV 1
│       │   ├── CSV 2
│       │   ├── ...
│       └── remoted
│           └── wazuh-remoted_stats.csv
├── logs
│   ├── ...
└── plots
    ├── binaries
    │   ├── ...
    └── stats
        ├── analysisd
        │   ├── SVG 1
        │   ├── SVG 2
        │   ├── ...
        ├── logcollectord
        │   ├── SVG 1
        │   ├── SVG 2
        │   ├── ...
        └── remoted
            ├── SVG 1
            ├── SVG 2
            ├── ...
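
A small Python sketch of how a stats file could be mapped onto this layout. The helper name and the validation are assumptions for illustration; only the directory names come from the tree above.

```python
from pathlib import PurePosixPath

# Daemon directories shown in the tree above.
DAEMONS = ("analysisd", "logcollectord", "remoted")


def artifact_path(kind: str, daemon: str, filename: str) -> PurePosixPath:
    """Return the relative path of a stats file inside one instance directory.

    `kind` is "data" for CSV files and "plots" for SVG files.
    """
    if kind not in ("data", "plots"):
        raise ValueError(f"unknown kind: {kind}")
    if daemon not in DAEMONS:
        raise ValueError(f"unknown daemon: {daemon}")
    return PurePosixPath(kind) / "stats" / daemon / filename
```

For example, `artifact_path("data", "remoted", "wazuh-remoted_stats.csv")` gives `data/stats/remoted/wazuh-remoted_stats.csv`, which matches the tree.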

jmv74211 commented 3 years ago

Task 11

To solve the conflict error when parsing the cluster blocks in ossec.conf (mentioned in the previous task), I have added a step that removes the existing cluster block from ossec.conf before the new cluster configuration is added.

See the changes here https://github.com/wazuh/wazuh-jenkins/commit/3d840a75f5bd2290ebc15d342fea4ac8de6018a5
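
The idea can be sketched in Python as follows. This is only an illustration that assumes the block is delimited by plain `<cluster>` and `</cluster>` tags; the pipeline change itself is in the commit linked above and may use a different mechanism.

```python
import re

# Matches a whole <cluster>...</cluster> block, including its indentation
# and the newline that follows the closing tag.
CLUSTER_BLOCK = re.compile(r"[ \t]*<cluster>.*?</cluster>[ \t]*\n?", re.DOTALL)


def strip_cluster_blocks(conf: str) -> str:
    """Return the ossec.conf text with every <cluster> block removed."""
    return CLUSTER_BLOCK.sub("", conf)


sample = (
    "<ossec_config>\n"
    "  <cluster>\n"
    "    <name>wazuh</name>\n"
    "    <disabled>yes</disabled>\n"
    "  </cluster>\n"
    "  <global/>\n"
    "</ossec_config>\n"
)
cleaned = strip_cluster_blocks(sample)
```

After this step the file holds no cluster block, so appending the new one leaves exactly one definition for the parser to read.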

jmv74211 commented 3 years ago

Task 12

I've been researching the AWS load balancing service.

To use this service from our Jenkins pipeline, we will need to create a set of additional modules that make the required calls through the AWS CLI. These calls allow us to:

Create the load balancer and define a set of listeners: https://docs.aws.amazon.com/cli/latest/reference/elb/create-load-balancer.html#create-load-balancer
Assign the corresponding instances to our load balancer: https://docs.aws.amazon.com/cli/latest/reference/elb/create-load-balancer.html#create-load-balancer
Remove load balancer after pipeline execution completes: https://docs.aws.amazon.com/elasticloadbalancing/latest/classic/elb-deregister-register-instances.html

I am going to start developing this new module and will comment here as I progress.
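
As a rough sketch of the three calls listed above, the following Python helpers build the AWS CLI argument lists for a Classic Load Balancer. The subcommands and flags are those of `aws elb`; the helper names, the load balancer name, the port, the subnet and the instance IDs are placeholders chosen for this example. Nothing is executed here.

```python
def create_lb_cmd(name, port, subnets, protocol="TCP"):
    """Build the command that creates the load balancer with one listener."""
    listener = (
        f"Protocol={protocol},LoadBalancerPort={port},"
        f"InstanceProtocol={protocol},InstancePort={port}"
    )
    return ["aws", "elb", "create-load-balancer",
            "--load-balancer-name", name,
            "--listeners", listener,
            "--subnets", *subnets]


def register_instances_cmd(name, instance_ids):
    """Build the command that attaches the worker instances."""
    return ["aws", "elb", "register-instances-with-load-balancer",
            "--load-balancer-name", name,
            "--instances", *instance_ids]


def delete_lb_cmd(name):
    """Build the command that removes the load balancer after the test."""
    return ["aws", "elb", "delete-load-balancer",
            "--load-balancer-name", name]


# Hypothetical values, for illustration only.
create = create_lb_cmd("perf-test-lb", 1514, ["subnet-0example"])
register = register_instances_cmd("perf-test-lb", ["i-0aaa", "i-0bbb"])
delete = delete_lb_cmd("perf-test-lb")
```

Each list can be passed to `subprocess.run(cmd, check=True)` on a host with configured AWS credentials.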

jmv74211 commented 3 years ago

I have already finished the development of the load balancer. I had to make some additional adjustments to the agent simulator and the script because I found some bugs during this development.

As a result, an on-demand load balancer is created in AWS whenever the number of worker nodes is greater than 0. All the worker nodes are registered in this load balancer, and all agent connections are distributed among them, as the following example shows:

[root@ip-172-31-15-151 ec2-user]# /var/ossec/bin/cluster_control -a
ID   NAME                           IP         STATUS  VERSION       NODE NAME                                       
000  ip-172-31-15-151.ec2.internal  127.0.0.1  active  Wazuh v4.2.0  master                                          
001  1-db6873f1-debian10            10.0.2.15  active  Wazuh 4.2.0   Test_cluster_performance_sprint2_B89_manager_1  
002  1-26cf3108-debian10            10.0.2.15  active  Wazuh 4.2.0   Test_cluster_performance_sprint2_B89_manager_2  
003  1-ed742f0a-debian10            10.0.2.15  active  Wazuh 4.2.0   Test_cluster_performance_sprint2_B89_manager_3  
004  1-0b8047a6-debian10            10.0.2.15  active  Wazuh 4.2.0   Test_cluster_performance_sprint2_B89_manager_1  
005  1-42d49a73-debian10            10.0.2.15  active  Wazuh 4.2.0   Test_cluster_performance_sprint2_B89_manager_1  
006  1-5ebd8747-debian10            10.0.2.15  active  Wazuh 4.2.0   Test_cluster_performance_sprint2_B89_manager_3
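
To check how evenly the agents were spread, output like the listing above can be tallied per node. The Python sketch below assumes the columns are separated by two or more spaces, as in this listing; the function name is hypothetical.

```python
import re
from collections import Counter


def agents_per_node(output: str) -> Counter:
    """Count rows of `cluster_control -a` output by their NODE NAME column."""
    counts = Counter()
    for line in output.splitlines():
        fields = re.split(r"\s{2,}", line.strip())
        # Data rows start with a numeric ID; this skips the header and prompt.
        if len(fields) >= 6 and fields[0].isdigit():
            counts[fields[-1]] += 1
    return counts


listing = """\
ID   NAME                           IP         STATUS  VERSION       NODE NAME
000  ip-172-31-15-151.ec2.internal  127.0.0.1  active  Wazuh v4.2.0  master
001  1-db6873f1-debian10            10.0.2.15  active  Wazuh 4.2.0   Test_cluster_performance_sprint2_B89_manager_1
002  1-26cf3108-debian10            10.0.2.15  active  Wazuh 4.2.0   Test_cluster_performance_sprint2_B89_manager_2
003  1-ed742f0a-debian10            10.0.2.15  active  Wazuh 4.2.0   Test_cluster_performance_sprint2_B89_manager_3
004  1-0b8047a6-debian10            10.0.2.15  active  Wazuh 4.2.0   Test_cluster_performance_sprint2_B89_manager_1
005  1-42d49a73-debian10            10.0.2.15  active  Wazuh 4.2.0   Test_cluster_performance_sprint2_B89_manager_1
006  1-5ebd8747-debian10            10.0.2.15  active  Wazuh 4.2.0   Test_cluster_performance_sprint2_B89_manager_3
"""
distribution = agents_per_node(listing)
```

For this listing the six simulated agents are split 3, 1 and 2 across the three workers, and the entry with ID 000 is the master itself.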

At the end of the test, this load balancer is destroyed like the rest of the AWS instances.

The changes made are as follows https://github.com/wazuh/wazuh-jenkins/commit/d8be61449829ee7bc6b3431143fab7837ddb31de

jmv74211 commented 3 years ago

Task 14

I have added two new parameters to the pipeline to select the type of instances to deploy for agents and managers.

Also, I have updated the deployment logic to the following:

If the testing time exceeds the threshold (45 min), c5xlarge is selected for both agents and managers.
In auto mode, t2.medium is chosen unless the test time exceeds the threshold or the user has explicitly selected c5xlarge.
If the user chooses c5xlarge, it is selected.
If the test mode is AGENTS, the instances used to deploy agents are c5xlarge.
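
One possible reading of these rules, written as a Python sketch. The function and parameter names are assumptions for illustration, and the precedence between rules is my interpretation; the linked commit is the authority on the actual behaviour.

```python
THRESHOLD_MINUTES = 45  # threshold stated in the rules above


def pick_instance(requested: str, test_minutes: int) -> str:
    """Choose one instance type from a UI selection ("auto" or a type name)."""
    # Long tests and explicit requests both lead to the larger instance.
    if requested == "c5xlarge" or test_minutes > THRESHOLD_MINUTES:
        return "c5xlarge"
    return "t2.medium"


def select_instance_types(test_minutes, agent_request="auto",
                          manager_request="auto", test_mode=""):
    """Return (agent_type, manager_type) for one pipeline run."""
    agent = pick_instance(agent_request, test_minutes)
    manager = pick_instance(manager_request, test_minutes)
    if test_mode == "AGENTS":
        # Agent-focused tests always use the larger instance for agents.
        agent = "c5xlarge"
    return agent, manager
```

Under this reading, a 30-minute test in auto mode runs on t2.medium for both roles, while the same test in AGENTS mode moves only the agent instances to c5xlarge.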

The changes made are as follows https://github.com/wazuh/wazuh-jenkins/commit/a5fac3783b5a2b91392bed6b26cc9eaa2d3ebeaa

BraulioV commented 3 years ago

Closed by https://github.com/wazuh/wazuh-jenkins/pull/2518

wazuh / wazuh-qa

Performance tests: cluster of N nodes scenario #1139
