Manual testing: API request to balance Agents (restart TCP session)

jmv74211 commented 2 years ago

We want to manually test the development related to the API request to balance agents (restart TCP session).

The objective is to test the correct functioning and behavior of Wazuh after using the PUT /agents/reconnect API endpoint to force the reconnection of the wazuh-agent to the wazuh-manager.

The background is the need to restart the TCP connection established between the wazuh-agent and the wazuh-manager without restarting the wazuh-agent service itself, which would restart its daemons and trigger new, unwanted scans. This way, if a load balancer is used, it can redirect that wazuh-agent to another wazuh-manager node in the cluster that has less load.

Reference issue: https://github.com/wazuh/wazuh/issues/7896

jmv74211 commented 2 years ago

Working on this issue.

Estimated tasks

[x] (T1): Build and deploy required environment
[x] (T2): Configure load balancer
[x] (T3): Connect agents to the load balancer and test PUT /agents/reconnect endpoint to analyze results.
[x] (T4): Connect an agent to the manager, force a connection restart and check that it continues reporting alerts.

Conclusion 🟢

The call to the PUT /agents/reconnect endpoint works as expected. The TCP connection established between the wazuh-agent and the wazuh-manager is restarted without having to restart the wazuh-agent service. If the wazuh-agent is connected through a load balancer, this allows it to connect to another worker of the wazuh-cluster.

It has also been verified that restarting the connection with the same wazuh-manager does not cause any kind of failure, since after that restart the wazuh-manager continues to report wazuh-agent alerts.

Checks

[x] The PUT /agents/reconnect endpoint forces one or several wazuh-agents (specified as a list in the request parameters) to restart their connection.
[x] TCP connection is successfully restarted
[x] The wazuh-agent service is not restarted.
[x] Module scans are not launched after connection restart.
[x] If the wazuh-agent is connected to a load balancer and has no persistent connection configuration, the wazuh-agent can connect to another node in the cluster.
[x] If the wazuh-agent is connected to a wazuh-manager and the connection is restarted, the wazuh-agent continues to report without problems and the wazuh-manager generates the wazuh-agent alerts.
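For the first check, several agents are passed as a comma-separated agents_list. A minimal sketch of building that request URL follows; the reconnect_url helper is hypothetical (it is not part of Wazuh), and the host is the master address used in this test.

```shell
# Master API address used in this test environment; adjust for other setups.
API_HOST="https://172.16.1.40:55000"

# Hypothetical helper: build the reconnect URL for one or more agent IDs.
reconnect_url() {
    # Inside the subshell, "$*" joins the arguments with the first
    # character of IFS, so "001 002" becomes "001,002".
    echo "${API_HOST}/agents/reconnect?agents_list=$(IFS=,; echo "$*")"
}

# Usage (needs a valid $TOKEN and a reachable API, so it is left commented out):
# curl -k -X PUT "$(reconnect_url 001 002 003)" -H "Authorization: Bearer $TOKEN"
```

Quoting the URL keeps the shell from treating the ? as a glob character.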

jmv74211 commented 2 years ago

Task 1: Build and deploy required environment

First, I will deploy an environment with the following features:

Installation  Version     Package date
Packages      v4.3.0-rc5  March 29

Manager 1: Master - RPM
Manager 2: Worker 1 - DEB
Manager 3: Worker 2 - RPM
Manager 4: Worker 3 - DEB
Load balancer: NGINX
Agent 1: DEB
Agent 2: RPM
Agent 3: DEB
Agent 4: RPM
Agent 5: DEB

jmv74211 commented 2 years ago

Task 1: Build and deploy required environment

Cluster nodes

NAME      TYPE    VERSION  ADDRESS      
master    master  4.3.0    172.16.1.40  
worker-2  worker  4.3.0    172.16.1.42  
worker-1  worker  4.3.0    172.16.1.41  
worker-3  worker  4.3.0    172.16.1.43

Agent connections

ID   NAME           IP         STATUS  VERSION       NODE NAME  
000  wazuh-master   127.0.0.1  active  Wazuh v4.3.0  master     
001  wazuh-agent-1  10.0.2.15  active  Wazuh v4.3.0  worker-1   
002  wazuh-agent-5  10.0.2.15  active  Wazuh v4.3.0  worker-1   
003  wazuh-agent-3  10.0.2.15  active  Wazuh v4.3.0  worker-3   
004  wazuh-agent-2  10.0.2.15  active  Wazuh v4.3.0  worker-3   
005  wazuh-agent-4  10.0.2.15  active  Wazuh v4.3.0  worker-1

Note: For some reason, no agent has connected to worker-2. We will check this again when we force reconnections using the API endpoint.
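To make the distribution easier to read, the listing above can be summarized per node. This is only a sketch: it assumes the listing has been saved to a file (the name agents.txt is hypothetical) and that the node name is always the last column.

```shell
# Count agents per cluster node from a listing whose last column is the node name.
# Reads the listing on stdin; the header row and the manager itself (ID 000) are skipped.
agents_per_node() {
    awk 'NR > 1 && $1 != "000" { count[$NF]++ }
         END { for (node in count) print node, count[node] }' | sort
}

# Usage (agents.txt is a hypothetical file holding the listing):
# agents_per_node < agents.txt
```

With the listing above, this reports 3 agents on worker-1 and 2 on worker-3. Nodes without agents, such as worker-2, do not appear in the output.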

jmv74211 commented 2 years ago

Task 2: Configure load balancer

NGINX configuration

load_module /usr/lib/nginx/modules/ngx_stream_module.so;

events {}

stream {
    upstream master {
        server 172.16.1.40:1515;
    }
    upstream mycluster {
        hash $remote_addr consistent;
        server 172.16.1.41:1514;
        server 172.16.1.42:1514;
        server 172.16.1.43:1514;
    }
    server {
        listen 1515;
        proxy_pass master;
    }
    server {
        listen 1514;
        proxy_pass mycluster;
    }
}

jmv74211 commented 2 years ago

Task 3: Connect agents to the load balancer and test `PUT /agents/reconnect` endpoint to analyze results.

Obtaining authentication token (using default credentials)

TOKEN=$(curl -u wazuh:wazuh -k -X GET "https://172.16.1.40:55000/security/user/authenticate?raw=true")

The current connection of wazuh-agent-1 is with worker-1.

001  wazuh-agent-1  10.0.2.15  active  Wazuh v4.3.0  worker-1

After making the reconnection request:

curl -k -X PUT "https://172.16.1.40:55000/agents/reconnect?agents_list=001" -H "Authorization: Bearer $TOKEN"

The log of wazuh-agent-1 shows that it has been reconnected:

2022/04/12 10:36:46 wazuh-agentd: INFO: Wazuh Agent will be reconnected because a reconnect message was received
2022/04/12 10:36:46 wazuh-agentd: INFO: Closing connection to server (172.16.1.50:1514/tcp).
2022/04/12 10:36:46 wazuh-agentd: INFO: Trying to connect to server (172.16.1.50:1514/tcp).
2022/04/12 10:36:46 wazuh-agentd: INFO: (4102): Connected to the server (172.16.1.50:1514/tcp).
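The same check can be scripted instead of reading the log by eye. A minimal sketch, assuming the message texts match the ones shown above; the reconnected_ok helper is hypothetical.

```shell
# Hypothetical check: succeed if the log file given as $1 contains a reconnect
# message that is followed by a successful connection line.
reconnected_ok() {
    awk '/reconnect message was received/ { seen = 1 }
         seen && /Connected to the server/ { ok = 1 }
         END { exit !ok }' "$1"
}

# Usage on the agent host (default Linux agent log path):
# reconnected_ok /var/ossec/logs/ossec.log && echo "agent reconnected"
```

Note that this scans the whole file, so an older reconnection in the same log would also match. Truncating the input to recent lines (for example with tail) avoids that.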

When checking which worker wazuh-agent-1 is now connected to, we see that it has reconnected to the same one:

001  wazuh-agent-1  10.0.2.15  active  Wazuh v4.3.0  worker-1

This is because the NGINX configuration applies the hash $remote_addr consistent; directive, which maps each client address to the same upstream server and therefore makes connections persistent.

After commenting out this directive, restarting the NGINX service, and calling the endpoint again to force the reconnection of wazuh-agent-1, we see that it has connected to a new worker, in this case worker-3:

001  wazuh-agent-1  10.0.2.15  active  Wazuh v4.3.0  worker-3

If we force another reconnection, the agent moves again, this time to worker-2:

001  wazuh-agent-1  10.0.2.15  active  Wazuh v4.3.0  worker-2
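For reference, the NGINX change can be sketched as follows. This is an illustration rather than the exact procedure used in the test: the comment_out_hash helper is hypothetical, and the config path in the usage comment is an assumption. With no balancing method set in the upstream block, NGINX falls back to its default round-robin distribution.

```shell
# Hypothetical helper: print the config file given as $1 with the
# "hash $remote_addr consistent;" directive commented out.
# The file itself is not modified; redirect the output to apply the change.
comment_out_hash() {
    sed -E 's/^([[:space:]]*)(hash[[:space:]]+\$remote_addr[[:space:]]+consistent;)/\1# \2/' "$1"
}

# Usage (the path is an assumption; review the result before installing it
# and restarting the NGINX service):
# comment_out_hash /etc/nginx/nginx.conf > /tmp/nginx.conf.new
```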

jmv74211 commented 2 years ago

Task 4: Connect an agent to the manager, force a connection restart and check that it continues reporting alerts.

I have configured a wazuh-agent to always report to the same wazuh-manager.

I then applied the following syscheck configuration to monitor a directory on the wazuh-agent host:

<directories realtime="yes">/var/log/test</directories>

After that, I forced the connection to restart.

curl -k -X PUT "https://172.16.1.40:55000/agents/reconnect?agents_list=001" -H "Authorization: Bearer $TOKEN"

I then created a new file to trigger a new alert:

echo "test" >> /var/log/test/a.txt

The alert was generated correctly in the wazuh-manager:

** Alert 1649770780.2104758: - ossec,syscheck,syscheck_entry_added,syscheck_file,pci_dss_11.5,gpg13_4.11,gdpr_II_5.1.f,hipaa_164.312.c.1,hipaa_164.312.c.2,nist_800_53_SI.7,tsc_PI1.4,tsc_PI1.5,tsc_CC6.1,tsc_CC6.8,tsc_CC7.2,tsc_CC7.3,
2022 Apr 12 13:39:40 (wazuh-agent-1) any->syscheck
Rule: 554 (level 5) -> 'File added to the system.'
File '/var/log/test/a.txt' added
Mode: realtime

Attributes:
 - Size: 5
 - Permissions: rw-r--r--
 - Date: Tue Apr 12 13:39:41 2022
 - Inode: 1049535
 - User: root (0)
 - Group: root (0)
 - MD5: d8e8fca2dc0f896fd7cb4cb0031ba249
 - SHA1: 4e1243bd22c66e76c2ba9eddc1f91394e57f9f83
 - SHA256: f2ca1bb6c7e907d06dafe4687e579fce76b37e4e93b7605022da52e6ccc26fd2
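This verification could also be automated on the manager side. A minimal sketch, assuming the alert text has the same "File '...' added" form as the alert above; the file_added_alert helper is hypothetical.

```shell
# Hypothetical check: succeed if the alerts log given as $1 contains a
# "File '<path>' added" line for the path given as $2.
file_added_alert() {
    # -F treats the pattern as a fixed string, so dots and slashes
    # in the path are matched literally.
    grep -qF "File '$2' added" "$1"
}

# Usage on the manager (default alerts log path):
# file_added_alert /var/ossec/logs/alerts/alerts.log /var/log/test/a.txt && echo "alert found"
```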
