wazuh / wazuh-qa

Wazuh - Quality Assurance
GNU General Public License v2.0
65 stars 32 forks source link

Unexpected errors and warnings in the vulnerability detection E2E tests #5698

Closed Rebits closed 1 month ago

Rebits commented 2 months ago

Description

Unexpected errors and warnings were reported in https://github.com/wazuh/wazuh/issues/25363. It's necessary to perform a thorough research of them and include in case of they are expected in the errors to ignore dictionary in the vulnerability tests.

Evidences

Report: https://github.com/wazuh/wazuh/issues/25363#issue-2482492226

Tasks

Rebits commented 2 months ago

We can check that all these warnings/errors can be splitter to:

Unexpected connection error of agent5

During the registration process, the agent was successfully registered with the manager. However, a network error occurred, interrupting communication. As a result, the agent could not complete its registration. When it attempted to register again, it found that an agent with the same name was already present. The agent was successfully registered after 10 minutes.

2024/08/25 19:47:28 wazuh-agentd: ERROR: (1137): Lost connection with manager. Setting lock.
2024/08/25 19:47:28 wazuh-agentd: INFO: Closing connection to server ([172.31.9.129]:1514/tcp).
2024/08/25 19:47:28 wazuh-agentd: INFO: Trying to connect to server ([172.31.9.129]:1514/tcp).
2024/08/25 19:47:28 wazuh-agentd: ERROR: (1216): Unable to connect to '[172.31.9.129]:1514/tcp': 'Connection refused'.
2024/08/25 19:47:33 wazuh-agentd: INFO: Trying to connect to server ([172.31.9.129]:1514/tcp).
2024/08/25 19:47:33 wazuh-agentd: ERROR: (1216): Unable to connect to '[172.31.9.129]:1514/tcp': 'Connection refused'.
2024/08/25 19:47:38 wazuh-agentd: INFO: Trying to connect to server ([172.31.9.129]:1514/tcp).
2024/08/25 19:47:38 wazuh-agentd: ERROR: (1216): Unable to connect to '[172.31.9.129]:1514/tcp': 'Connection refused'.
2024/08/25 19:47:43 wazuh-agentd: INFO: Trying to connect to server ([172.31.9.129]:1514/tcp).
2024/08/25 19:47:43 wazuh-agentd: ERROR: (1216): Unable to connect to '[172.31.9.129]:1514/tcp': 'Connection refused'.
2024/08/25 19:47:48 wazuh-agentd: INFO: Trying to connect to server ([172.31.9.129]:1514/tcp).
2024/08/25 19:47:48 wazuh-agentd: ERROR: (1216): Unable to connect to '[172.31.9.129]:1514/tcp': 'Connection refused'.
2024/08/25 19:47:48 wazuh-agentd: INFO: Requesting a key from server: 172.31.9.129
2024/08/25 19:47:48 wazuh-agentd: ERROR: (1208): Unable to connect to enrollment service at '[172.31.9.129]:1515'
2024/08/25 19:47:53 wazuh-agentd: WARNING: (4101): Waiting for server reply (not started). Tried: '172.31.9.129'. Ensure that the manager version is 'v4.9.0' or higher.
2024/08/25 19:47:53 wazuh-agentd: WARNING: Unable to connect to any server.
2024/08/25 19:47:53 wazuh-agentd: INFO: Trying to connect to server ([172.31.9.129]:1514/tcp).
2024/08/25 19:47:53 wazuh-agentd: ERROR: (1216): Unable to connect to '[172.31.9.129]:1514/tcp': 'Connection refused'.
2024/08/25 19:47:58 wazuh-agentd: INFO: Trying to connect to server ([172.31.9.129]:1514/tcp).
2024/08/25 19:47:58 wazuh-agentd: ERROR: (1216): Unable to connect to '[172.31.9.129]:1514/tcp': 'Connection refused'.
2024/08/25 19:48:03 wazuh-agentd: INFO: Trying to connect to server ([172.31.9.129]:1514/tcp).
2024/08/25 19:48:03 wazuh-agentd: ERROR: (1216): Unable to connect to '[172.31.9.129]:1514/tcp': 'Connection refused'.
2024/08/25 19:48:08 wazuh-agentd: INFO: Trying to connect to server ([172.31.9.129]:1514/tcp).
2024/08/25 19:48:08 wazuh-agentd: ERROR: (1216): Unable to connect to '[172.31.9.129]:1514/tcp': 'Connection refused'.
2024/08/25 19:48:11 wazuh-modulesd:syscollector: INFO: Starting evaluation.
2024/08/25 19:48:11 wazuh-modulesd: WARNING: Process locked due to agent is offline. Waiting for connection...
2024/08/25 19:48:13 wazuh-agentd: INFO: Trying to connect to server ([172.31.9.129]:1514/tcp).
2024/08/25 19:48:13 wazuh-agentd: ERROR: (1216): Unable to connect to '[172.31.9.129]:1514/tcp': 'Connection refused'.
2024/08/25 19:48:13 wazuh-agentd: INFO: Requesting a key from server: 172.31.9.129
2024/08/25 19:48:13 wazuh-agentd: ERROR: (1208): Unable to connect to enrollment service at '[172.31.9.129]:1515'
2024/08/25 19:48:17 wazuh-logcollector: WARNING: Process locked due to agent is offline. Waiting for connection...
2024/08/25 19:48:18 wazuh-agentd: WARNING: (4101): Waiting for server reply (not started). Tried: '172.31.9.129'. Ensure that the manager version is 'v4.9.0' or higher.
2024/08/25 19:48:18 wazuh-agentd: WARNING: Unable to connect to any server.
2024/08/25 19:48:18 wazuh-agentd: INFO: Trying to connect to server ([172.31.9.129]:1514/tcp).
2024/08/25 19:48:18 wazuh-agentd: ERROR: (1216): Unable to connect to '[172.31.9.129]:1514/tcp': 'Connection refused'.
2024/08/25 19:48:23 wazuh-agentd: INFO: Trying to connect to server ([172.31.9.129]:1514/tcp).
2024/08/25 19:48:23 wazuh-agentd: ERROR: (1216): Unable to connect to '[172.31.9.129]:1514/tcp': 'Connection refused'.
2024/08/25 19:48:28 wazuh-agentd: INFO: Trying to connect to server ([172.31.9.129]:1514/tcp).
2024/08/25 19:48:28 wazuh-agentd: ERROR: (1216): Unable to connect to '[172.31.9.129]:1514/tcp': 'Connection refused'.
2024/08/25 19:48:33 wazuh-agentd: INFO: Trying to connect to server ([172.31.9.129]:1514/tcp).
2024/08/25 19:48:33 wazuh-agentd: ERROR: (1216): Unable to connect to '[172.31.9.129]:1514/tcp': 'Connection refused'.
2024/08/25 19:48:38 wazuh-agentd: INFO: Trying to connect to server ([172.31.9.129]:1514/tcp).
2024/08/25 19:48:38 wazuh-agentd: ERROR: (1216): Unable to connect to '[172.31.9.129]:1514/tcp': 'Connection refused'.
2024/08/25 19:48:38 wazuh-agentd: INFO: Requesting a key from server: 172.31.9.129
2024/08/25 19:48:44 wazuh-agentd: INFO: No authentication password provided
2024/08/25 19:48:44 wazuh-agentd: INFO: Using agent name as: agent5
2024/08/25 19:48:44 wazuh-agentd: INFO: Waiting for server reply
2024/08/25 19:48:46 wazuh-agentd: ERROR: Duplicate name (from manager)
2024/08/25 19:48:46 wazuh-agentd: ERROR: Unable to add agent (from manager)
2024/08/25 19:48:51 wazuh-agentd: WARNING: (4101): Waiting for server reply (not started). Tried: '172.31.9.129'. Ensure that the manager version is 'v4.9.0' or higher.
2024/08/25 19:48:51 wazuh-agentd: WARNING: Unable to connect to any server.
2024/08/25 19:48:51 wazuh-agentd: INFO: Trying to connect to server ([172.31.9.129]:1514/tcp).
2024/08/25 19:48:51 wazuh-agentd: INFO: (4102): Connected to the server ([172.31.9.129]:1514/tcp).

Conclusion

Both errors occurred in agents registered with manager2 due to a network issue, causing cluster and enrollment errors. These errors are not considered expected, so they have been excluded from the expected errors dictionary in the test. I suggest including a brief delay between the startup of the agent and the manager to prevent similar issues in the future. The inclusion of logic to remove defined client.keys has been discarded, as this is not typical user behavior.

Rebits commented 2 months ago
Rebits commented 2 months ago

On hold due to no macOS are available: https://github.com/wazuh/wazuh/issues/25345

Rebits commented 2 months ago

In order to test this development along with https://github.com/wazuh/wazuh-qa/issues/5699 it was created the branch tmp-fixes-4.9.0 that contains both branches.

I am currently testing over 4.9.0-rc2. Build: https://ci.wazuh.info/job/Test_e2e_system/365/

Rebits commented 2 months ago

A new build was launched to include newer changes in https://github.com/wazuh/wazuh-qa/issues/5699#issuecomment-2326993520 Buid: https://ci.wazuh.info/job/Test_e2e_system/368/

Rebits commented 2 months ago

Conclusion

The current approach results in constant updates of non-critical warnings in the expected list. To address this, we've decided to remove warning messages from the unexpected lists, reducing false positive errors related to manager connections, networking, and similar issues.

Additionally, a delay was introduced between the startup of the manager and the agents. This ensures the manager has time to fully initialize before the agents start, thereby preventing potential connection errors.

jseg380 commented 2 months ago

LGTM

Rebits commented 1 month ago

Waiting until https://github.com/wazuh/wazuh-qa/issues/5715 to tests changes in the same build

Rebits commented 1 month ago

Build: https://ci.wazuh.info/job/Test_e2e_system/382/ Report: Test_e2e_system_382_test_vulnerability_detector.zip

Rebits commented 1 month ago

Report analysis

The evidence can be found at https://github.com/wazuh/wazuh-qa/issues/5698#issuecomment-2353118831

Cluster connection error before the test started

We can notice in the manager2 errors in authd and db daemons related to the registrations of some agents:

2024/09/17 10:07:43 wazuh-authd: INFO: (1225): SIGNAL [(15)-(Terminated)] Received. Exit Cleaning...
2024/09/17 10:07:43 wazuh-authd: WARNING: Could not connect to socket 'queue/cluster/c-internal.sock': Connection refused (111).
2024/09/17 10:07:44 wazuh-authd: WARNING: Could not connect to socket 'queue/cluster/c-internal.sock': Connection refused (111).
2024/09/17 10:07:45 wazuh-authd: WARNING: Could not connect to socket 'queue/cluster/c-internal.sock': Connection refused (111).
2024/09/17 10:07:46 wazuh-authd: WARNING: Could not connect to socket 'queue/cluster/c-internal.sock': Connection refused (111).
2024/09/17 10:07:47 wazuh-authd: WARNING: Could not connect to socket 'queue/cluster/c-internal.sock': Connection refused (111).
2024/09/17 10:07:48 wazuh-authd: WARNING: Could not connect to socket 'queue/cluster/c-internal.sock': Connection refused (111).
2024/09/17 10:07:49 wazuh-authd: WARNING: Could not connect to socket 'queue/cluster/c-internal.sock': Connection refused (111).
2024/09/17 10:07:50 wazuh-authd: WARNING: Could not connect to socket 'queue/cluster/c-internal.sock': Connection refused (111).
2024/09/17 10:07:51 wazuh-authd: WARNING: Could not connect to socket 'queue/cluster/c-internal.sock': Connection refused (111).
2024/09/17 10:07:51 wazuh-authd: ERROR: Could not send message through the cluster after '10' attempts.
...
2024/09/17 10:08:12 wazuh-db: ERROR: There was an error assigning the groups to agent '001'
2024/09/17 10:08:12 wazuh-db: WARNING: The groups were empty right after the set for agent '001'
2024/09/17 10:08:12 wazuh-db: ERROR: There was an error assigning the groups to agent '002'
2024/09/17 10:08:12 wazuh-db: WARNING: The groups were empty right after the set for agent '002'
2024/09/17 10:08:12 wazuh-db: ERROR: There was an error assigning the groups to agent '003'
2024/09/17 10:08:12 wazuh-db: WARNING: The groups were empty right after the set for agent '003'
2024/09/17 10:08:12 wazuh-db: ERROR: There was an error assigning the groups to agent '004'
2024/09/17 10:08:12 wazuh-db: WARNING: The groups were empty right after the set for agent '004'
...
2024/09/17 10:08:18 wazuh-remoted: WARNING: (1408): Invalid ID 005 for the source ip: '172.31.13.221' (name 'unknown').
2024/09/17 10:08:23 wazuh-remoted: INFO: (1409): Authentication file changed. Updating.
2024/09/17 10:08:23 wazuh-remoted: INFO: (1410): Reading authentication keys file.
2024/09/17 10:08:46 wazuh-authd: INFO: New connection from 172.31.1.115
2024/09/17 10:08:46 wazuh-authd: INFO: Received request for a new agent (agent5) from: 172.31.1.115
2024/09/17 10:08:46 wazuh-authd: INFO: Dispatching request to master node
2024/09/17 10:08:46 wazuh-authd: INFO: Agent key generated for 'agent5' (requested by any)
2024/09/17 10:08:52 wazuh-db: WARNING: The groups were empty right after the set for agent '006'

This behavior has led to agent2 not being registered, making failing the initial tests. However, we can see that the agent in the rest of the tests registers with the manager and its vulnerabilities are being detected correctly.

These errors appear before the tests start, during the provision stages. A similar error was reported in https://github.com/wazuh/wazuh/issues/20915, https://github.com/wazuh/wazuh/issues/22855 and https://github.com/wazuh/wazuh/issues/16464. This race condition was [discarded](see https://github.com/wazuh/wazuh/issues/22855#issuecomment-2045538200) due to the low impact (agent finally registered).

Regarding Vulnerability E2E, it was created an issue to mitigate this issue https://github.com/wazuh/wazuh-qa/issues/5746

Missing vulnerabilities

Known issues specified in https://github.com/wazuh/wazuh/issues/25506

juliamagan commented 1 month ago

https://github.com/wazuh/wazuh-qa/issues/5746 should not be necessary, since what is intended to be fixed in the new issue are the same connection errors reported in the main comment, so this should be fixed in this issue itself.

Rebits commented 1 month ago

In the provisioning trace, we can see the following error that could specifically explain the errors regarding the cluster

19:53:52  TASK [../roles/wazuh/ansible-wazuh-manager : Install Wazuh Manager from .deb packages] ***
19:53:52  fatal: [master]: FAILED! => {
19:53:52      "changed": false
19:53:52  }
19:53:52  
19:53:52  STDERR:
19:53:52  
19:53:52  dpkg: error: dpkg frontend lock was locked by another process with pid 14257
19:53:52  Note: removing the lock file is always wrong, and can end up damaging the
19:53:52  locked area and the entire system. See <https://wiki.debian.org/Teams/Dpkg/FAQ>.
19:53:52  
19:53:52  
19:53:52  
19:53:52  MSG:
19:53:52  
19:53:52  dpkg --force-confdef --force-confold -i /tmp/ansible-tmp-1724612019.8497238-7973-66496375523271/wazuh-manager_4.9.0-1_amd64y_x3ilkb.deb failed

It seems that the retry in the provisioning could lead to an unstable state where is more probable that unexpected errors or warnings like the specified in the issue occur.

I suggest including a retry logic in this task.

Rebits commented 1 month ago

Build after applying changes in provisioning methods: https://ci.wazuh.info/job/Test_e2e_system/384