Closed Rebits closed 1 month ago
We can check that all these warnings/errors can be splitter to:
agent5
During the registration process, the agent was successfully registered with the manager. However, a network error occurred, interrupting communication. As a result, the agent could not complete its registration. When it attempted to register again, it found that an agent with the same name was already present. The agent was successfully registered after 10 minutes.
2024/08/25 19:47:28 wazuh-agentd: ERROR: (1137): Lost connection with manager. Setting lock.
2024/08/25 19:47:28 wazuh-agentd: INFO: Closing connection to server ([172.31.9.129]:1514/tcp).
2024/08/25 19:47:28 wazuh-agentd: INFO: Trying to connect to server ([172.31.9.129]:1514/tcp).
2024/08/25 19:47:28 wazuh-agentd: ERROR: (1216): Unable to connect to '[172.31.9.129]:1514/tcp': 'Connection refused'.
2024/08/25 19:47:33 wazuh-agentd: INFO: Trying to connect to server ([172.31.9.129]:1514/tcp).
2024/08/25 19:47:33 wazuh-agentd: ERROR: (1216): Unable to connect to '[172.31.9.129]:1514/tcp': 'Connection refused'.
2024/08/25 19:47:38 wazuh-agentd: INFO: Trying to connect to server ([172.31.9.129]:1514/tcp).
2024/08/25 19:47:38 wazuh-agentd: ERROR: (1216): Unable to connect to '[172.31.9.129]:1514/tcp': 'Connection refused'.
2024/08/25 19:47:43 wazuh-agentd: INFO: Trying to connect to server ([172.31.9.129]:1514/tcp).
2024/08/25 19:47:43 wazuh-agentd: ERROR: (1216): Unable to connect to '[172.31.9.129]:1514/tcp': 'Connection refused'.
2024/08/25 19:47:48 wazuh-agentd: INFO: Trying to connect to server ([172.31.9.129]:1514/tcp).
2024/08/25 19:47:48 wazuh-agentd: ERROR: (1216): Unable to connect to '[172.31.9.129]:1514/tcp': 'Connection refused'.
2024/08/25 19:47:48 wazuh-agentd: INFO: Requesting a key from server: 172.31.9.129
2024/08/25 19:47:48 wazuh-agentd: ERROR: (1208): Unable to connect to enrollment service at '[172.31.9.129]:1515'
2024/08/25 19:47:53 wazuh-agentd: WARNING: (4101): Waiting for server reply (not started). Tried: '172.31.9.129'. Ensure that the manager version is 'v4.9.0' or higher.
2024/08/25 19:47:53 wazuh-agentd: WARNING: Unable to connect to any server.
2024/08/25 19:47:53 wazuh-agentd: INFO: Trying to connect to server ([172.31.9.129]:1514/tcp).
2024/08/25 19:47:53 wazuh-agentd: ERROR: (1216): Unable to connect to '[172.31.9.129]:1514/tcp': 'Connection refused'.
2024/08/25 19:47:58 wazuh-agentd: INFO: Trying to connect to server ([172.31.9.129]:1514/tcp).
2024/08/25 19:47:58 wazuh-agentd: ERROR: (1216): Unable to connect to '[172.31.9.129]:1514/tcp': 'Connection refused'.
2024/08/25 19:48:03 wazuh-agentd: INFO: Trying to connect to server ([172.31.9.129]:1514/tcp).
2024/08/25 19:48:03 wazuh-agentd: ERROR: (1216): Unable to connect to '[172.31.9.129]:1514/tcp': 'Connection refused'.
2024/08/25 19:48:08 wazuh-agentd: INFO: Trying to connect to server ([172.31.9.129]:1514/tcp).
2024/08/25 19:48:08 wazuh-agentd: ERROR: (1216): Unable to connect to '[172.31.9.129]:1514/tcp': 'Connection refused'.
2024/08/25 19:48:11 wazuh-modulesd:syscollector: INFO: Starting evaluation.
2024/08/25 19:48:11 wazuh-modulesd: WARNING: Process locked due to agent is offline. Waiting for connection...
2024/08/25 19:48:13 wazuh-agentd: INFO: Trying to connect to server ([172.31.9.129]:1514/tcp).
2024/08/25 19:48:13 wazuh-agentd: ERROR: (1216): Unable to connect to '[172.31.9.129]:1514/tcp': 'Connection refused'.
2024/08/25 19:48:13 wazuh-agentd: INFO: Requesting a key from server: 172.31.9.129
2024/08/25 19:48:13 wazuh-agentd: ERROR: (1208): Unable to connect to enrollment service at '[172.31.9.129]:1515'
2024/08/25 19:48:17 wazuh-logcollector: WARNING: Process locked due to agent is offline. Waiting for connection...
2024/08/25 19:48:18 wazuh-agentd: WARNING: (4101): Waiting for server reply (not started). Tried: '172.31.9.129'. Ensure that the manager version is 'v4.9.0' or higher.
2024/08/25 19:48:18 wazuh-agentd: WARNING: Unable to connect to any server.
2024/08/25 19:48:18 wazuh-agentd: INFO: Trying to connect to server ([172.31.9.129]:1514/tcp).
2024/08/25 19:48:18 wazuh-agentd: ERROR: (1216): Unable to connect to '[172.31.9.129]:1514/tcp': 'Connection refused'.
2024/08/25 19:48:23 wazuh-agentd: INFO: Trying to connect to server ([172.31.9.129]:1514/tcp).
2024/08/25 19:48:23 wazuh-agentd: ERROR: (1216): Unable to connect to '[172.31.9.129]:1514/tcp': 'Connection refused'.
2024/08/25 19:48:28 wazuh-agentd: INFO: Trying to connect to server ([172.31.9.129]:1514/tcp).
2024/08/25 19:48:28 wazuh-agentd: ERROR: (1216): Unable to connect to '[172.31.9.129]:1514/tcp': 'Connection refused'.
2024/08/25 19:48:33 wazuh-agentd: INFO: Trying to connect to server ([172.31.9.129]:1514/tcp).
2024/08/25 19:48:33 wazuh-agentd: ERROR: (1216): Unable to connect to '[172.31.9.129]:1514/tcp': 'Connection refused'.
2024/08/25 19:48:38 wazuh-agentd: INFO: Trying to connect to server ([172.31.9.129]:1514/tcp).
2024/08/25 19:48:38 wazuh-agentd: ERROR: (1216): Unable to connect to '[172.31.9.129]:1514/tcp': 'Connection refused'.
2024/08/25 19:48:38 wazuh-agentd: INFO: Requesting a key from server: 172.31.9.129
2024/08/25 19:48:44 wazuh-agentd: INFO: No authentication password provided
2024/08/25 19:48:44 wazuh-agentd: INFO: Using agent name as: agent5
2024/08/25 19:48:44 wazuh-agentd: INFO: Waiting for server reply
2024/08/25 19:48:46 wazuh-agentd: ERROR: Duplicate name (from manager)
2024/08/25 19:48:46 wazuh-agentd: ERROR: Unable to add agent (from manager)
2024/08/25 19:48:51 wazuh-agentd: WARNING: (4101): Waiting for server reply (not started). Tried: '172.31.9.129'. Ensure that the manager version is 'v4.9.0' or higher.
2024/08/25 19:48:51 wazuh-agentd: WARNING: Unable to connect to any server.
2024/08/25 19:48:51 wazuh-agentd: INFO: Trying to connect to server ([172.31.9.129]:1514/tcp).
2024/08/25 19:48:51 wazuh-agentd: INFO: (4102): Connected to the server ([172.31.9.129]:1514/tcp).
2024/08/25 19:48:49 wazuh-agent: INFO: Trying to connect to server ([172.31.9.129]:1514/tcp).
2024/08/25 19:48:49 wazuh-agent: INFO: (4102): Connected to the server ([172.31.9.129]:1514/tcp).
2024/08/25 19:48:49 wazuh-agent: INFO: Server responded. Releasing lock.
2024/08/25 19:48:52 wazuh-agent: INFO: Agent is now online. Process unlocked, continuing...
Both errors occurred in agents registered with manager2 due to a network issue, causing cluster and enrollment errors. These errors are not considered expected, so they have been excluded from the expected errors dictionary in the test. I suggest including a brief delay between the startup of the agent and the manager to prevent similar issues in the future. The inclusion of logic to remove defined client.keys has been discarded, as this is not typical user behavior.
bug/5698-include-timeout-prevent-network-errors
On hold due to no macOS are available: https://github.com/wazuh/wazuh/issues/25345
In order to test this development along with https://github.com/wazuh/wazuh-qa/issues/5699 it was created the branch tmp-fixes-4.9.0
that contains both branches.
I am currently testing over 4.9.0-rc2. Build: https://ci.wazuh.info/job/Test_e2e_system/365/
A new build was launched to include newer changes in https://github.com/wazuh/wazuh-qa/issues/5699#issuecomment-2326993520 Buid: https://ci.wazuh.info/job/Test_e2e_system/368/
The current approach results in constant updates of non-critical warnings in the expected list. To address this, we've decided to remove warning messages from the unexpected lists, reducing false positive errors related to manager connections, networking, and similar issues.
Additionally, a delay was introduced between the startup of the manager and the agents. This ensures the manager has time to fully initialize before the agents start, thereby preventing potential connection errors.
LGTM
Waiting until https://github.com/wazuh/wazuh-qa/issues/5715 to tests changes in the same build
The evidence can be found at https://github.com/wazuh/wazuh-qa/issues/5698#issuecomment-2353118831
We can notice in the manager2 errors in authd and db daemons related to the registrations of some agents:
2024/09/17 10:07:43 wazuh-authd: INFO: (1225): SIGNAL [(15)-(Terminated)] Received. Exit Cleaning...
2024/09/17 10:07:43 wazuh-authd: WARNING: Could not connect to socket 'queue/cluster/c-internal.sock': Connection refused (111).
2024/09/17 10:07:44 wazuh-authd: WARNING: Could not connect to socket 'queue/cluster/c-internal.sock': Connection refused (111).
2024/09/17 10:07:45 wazuh-authd: WARNING: Could not connect to socket 'queue/cluster/c-internal.sock': Connection refused (111).
2024/09/17 10:07:46 wazuh-authd: WARNING: Could not connect to socket 'queue/cluster/c-internal.sock': Connection refused (111).
2024/09/17 10:07:47 wazuh-authd: WARNING: Could not connect to socket 'queue/cluster/c-internal.sock': Connection refused (111).
2024/09/17 10:07:48 wazuh-authd: WARNING: Could not connect to socket 'queue/cluster/c-internal.sock': Connection refused (111).
2024/09/17 10:07:49 wazuh-authd: WARNING: Could not connect to socket 'queue/cluster/c-internal.sock': Connection refused (111).
2024/09/17 10:07:50 wazuh-authd: WARNING: Could not connect to socket 'queue/cluster/c-internal.sock': Connection refused (111).
2024/09/17 10:07:51 wazuh-authd: WARNING: Could not connect to socket 'queue/cluster/c-internal.sock': Connection refused (111).
2024/09/17 10:07:51 wazuh-authd: ERROR: Could not send message through the cluster after '10' attempts.
...
2024/09/17 10:08:12 wazuh-db: ERROR: There was an error assigning the groups to agent '001'
2024/09/17 10:08:12 wazuh-db: WARNING: The groups were empty right after the set for agent '001'
2024/09/17 10:08:12 wazuh-db: ERROR: There was an error assigning the groups to agent '002'
2024/09/17 10:08:12 wazuh-db: WARNING: The groups were empty right after the set for agent '002'
2024/09/17 10:08:12 wazuh-db: ERROR: There was an error assigning the groups to agent '003'
2024/09/17 10:08:12 wazuh-db: WARNING: The groups were empty right after the set for agent '003'
2024/09/17 10:08:12 wazuh-db: ERROR: There was an error assigning the groups to agent '004'
2024/09/17 10:08:12 wazuh-db: WARNING: The groups were empty right after the set for agent '004'
...
2024/09/17 10:08:18 wazuh-remoted: WARNING: (1408): Invalid ID 005 for the source ip: '172.31.13.221' (name 'unknown').
2024/09/17 10:08:23 wazuh-remoted: INFO: (1409): Authentication file changed. Updating.
2024/09/17 10:08:23 wazuh-remoted: INFO: (1410): Reading authentication keys file.
2024/09/17 10:08:46 wazuh-authd: INFO: New connection from 172.31.1.115
2024/09/17 10:08:46 wazuh-authd: INFO: Received request for a new agent (agent5) from: 172.31.1.115
2024/09/17 10:08:46 wazuh-authd: INFO: Dispatching request to master node
2024/09/17 10:08:46 wazuh-authd: INFO: Agent key generated for 'agent5' (requested by any)
2024/09/17 10:08:52 wazuh-db: WARNING: The groups were empty right after the set for agent '006'
This behavior has led to agent2 not being registered, making failing the initial tests. However, we can see that the agent in the rest of the tests registers with the manager and its vulnerabilities are being detected correctly.
These errors appear before the tests start, during the provision stages. A similar error was reported in https://github.com/wazuh/wazuh/issues/20915, https://github.com/wazuh/wazuh/issues/22855 and https://github.com/wazuh/wazuh/issues/16464. This race condition was [discarded](see https://github.com/wazuh/wazuh/issues/22855#issuecomment-2045538200) due to the low impact (agent finally registered).
Regarding Vulnerability E2E, it was created an issue to mitigate this issue https://github.com/wazuh/wazuh-qa/issues/5746
Known issues specified in https://github.com/wazuh/wazuh/issues/25506
https://github.com/wazuh/wazuh-qa/issues/5746 should not be necessary, since what is intended to be fixed in the new issue are the same connection errors reported in the main comment, so this should be fixed in this issue itself.
In the provisioning trace, we can see the following error that could specifically explain the errors regarding the cluster
19:53:52 TASK [../roles/wazuh/ansible-wazuh-manager : Install Wazuh Manager from .deb packages] ***
19:53:52 fatal: [master]: FAILED! => {
19:53:52 "changed": false
19:53:52 }
19:53:52
19:53:52 STDERR:
19:53:52
19:53:52 dpkg: error: dpkg frontend lock was locked by another process with pid 14257
19:53:52 Note: removing the lock file is always wrong, and can end up damaging the
19:53:52 locked area and the entire system. See <https://wiki.debian.org/Teams/Dpkg/FAQ>.
19:53:52
19:53:52
19:53:52
19:53:52 MSG:
19:53:52
19:53:52 dpkg --force-confdef --force-confold -i /tmp/ansible-tmp-1724612019.8497238-7973-66496375523271/wazuh-manager_4.9.0-1_amd64y_x3ilkb.deb failed
It seems that the retry in the provisioning could lead to an unstable state where is more probable that unexpected errors or warnings like the specified in the issue occur.
I suggest including a retry logic in this task.
Build after applying changes in provisioning methods: https://ci.wazuh.info/job/Test_e2e_system/384
Description
Unexpected errors and warnings were reported in https://github.com/wazuh/wazuh/issues/25363. It's necessary to perform a thorough research of them and include in case of they are expected in the errors to ignore dictionary in the vulnerability tests.
Evidences
Report: https://github.com/wazuh/wazuh/issues/25363#issue-2482492226
Tasks