Closed pro-akim closed 2 weeks ago
Based on a local test of version 4.8.1 conducted in a meeting with @juliamagan, we have observed the following:
- […] (`agent.state_interval`).
- […] `agent-auth.exe`, there is an additional delay corresponding to the `wazuh-remoted` configuration (`remoted.key_update_interval`).

It is highly likely that the discrepancies we are observing are due to configuration misalignments. However, considering the following points:
For these reasons, we do not believe this issue should be a blocker for the release of version 4.9.0.
Hello @pro-akim, is it possible to repeat the test using version 4.9.0 on an AMI where the unexpected behaviour is reproduced, and then run another test on an AMI where it behaves correctly, always with 4.9.0 and without destroying the environment?
Yes @cborla, the test was always performed with 4.9.0. What is possible is to run the test without destroying the EC2 instances and then log in to them to perform tests (this is how I found the difference against 4.8.1). I have not been able to reproduce the same behavior in Vagrant using the same machines.
Hi guys,
We think that having the `ossec.log` and `local_internal_options.conf` files should be enough to explain the behavior of the `wazuh-agent.state` file.
Therefore, we don't need access to the environment, but just those files for version 4.9.0:
In the environment where we noted different behavior:
| Main release stage issue # | # |
|---|---|
| Main footprint metrics issue # | #25092 |
| Version | 4.9.0 |
| Release stage # | Beta 1 |
| Tag | https://github.com/wazuh/wazuh/tree/v4.9.0-beta1 |
Repository: packages-dev.wazuh.com
Package path: pre-release
Package revision: 1
Jenkins build: https://ci.wazuh.info/job/Test_stress/5584/
The stress test was run again for 20 minutes; the result is attached in the previous comment.
- `windows.debug=2`
- `FREQUENCY_SCAN = 1`
Graph
Monitor log
As can be seen in the `monitor.log` file, the `wazuh-agent.state` file is read approximately every 1 to 1.5 seconds.
2024-09-04 19:58:31,406 Getting statistics data from C:\Program Files (x86)\ossec-agent\wazuh-agent.state
2024-09-04 19:58:31,406 Writing agentd.state info to Test_stress_B5584_windows_agentd_state.csv.
2024-09-04 19:58:31,923 Writing binary info to monitor-winagent-Test_stress_B5584_windows-pre-release.csv.
2024-09-04 19:58:32,938 Collecting resources usage of wazuh-agent.exe.
2024-09-04 19:58:32,938 Getting statistics data from C:\Program Files (x86)\ossec-agent\wazuh-agent.state
2024-09-04 19:58:32,938 Writing agentd.state info to Test_stress_B5584_windows_agentd_state.csv.
2024-09-04 19:58:33,454 Writing binary info to monitor-winagent-Test_stress_B5584_windows-pre-release.csv.
2024-09-04 19:58:34,455 Collecting resources usage of wazuh-agent.exe.
2024-09-04 19:58:34,455 Getting statistics data from C:\Program Files (x86)\ossec-agent\wazuh-agent.state
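The spacing between reads can be checked directly from the log timestamps. The following is a minimal Python sketch, assuming the timestamp format shown in the excerpt above; `read_intervals` is a hypothetical helper and not part of the monitoring script.

```python
from datetime import datetime

MARKER = "Getting statistics data from"


def read_intervals(lines):
    """Return the seconds elapsed between consecutive state-file reads."""
    stamps = []
    for line in lines:
        if MARKER in line:
            # The first 23 characters hold the timestamp "YYYY-MM-DD HH:MM:SS,mmm".
            stamps.append(datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S,%f"))
    return [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]


# Lines copied from the monitor.log excerpt above.
sample = [
    r"2024-09-04 19:58:31,406 Getting statistics data from C:\Program Files (x86)\ossec-agent\wazuh-agent.state",
    r"2024-09-04 19:58:32,938 Collecting resources usage of wazuh-agent.exe.",
    r"2024-09-04 19:58:32,938 Getting statistics data from C:\Program Files (x86)\ossec-agent\wazuh-agent.state",
    r"2024-09-04 19:58:34,455 Getting statistics data from C:\Program Files (x86)\ossec-agent\wazuh-agent.state",
]

intervals = read_intervals(sample)
print(intervals)
```

For the three reads in the excerpt, the gaps come out at about 1.5 seconds each, so a pending state that lasts less than that can fall entirely between two reads.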
Agent ossec.log
When agent debugging is enabled, each update of the `wazuh-agent.state` file prints the message `state.c:78 at write_state(): DEBUG: Updating state file.`
The log shows that the `Updating state file` message is printed in the same second in which the agent connects to the manager. Because the log timestamps have one-second resolution, the order in which the two messages are printed in this case does not determine which event came first.
It can also be observed that the thread in charge of creating and updating the file is started in the same second in which the agent reports that it is connected.
Pendings
There are only 2 peaks in the graph. In the log sequence, they correspond to the only two cases where the `Updating state file` message appears above the connection message within the same second.
2024/09/04 20:02:04 wazuh-agent[6100] state.c:78 at write_state(): DEBUG: Updating state file.
2024/09/04 20:02:04 wazuh-agent[6100] start_agent.c:365 at agent_handshake_to_server(): INFO: (4102): Connected to the server ([172.31.2.9]:1514/tcp).
2024/09/04 20:03:25 wazuh-agent[5656] state.c:78 at write_state(): DEBUG: Updating state file.
2024/09/04 20:03:25 wazuh-agent[5656] start_agent.c:365 at agent_handshake_to_server(): INFO: (4102): Connected to the server ([172.31.2.9]:1514/tcp).
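The same-second check described above can be automated over a filtered `ossec.log`. This is a minimal Python sketch, assuming the input contains only the two message types and one entry per line; `update_before_connect` is a hypothetical name used for illustration.

```python
UPDATE = "Updating state file"
CONNECTED = "Connected to the server"


def update_before_connect(lines):
    """Return the timestamps where an update entry is immediately followed
    by a connection entry carrying the same one-second timestamp."""
    hits = []
    for prev, cur in zip(lines, lines[1:]):
        # The first 19 characters hold the timestamp "YYYY/MM/DD HH:MM:SS".
        same_second = prev[:19] == cur[:19]
        if same_second and UPDATE in prev and CONNECTED in cur:
            hits.append(cur[:19])
    return hits


# Entries copied from the ossec.log excerpt above.
log = [
    "2024/09/04 20:02:04 wazuh-agent[6100] state.c:78 at write_state(): DEBUG: Updating state file.",
    "2024/09/04 20:02:04 wazuh-agent[6100] start_agent.c:365 at agent_handshake_to_server(): INFO: (4102): Connected to the server ([172.31.2.9]:1514/tcp).",
    "2024/09/04 20:03:25 wazuh-agent[5656] state.c:78 at write_state(): DEBUG: Updating state file.",
    "2024/09/04 20:03:25 wazuh-agent[5656] start_agent.c:365 at agent_handshake_to_server(): INFO: (4102): Connected to the server ([172.31.2.9]:1514/tcp).",
]

matches = update_before_connect(log)
print(matches)
```

The sketch only reports the printed order. As noted above, with one-second timestamps that order does not prove which event actually happened first.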
- CSV file pending state.
![image](https://github.com/user-attachments/assets/73271289-db1f-4e56-bbc7-8efe9d0fa806)
- `ossec.log` filtered to show only the `Updating state file` messages and the manager connection messages.
[ossec_filtered.zip](https://github.com/user-attachments/files/16882614/ossec_filtered.zip)
### Conclusion
- Even with the test sampling interval set to 1 second, the test is not guaranteed to capture the pending state of the agent.
- When the agent starts, it launches the thread that creates and updates the `wazuh-agent.state` file. In the AWS environment, this usually happens after the agent has established the connection with the manager.
- Counting the pending entries in a CSV file does not guarantee that the count equals the number of times the agent disconnected. In this case the sampling frequency is higher, which is why more pending entries appear, but that does not mean a delay in the connection cannot occur.
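The sampling argument in these points can be illustrated with a small simulation. This is a minimal Python sketch with made-up numbers: the pending windows, the 10-second run, and the function name `sample_states` are all assumptions chosen for illustration, not values taken from the test.

```python
def sample_states(pending_windows, end_ms, period_ms):
    """Sample the agent status every period_ms milliseconds from 0 to end_ms.

    pending_windows is a list of (begin_ms, finish_ms) intervals during
    which the state file would report 'pending'.
    """
    samples = []
    for t in range(0, end_ms + 1, period_ms):
        pending = any(begin <= t < finish for begin, finish in pending_windows)
        samples.append("pending" if pending else "connected")
    return samples


# Hypothetical run with two reconnections:
# a short pending window (0.5 s) and a long one (3.1 s).
windows = [(1200, 1700), (4500, 7600)]

samples = sample_states(windows, end_ms=10_000, period_ms=1000)
pending_count = samples.count("pending")
print(pending_count, "pending samples for", len(windows), "reconnections")
```

In this hypothetical run the short window falls between two samples and is missed, while the long window is counted on several consecutive samples, so the number of pending samples matches neither window individually nor the number of reconnections.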
While analyzing issue https://github.com/wazuh/wazuh-qa/issues/5705, it was observed that Windows 2019 on AWS (ami-0bf33f4cb48993eb) behaves differently from the other Windows operating systems reviewed (Vagrant Windows 2019, Desktop 10, and AWS Windows 2012). On this AMI, when the agent is restarted, the state file transitions extremely quickly from pending to connected, which alters the shape of the stress test graphs starting from 4.9.0-Alpha1. This behavior was not present in 4.8.1 and it is understood that the only change made was
This behavior could not be replicated on all operating systems, only on the one mentioned.
Details
Testing in AWS
Windows Server 2019 Datacenter 1809 (Build 17763.1999), c5a.2xlarge. The c5a.2xlarge instance is in the compute optimized family, with 8 vCPUs and 16.0 GiB of memory.
4.9.0 Build: https://ci.wazuh.info/job/Test_stress/5554/ B5554_agent_windows.tar.gz
The behaviour is different: almost no pending state is shown on the screen (monitoring interval 1 s).
ossec.conf
4.8.1 Build: https://ci.wazuh.info/job/Test_stress/5559/ B5559_agent_windows.tar.gz
ossec.conf

The result of this analysis should allow us to determine whether this behavior is expected or abnormal and, based on that, either adopt the new graphs as the new standard or make changes to the stress tests.