wazuh / wazuh-qa

Wazuh - Quality Assurance
GNU General Public License v2.0

Differences in Agent State Handling Between UNIX and Windows #5705

Closed vikman90 closed 2 weeks ago

vikman90 commented 2 weeks ago
Related issue: https://github.com/wazuh/wazuh/issues/25092

Description

During our investigation into the agent's behavior, we observed that the Windows agent's state graph differs from that of other systems. This discrepancy was expected for versions prior to 4.9.0, because in 4.9.0 we fixed a bug that caused the state file to be left behind when the agent shut down.

Given this, we suspect that the issue may lie in how the state file is handled and the agent's behavior when the file is not found. This might be due to differences in file handling between UNIX and Windows environments.

Proposed Actions

  1. Review the Plot Value: Investigate the value that is displayed in the plot when the state file does not exist. This will help determine if the discrepancy is due to differences in handling missing state files between UNIX and Windows.
  2. Explore Test-Level Solutions:
    • Open the state file with write permission for third-party applications.
    • Reduce the frequency of checks or operations involving the state file.
    • Optimize the code to perform a read operation as quickly as possible.
pro-akim commented 2 weeks ago

Initial Analysis

Reviewing the information provided by @cborla at https://github.com/wazuh/wazuh/issues/25092#issuecomment-2274700440, in particular the error message logged when monitor.py cannot access the file.

Analyzing ossec_Test_stress_B5477_windows_2024-08-05.zip

akim@akim-PC:~/Downloads/ossec_Test_stress_B5477_windows_2024-08-05$ cat monitor.log | grep 't access the file'
2024-08-03 02:47:33,469 Couldn't access the file
2024-08-03 02:54:16,141 Couldn't access the file
2024-08-03 03:07:41,852 Couldn't access the file
2024-08-03 03:14:24,423 Couldn't access the file
2024-08-03 03:27:50,149 Couldn't access the file
2024-08-03 03:34:32,687 Couldn't access the file
2024-08-03 03:47:58,363 Couldn't access the file
2024-08-03 03:54:40,920 Couldn't access the file
2024-08-03 04:08:06,480 Couldn't access the file
2024-08-03 04:14:49,048 Couldn't access the file
2024-08-03 04:28:14,720 Couldn't access the file
2024-08-03 04:34:57,312 Couldn't access the file
2024-08-03 04:41:39,900 Couldn't access the file
2024-08-03 04:55:05,595 Couldn't access the file
2024-08-03 05:01:48,102 Couldn't access the file
2024-08-03 05:15:13,799 Couldn't access the file
2024-08-03 05:21:56,421 Couldn't access the file
2024-08-03 05:28:39,009 Couldn't access the file
2024-08-03 05:42:04,697 Couldn't access the file
2024-08-03 05:48:47,276 Couldn't access the file
.
.
.

The time elapsed between consecutive messages was:

Between 02:47:33 and 02:54:16: 6 minutes and 43 seconds.
Between 02:54:16 and 03:07:41: 13 minutes and 25 seconds.
Between 03:07:41 and 03:14:24: 6 minutes and 43 seconds.
Between 03:14:24 and 03:27:50: 13 minutes and 26 seconds.
Between 03:27:50 and 03:34:32: 6 minutes and 42 seconds.
Between 03:34:32 and 03:47:58: 13 minutes and 26 seconds.
Between 03:47:58 and 03:54:40: 6 minutes and 42 seconds.
Between 03:54:40 and 04:08:06: 13 minutes and 26 seconds.
Between 04:08:06 and 04:14:49: 6 minutes and 43 seconds.
Between 04:14:49 and 04:28:14: 13 minutes and 25 seconds.
Between 04:28:14 and 04:34:57: 6 minutes and 43 seconds.
Between 04:34:57 and 04:41:39: 6 minutes and 42 seconds.
Between 04:41:39 and 04:55:05: 13 minutes and 26 seconds.
Between 04:55:05 and 05:01:48: 6 minutes and 43 seconds.
Between 05:01:48 and 05:15:13: 13 minutes and 25 seconds.
Between 05:15:13 and 05:21:56: 6 minutes and 43 seconds.
Between 05:21:56 and 05:28:39: 6 minutes and 43 seconds.
Between 05:28:39 and 05:42:04: 13 minutes and 25 seconds.
Between 05:42:04 and 05:48:47: 6 minutes and 43 seconds.

This makes me think there is a pattern.
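The gaps listed above can be recomputed from monitor.log rather than by hand. The following is a minimal sketch, not part of monitor.py: the function name `error_intervals` is hypothetical, and it assumes every matching line starts with a `YYYY-MM-DD HH:MM:SS` timestamp, as in the excerpt. Milliseconds are dropped so the results match the whole-second figures above.

```python
from datetime import datetime


def error_intervals(lines, marker="Couldn't access the file"):
    """Return the gaps, in whole seconds, between consecutive matching log lines."""
    stamps = [
        # The first 19 characters hold the timestamp without milliseconds,
        # e.g. "2024-08-03 02:47:33".
        datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")
        for line in lines
        if marker in line
    ]
    return [int((b - a).total_seconds()) for a, b in zip(stamps, stamps[1:])]


# First four error lines from the B5477 monitor.log excerpt.
sample = [
    "2024-08-03 02:47:33,469 Couldn't access the file",
    "2024-08-03 02:54:16,141 Couldn't access the file",
    "2024-08-03 03:07:41,852 Couldn't access the file",
    "2024-08-03 03:14:24,423 Couldn't access the file",
]

for gap in error_intervals(sample):
    print(f"{gap // 60} minutes and {gap % 60} seconds")
```

To run it against the real file, pass `open("monitor.log")` instead of `sample`.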


For reference, the monitor writes a log entry at the following rate:

akim@akim-PC:~/Downloads/ossec_Test_stress_B5477_windows_2024-08-05$ cat monitor.log | grep 'Writing binary' | head
2024-08-02 20:42:06,048 Writing binary info to monitor-winagent-Test_stress_B5477_windows-pre-release.csv.
2024-08-02 20:42:11,566 Writing binary info to monitor-winagent-Test_stress_B5477_windows-pre-release.csv.
2024-08-02 20:42:17,085 Writing binary info to monitor-winagent-Test_stress_B5477_windows-pre-release.csv.
2024-08-02 20:42:22,604 Writing binary info to monitor-winagent-Test_stress_B5477_windows-pre-release.csv.
2024-08-02 20:42:28,123 Writing binary info to monitor-winagent-Test_stress_B5477_windows-pre-release.csv.
2024-08-02 20:42:33,641 Writing binary info to monitor-winagent-Test_stress_B5477_windows-pre-release.csv.
2024-08-02 20:42:39,163 Writing binary info to monitor-winagent-Test_stress_B5477_windows-pre-release.csv.
2024-08-02 20:42:44,681 Writing binary info to monitor-winagent-Test_stress_B5477_windows-pre-release.csv.
2024-08-02 20:42:50,201 Writing binary info to monitor-winagent-Test_stress_B5477_windows-pre-release.csv.
2024-08-02 20:42:55,719 Writing binary info to monitor-winagent-Test_stress_B5477_windows-pre-release.csv.

Between 20:42:06 and 20:42:11: 5.52 seconds.
Between 20:42:11 and 20:42:17: 5.52 seconds.
Between 20:42:17 and 20:42:22: 5.52 seconds.
Between 20:42:22 and 20:42:28: 5.52 seconds.
Between 20:42:28 and 20:42:33: 5.52 seconds.
Between 20:42:33 and 20:42:39: 5.52 seconds.
Between 20:42:39 and 20:42:44: 5.52 seconds.
Between 20:42:44 and 20:42:50: 5.52 seconds.
Between 20:42:50 and 20:42:55: 5.52 seconds.


On the other hand, the error message appears from 2024-08-03 02:47:33,469 until 2024-08-04 20:35:23,256:

akim@akim-PC:~/Downloads/ossec_Test_stress_B5477_windows_2024-08-05$ cat monitor.log | grep 't access the file' | head 
2024-08-03 02:47:33,469 Couldn't access the file
2024-08-03 02:54:16,141 Couldn't access the file
2024-08-03 03:07:41,852 Couldn't access the file
2024-08-03 03:14:24,423 Couldn't access the file
2024-08-03 03:27:50,149 Couldn't access the file
2024-08-03 03:34:32,687 Couldn't access the file
2024-08-03 03:47:58,363 Couldn't access the file
2024-08-03 03:54:40,920 Couldn't access the file
2024-08-03 04:08:06,480 Couldn't access the file
2024-08-03 04:14:49,048 Couldn't access the file
akim@akim-PC:~/Downloads/ossec_Test_stress_B5477_windows_2024-08-05$ cat monitor.log | grep 't access the file' | tail
2024-08-04 18:41:15,402 Couldn't access the file
2024-08-04 18:47:57,986 Couldn't access the file
2024-08-04 19:01:23,636 Couldn't access the file
2024-08-04 19:14:49,312 Couldn't access the file
2024-08-04 19:28:14,972 Couldn't access the file
2024-08-04 19:41:40,600 Couldn't access the file
2024-08-04 19:55:06,284 Couldn't access the file
2024-08-04 20:08:31,968 Couldn't access the file
2024-08-04 20:21:57,648 Couldn't access the file
2024-08-04 20:35:23,256 Couldn't access the file
pro-akim commented 2 weeks ago

Reviewing ossec_Test_stress_B5523_windows_2024-08-25.zip

akim@akim-PC:~/Downloads/B5523_agent_windows/logs/ossec_Test_stress_B5523_windows_2024-08-25$ cat monitor.log | grep 't access the file' | head 
2024-08-23 15:03:22,260 Couldn't access the file
2024-08-23 15:10:04,865 Couldn't access the file
2024-08-23 15:30:13,230 Couldn't access the file
2024-08-23 15:36:55,770 Couldn't access the file
2024-08-23 15:43:38,374 Couldn't access the file
2024-08-23 15:57:04,017 Couldn't access the file
2024-08-23 16:03:46,620 Couldn't access the file
2024-08-23 16:10:29,195 Couldn't access the file
2024-08-23 16:23:54,801 Couldn't access the file
2024-08-23 16:30:37,404 Couldn't access the file
akim@akim-PC:~/Downloads/B5523_agent_windows/logs/ossec_Test_stress_B5523_windows_2024-08-25$ cat monitor.log | grep 't access the file' | tail
2024-08-25 07:34:27,147 Couldn't access the file
2024-08-25 07:41:09,725 Couldn't access the file
2024-08-25 07:47:52,313 Couldn't access the file
2024-08-25 08:01:17,986 Couldn't access the file
2024-08-25 08:08:00,580 Couldn't access the file
2024-08-25 08:21:26,287 Couldn't access the file
2024-08-25 08:28:08,903 Couldn't access the file
2024-08-25 08:34:51,486 Couldn't access the file
2024-08-25 08:48:17,115 Couldn't access the file
2024-08-25 08:54:59,705 Couldn't access the file

Intervals between the first messages:

Between 15:03:22 and 15:10:04: 6 minutes and 42 seconds.
Between 15:10:04 and 15:30:13: 20 minutes and 9 seconds.
Between 15:30:13 and 15:36:55: 6 minutes and 42 seconds.
Between 15:36:55 and 15:43:38: 6 minutes and 43 seconds.
Between 15:43:38 and 15:57:04: 13 minutes and 26 seconds.
Between 15:57:04 and 16:03:46: 6 minutes and 42 seconds.
Between 16:03:46 and 16:10:29: 6 minutes and 43 seconds.
Between 16:10:29 and 16:23:54: 13 minutes and 25 seconds.
Between 16:23:54 and 16:30:37: 6 minutes and 43 seconds.

pro-akim commented 2 weeks ago

The question I now ask myself is: why can monitor.py access the file at some times but not at others?

  1. What is happening in the agent at those intervals of 6, 13 and 20 minutes?
  2. Is there any change in the access permissions of the wazuh-agent.state file?

Apparently some action happens, irregularly, at multiples of about 6 minutes, and it determines whether the following file can be read: 'C:\\Program Files (x86)\\ossec-agent\\wazuh-agent.state'
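A quick arithmetic check supports the idea that the longer gaps are multiples of the shortest one. This is only a sketch: `BASE` is taken from the shortest gap observed in the logs above (6 minutes and 43 seconds, i.e. 403 seconds), and the helper name `cycles` is hypothetical.

```python
BASE = 403  # seconds; shortest observed gap (6 min 43 s), assumed to be one cycle


def cycles(gap_seconds, base=BASE):
    """Return (nearest whole number of cycles, leftover seconds) for one gap."""
    n = round(gap_seconds / base)
    return n, gap_seconds - n * base


# Gaps taken from the intervals listed in the earlier comments.
for label, gap in [("6:42", 402), ("13:26", 806), ("20:09", 1209)]:
    n, leftover = cycles(gap)
    print(f"{label} -> {n} cycle(s), off by {leftover} s")
```

The 13-minute and 20-minute gaps come out as two and three cycles respectively, with at most a second of drift, which would be consistent with the error showing up on some cycles and not on others.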

pro-akim commented 2 weeks ago

On the other hand, the error appears 248 times:

akim@akim-PC:~/Downloads/OLD/ossec_Test_stress_B5477_windows_2024-08-05$ cat monitor.log | grep 't access the file' | wc -l 
248

The number of pending states drops from 1756 to 11 between the two builds. The 248 failed reads of the wazuh-agent.state file do not seem enough to explain the absence of 1745 (1756 - 11) pending states.

➜  egrep "pending" Test_stress_B5394_windows_agentd_state.csv | wc -l  # 4.8.0-rc2
1756
➜  egrep "pending" Test_stress_B5477_windows_agentd_state.csv | wc -l # 4.9.0-beta1
11
pro-akim commented 2 weeks ago

After reviewing the comment https://github.com/wazuh/wazuh/issues/25092#issuecomment-2278923139

This suggests that the 6/13/20-minute marks are moments where the agent restart coincides with the monitor's check of the file.


Checking in vuln.log what was being done at the time of some of the error messages:

2024-08-03 02:47:33,469 Couldn't access the file

[2024-08-03_02:47:33] [INFO] (register_and_restart): Sleep of 20 seconds before reconnection ended.
[2024-08-03_02:47:33] [INFO] (register_and_restart): Agent restart output:
[2024-08-03_02:47:33] [INFO] (register_and_restart): Agent restarted.
[2024-08-03_02:47:33] [INFO] (register_and_restart): Sleep of 60 seconds started.

2024-08-03 02:54:16,141 Couldn't access the file

[2024-08-03_02:54:16] [INFO] (register_and_restart): Agent restart output:
[2024-08-03_02:54:16] [INFO] (register_and_restart): Agent restarted.
[2024-08-03_02:54:16] [INFO] (register_and_restart): Sleep of 60 seconds started.

2024-08-03 03:07:41,852 Couldn't access the file

[2024-08-03_03:07:41] [INFO] (register_and_restart): Sleep of 20 seconds before reconnection ended.
[2024-08-03_03:07:41] [INFO] (register_and_restart): Agent restart output:
[2024-08-03_03:07:41] [INFO] (register_and_restart): Agent restarted.
[2024-08-03_03:07:41] [INFO] (register_and_restart): Sleep of 60 seconds started.

2024-08-03 03:14:24,423 Couldn't access the file

[2024-08-03_03:14:24] [INFO] (register_and_restart): Sleep of 20 seconds before reconnection ended.
[2024-08-03_03:14:24] [INFO] (register_and_restart): Agent restart output:
[2024-08-03_03:14:24] [INFO] (register_and_restart): Agent restarted.
[2024-08-03_03:14:24] [INFO] (register_and_restart): Sleep of 60 seconds started.

2024-08-03 03:27:50,149 Couldn't access the file

[2024-08-03_03:27:50] [INFO] (register_and_restart): Agent restart output:
[2024-08-03_03:27:50] [INFO] (register_and_restart): Agent restarted.
[2024-08-03_03:27:50] [INFO] (register_and_restart): Sleep of 60 seconds started.
[2024-08-03_03:28:50] [INFO] (register_and_restart): Sleep of 60 seconds ended.
[2024-08-03_03:28:50] [INFO] (register_and_restart): Remaining test time 148380.
[2024-08-03_03:28:50] [INFO] (register_and_restart): Agent auth output: 2024/08/03 03:28:50 agent-auth: INFO: Started (pid: 5704).

2024-08-03 03:34:32,687 Couldn't access the file

[2024-08-03_03:34:32] [INFO] (register_and_restart): Sleep of 20 seconds before reconnection ended.
[2024-08-03_03:34:32] [INFO] (register_and_restart): Agent restart output:
[2024-08-03_03:34:32] [INFO] (register_and_restart): Agent restarted.
[2024-08-03_03:34:32] [INFO] (register_and_restart): Sleep of 60 seconds started.

2024-08-04 19:14:49,312 Couldn't access the file

[2024-08-04_19:14:49] [INFO] (register_and_restart): Sleep of 20 seconds before reconnection ended.
[2024-08-04_19:14:49] [INFO] (register_and_restart): Agent restart output:
[2024-08-04_19:14:49] [INFO] (register_and_restart): Agent restarted.
[2024-08-04_19:14:49] [INFO] (register_and_restart): Sleep of 60 seconds started.

The messages clearly appear while the agent is being restarted.
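The manual cross-check against vuln.log can be automated for all 248 errors. The sketch below is an illustration only: the function names are hypothetical, and the line formats are assumed to be exactly those shown in the excerpts (`YYYY-MM-DD HH:MM:SS,mmm` in monitor.log and `[YYYY-MM-DD_HH:MM:SS] [INFO] (register_and_restart): Agent restarted.` in vuln.log). It matches at one-second resolution.

```python
import re

# Matches one "Agent restarted." entry and captures its date and time.
RESTART = re.compile(
    r"\[(\d{4}-\d{2}-\d{2})_(\d{2}:\d{2}:\d{2})\] \[INFO\] "
    r"\(register_and_restart\): Agent restarted\."
)


def restart_seconds(vuln_text):
    """Return the set of 'YYYY-MM-DD HH:MM:SS' stamps at which a restart was logged."""
    return {f"{day} {clock}" for day, clock in RESTART.findall(vuln_text)}


def errors_during_restart(monitor_lines, vuln_text, marker="Couldn't access the file"):
    """Pair each error timestamp with True if a restart was logged in the same second."""
    restarts = restart_seconds(vuln_text)
    return [(line[:19], line[:19] in restarts) for line in monitor_lines if marker in line]


vuln_sample = (
    "[2024-08-03_02:47:33] [INFO] (register_and_restart): Agent restart output:\n"
    "[2024-08-03_02:47:33] [INFO] (register_and_restart): Agent restarted.\n"
    "[2024-08-03_02:47:33] [INFO] (register_and_restart): Sleep of 60 seconds started.\n"
)
monitor_sample = [
    "2024-08-03 02:47:33,469 Couldn't access the file",
    "2024-08-03 02:47:38,000 Writing binary info to example.csv.",
]

print(errors_during_restart(monitor_sample, vuln_sample))
```

Any error paired with `False` would be a failed read that did not coincide with a restart and would deserve a separate look.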


When reading the state file fails, as in:

2024-08-03 02:47:33,469 Couldn't access the file

no record is written to the CSV for that sample (the effect is the same as skipping that record):

4.9.0,pre-release,2024-08-03 02:47:22,connected,2024-08-03 02:47:13,2024-08-03 02:47:13,293,149,0
4.9.0,pre-release,2024-08-03 02:47:27,connected,2024-08-03 02:47:13,2024-08-03 02:47:13,293,149,0
4.9.0,pre-release,2024-08-03 02:47:38,connected,2024-08-03 02:47:33,2024-08-03 02:47:33,0,3,0
4.9.0,pre-release,2024-08-03 02:47:44,connected,2024-08-03 02:47:43,2024-08-03 02:47:43,292,143,0
pro-akim commented 2 weeks ago

Deepening analysis

Manual testing

In 4.8.2: [image: 4 8 0]

In 4.9.0: [image: 4 9 0]

pro-akim commented 2 weeks ago

Statistics

4.9.0

akim@akim-PC:~/Downloads/ossec_Test_stress_B5477_windows_2024-08-05$ cat monitor.log | grep 't access the file' | wc -l
248
akim@akim-PC:~/Downloads/OLD$ egrep "pending" Test_stress_B5477_windows_agentd_state.csv | wc -l
11
akim@akim-PC:~/Downloads/OLD$ egrep "connected" Test_stress_B5477_windows_agentd_state.csv | wc -l
38876
akim@akim-PC:~/Downloads/OLD$ cat Test_stress_B5477_windows_agentd_state.csv | wc -l
38888

4.8.1

akim@akim-PC:~/Downloads/ossec_Test_stress_B5394_windows_2024-07-08$ cat monitor.log | grep 't access the file' | wc -l
0
akim@akim-PC:~/Downloads/OLD$ egrep "pending" Test_stress_B5394_windows_agentd_state.csv | wc -l
1756
akim@akim-PC:~/Downloads/OLD$ egrep "connected" Test_stress_B5394_windows_agentd_state.csv | wc -l
37366
akim@akim-PC:~/Downloads/OLD$ cat Test_stress_B5394_windows_agentd_state.csv |wc -l
39123

Even if all 248 failed reads had been pending states, the total would still be far below the number of pending states seen previously.
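The counts above can also be obtained per state in one pass. This is a minimal sketch under one assumption: the state is the fourth comma-separated field of each row, as in the sample rows shown earlier (`4.9.0,pre-release,2024-08-03 02:47:22,connected,...`). The function name `count_states` is hypothetical. Unlike `egrep`, it only looks at that one column.

```python
import csv
from collections import Counter


def count_states(csv_lines, state_column=3):
    """Count how many rows carry each value in the (assumed) state column."""
    counts = Counter()
    for row in csv.reader(csv_lines):
        if len(row) > state_column:
            counts[row[state_column]] += 1
    return counts


# Rows copied from the B5477 agentd_state CSV excerpt.
sample = [
    "4.9.0,pre-release,2024-08-03 02:47:22,connected,2024-08-03 02:47:13,2024-08-03 02:47:13,293,149,0",
    "4.9.0,pre-release,2024-08-03 02:47:27,connected,2024-08-03 02:47:13,2024-08-03 02:47:13,293,149,0",
    "4.9.0,pre-release,2024-08-03 02:47:38,connected,2024-08-03 02:47:33,2024-08-03 02:47:33,0,3,0",
    "4.9.0,pre-release,2024-08-03 02:47:44,connected,2024-08-03 02:47:43,2024-08-03 02:47:43,292,143,0",
]
counts = count_states(sample)
print(counts["connected"], counts["pending"])

# Worst case for 4.9.0: every failed read counted as a missed pending row.
worst_case_pending = 11 + 248
print(worst_case_pending, "vs", 1756)
```

With the real files, `count_states(open("Test_stress_B5477_windows_agentd_state.csv"))` should reproduce the figures above, with the header row counted under its own column name.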

The same results were found here: https://github.com/wazuh/wazuh/issues/25092#issuecomment-2311253523. The number of pending states detected is lower than before.


Adding a skip for the case where the file is missing will therefore not produce relevant changes in the graph.

I believe that the persistence of the pending state changed between 4.8.1 and 4.9.0. This does not seem to be related to the failed reads caused by the missing agent.state file, but to a shorter transition time between pending and connected.

In 4.8.1, the state file keeps the connected state while the agent is stopped. When the agent is restarted, the state changes to pending and then to connected.

In 4.9.0, the state file disappears when the agent is stopped. (This implies that connected and non-pending samples may be lost.) When the agent is restarted, the file reappears with a pending state, which then changes to connected.

pro-akim commented 2 weeks ago

Testing

Running a test in which, if the file is absent, both the read and the CSV record are skipped:

https://ci.wazuh.info/job/Test_stress/5545/ B5545_agent_windows.tar.gz

akim@akim-PC:~/Downloads$ cat B5545_agent_windows/data/Test_stress_B5545_windows_agentd_state.csv | grep connected | wc -l
1285
akim@akim-PC:~/Downloads$ cat B5545_agent_windows/data/Test_stress_B5545_windows_agentd_state.csv | grep pending | wc -l
1
akim@akim-PC:~/Downloads$ cat B5545_agent_windows/data/Test_stress_B5545_windows_agentd_state.csv | wc -l
1287

akim@akim-PC:~/Downloads$ cat B5545_agent_windows/logs/ossec_Test_stress_B5545_windows_2024-08-28/monitor.log | grep Skipping
2024-08-28 16:02:58,826 File C:\Program Files (x86)\ossec-agent\wazuh-agent.state does not exist. Skipping...
2024-08-28 16:16:24,161 File C:\Program Files (x86)\ossec-agent\wazuh-agent.state does not exist. Skipping...
2024-08-28 16:32:30,431 File C:\Program Files (x86)\ossec-agent\wazuh-agent.state does not exist. Skipping...
2024-08-28 16:39:13,023 File C:\Program Files (x86)\ossec-agent\wazuh-agent.state does not exist. Skipping...
2024-08-28 16:52:38,838 File C:\Program Files (x86)\ossec-agent\wazuh-agent.state does not exist. Skipping...
2024-08-28 17:08:45,116 File C:\Program Files (x86)\ossec-agent\wazuh-agent.state does not exist. Skipping...

Example of the message in monitor.log

2024-08-28 16:16:24,161 File C:\Program Files (x86)\ossec-agent\wazuh-agent.state does not exist. Skipping...
2024-08-28 16:16:24,161 File C:\Program Files (x86)\ossec-agent\wazuh-agent.state does not exist. There is not information to be added to csv file.

No record is written around 2024-08-28 16:16:24:

[image: Test_stress_B5545_windows_agentd_state_AgentD_Status]
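The skip behaviour exercised in this run can be sketched as follows. This is not the monitor.py change itself: the function name `read_state_or_skip` and the logger name are hypothetical, and only the log messages are borrowed from the output above. A missing file is reported and skipped, and any other access error (for example the file being held by the agent during a restart) is treated the same way.

```python
import logging
from pathlib import Path

logger = logging.getLogger("monitor_sketch")  # hypothetical logger name


def read_state_or_skip(path):
    """Return the state file's text, or None when it is absent or unreadable."""
    state_file = Path(path)
    try:
        return state_file.read_text()
    except FileNotFoundError:
        logger.warning("File %s does not exist. Skipping...", state_file)
    except OSError:
        # Covers PermissionError and similar failures while the file is in use.
        logger.warning("Couldn't access the file")
    return None
```

A caller would write a CSV row only when the return value is not `None`, which is what leaves the gap seen in the plot.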


Test where there is a dynamic wait:

https://ci.wazuh.info/job/Test_stress/5548/ B5548_agent_windows.tar.gz

akim@akim-PC:~/Downloads$ cat B5548_agent_windows/data/Test_stress_B5548_windows_agentd_state.csv | grep connected | wc -l
1288
akim@akim-PC:~/Downloads$ cat B5548_agent_windows/data/Test_stress_B5548_windows_agentd_state.csv | grep pending | wc -l
1
akim@akim-PC:~/Downloads$ cat B5548_agent_windows/data/Test_stress_B5548_windows_agentd_state.csv | wc -l
1290
akim@akim-PC:~/Downloads$ cat B5548_agent_windows/logs/ossec_Test_stress_B5548_windows_2024-08-29/monitor.log | grep attemp
2024-08-29 00:04:01,957 File C:\Program Files (x86)\ossec-agent\wazuh-agent.state does not exist. Waiting attemp:1, 2 seconds.
2024-08-29 00:16:06,755 File C:\Program Files (x86)\ossec-agent\wazuh-agent.state does not exist. Waiting attemp:1, 2 seconds.
2024-08-29 00:28:11,783 File C:\Program Files (x86)\ossec-agent\wazuh-agent.state does not exist. Waiting attemp:1, 2 seconds.
2024-08-29 00:33:33,558 File C:\Program Files (x86)\ossec-agent\wazuh-agent.state does not exist. Waiting attemp:1, 2 seconds.
2024-08-29 00:59:04,139 File C:\Program Files (x86)\ossec-agent\wazuh-agent.state does not exist. Waiting attemp:1, 2 seconds.
2024-08-29 01:11:08,989 File C:\Program Files (x86)\ossec-agent\wazuh-agent.state does not exist. Waiting attemp:1, 2 seconds.

Evidence in monitor.log (note that the first message should not say "Skipping..."):

2024-08-29 00:03:59,944 Collecting resources usage of wazuh-agent.exe.
2024-08-29 00:03:59,944 File C:\Program Files (x86)\ossec-agent\wazuh-agent.state does not exist. Skipping...
2024-08-29 00:04:01,957 File C:\Program Files (x86)\ossec-agent\wazuh-agent.state does not exist. Waiting attemp:1, 2 seconds.
2024-08-29 00:04:01,957 Getting statistics data from C:\Program Files (x86)\ossec-agent\wazuh-agent.state
2024-08-29 00:04:01,957 Writing agentd.state info to Test_stress_B5548_windows_agentd_state.csv.

The record is written after the configured wait:

[image: Test_stress_B5548_windows_agentd_state_AgentD_Status]
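The dynamic wait used in this run can be sketched as a bounded polling loop. The names and defaults here are assumptions: `wait_for_file`, the `attempts` limit and the injectable `sleep` parameter are hypothetical, and the source only shows the per-attempt delay (2 seconds in this build) and the attempt counter in the log message.

```python
import time
from pathlib import Path


def wait_for_file(path, attempts=3, delay=2.0, sleep=time.sleep):
    """Poll until path exists; return True if it appeared, False otherwise."""
    target = Path(path)
    for attempt in range(1, attempts + 1):
        if target.exists():
            return True
        print(f"File {target} does not exist. Waiting attempt:{attempt}, {delay} seconds.")
        sleep(delay)
    # One last check after the final wait.
    return target.exists()
```

Passing `sleep` as a parameter keeps the loop easy to exercise without real delays. In the runs above a single attempt was always enough, which suggests the file reappears within a couple of seconds of the restart.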


Test where there is a dynamic wait + fix of the logs + waiting time reduction to 1 second per cycle + more agent types:

https://ci.wazuh.info/job/Test_stress/5549/

akim@akim-PC:~/Downloads$ cat B5549_agent_windows/data/Test_stress_B5549_windows_agentd_state.csv | grep connected | wc -l
1291
akim@akim-PC:~/Downloads$ cat B5549_agent_windows/data/Test_stress_B5549_windows_agentd_state.csv | grep pending | wc -l
1
akim@akim-PC:~/Downloads$ cat B5549_agent_windows/data/Test_stress_B5549_windows_agentd_state.csv | wc -l
1293
akim@akim-PC:~/Downloads$ cat B5549_agent_windows/logs/ossec_Test_stress_B5549_windows_2024-08-29/monitor.log | grep Waiting
2024-08-29 08:08:55,896 File C:\Program Files (x86)\ossec-agent\wazuh-agent.state does not exist. Waiting attemp:1, 1 seconds.
2024-08-29 08:18:19,684 File C:\Program Files (x86)\ossec-agent\wazuh-agent.state does not exist. Waiting attemp:1, 1 seconds.
2024-08-29 08:27:43,388 File C:\Program Files (x86)\ossec-agent\wazuh-agent.state does not exist. Waiting attemp:1, 1 seconds.
2024-08-29 08:30:24,018 File C:\Program Files (x86)\ossec-agent\wazuh-agent.state does not exist. Waiting attemp:1, 1 seconds.
2024-08-29 08:46:30,757 File C:\Program Files (x86)\ossec-agent\wazuh-agent.state does not exist. Waiting attemp:1, 1 seconds.
2024-08-29 08:55:54,520 File C:\Program Files (x86)\ossec-agent\wazuh-agent.state does not exist. Waiting attemp:1, 1 seconds.
2024-08-29 09:05:18,239 File C:\Program Files (x86)\ossec-agent\wazuh-agent.state does not exist. Waiting attemp:1, 1 seconds.
2024-08-29 09:14:42,001 File C:\Program Files (x86)\ossec-agent\wazuh-agent.state does not exist. Waiting attemp:1, 1 seconds.
akim@akim-PC:~/Downloads$ cat B5549_agent_windows/logs/ossec_Test_stress_B5549_windows_2024-08-29/monitor.log | grep Waiting | wc -l
8

Evidence in monitor.log

2024-08-29 08:18:18,681 Collecting resources usage of wazuh-agent.exe.
2024-08-29 08:18:18,681 File C:\Program Files (x86)\ossec-agent\wazuh-agent.state does not exist.
2024-08-29 08:18:19,684 File C:\Program Files (x86)\ossec-agent\wazuh-agent.state does not exist. Waiting attemp:1, 1 seconds.

The waiting time is practically imperceptible:

[image: Test_stress_B5549_windows_agentd_state_AgentD_Status]

pro-akim commented 2 weeks ago

Testing in AWS

Windows Server 2019 Datacenter 1809 (Build 17763.1999) on a c5a.2xlarge instance (compute optimized family, 8 vCPUs, 16.0 GiB of memory).

4.9.0 Build: https://ci.wazuh.info/job/Test_stress/5554/ The behaviour is different: almost no pending state is shown on the screen (monitoring time 1 s).

[image: 4 9 0performance]

4.8.1 Build: https://ci.wazuh.info/job/Test_stress/5559/

[image: 4 8 1performance]

pro-akim commented 2 weeks ago

Testing Windows 2019 Vagrant

Windows Server 2019 Datacenter Evaluation 1809 (Build 17763.1935), 4 GB, 2 cores

4.8.1: [image: 4 8 1performancevagrant]

4.9.0: [image: 4 9 0performancevagrant]

pro-akim commented 2 weeks ago

Testing Wazuh 4.9.0 + Wazuh-jenkins 4.8.1:

Build: https://ci.wazuh.info/job/Test_stress/5558/

[image]

pro-akim commented 2 weeks ago

Testing Windows 2019 Vagrant (same performance as AWS)

Windows Server 2019 Datacenter Evaluation 1809 (Build 17763.1935), 16 GB, 8 cores

4.8.1: [image: 4 8 1performancevagrant+]

4.9.0: [image: 4 9 0performancevagrant+]

pro-akim commented 2 weeks ago

Testing in Windows 2012 AWS

https://ci.wazuh.info/job/Test_stress/5563/ on a c5a.2xlarge instance (compute optimized family, 8 vCPUs, 16.0 GiB of memory).

[image: win2012]

pro-akim commented 2 weeks ago

Conclusion

After performing several tests on different versions of Windows and different versions of Wazuh, we can conclude that:

  1. The different results are not caused by a change in the tests: launching the Wazuh 4.9.0 tests with wazuh-jenkins 4.8.1 reproduced the same result.
  2. The tests on Vagrant Windows-Desktop-10 and Windows 2019 machines showed the "expected" behavior, with the pending state appearing.
  3. On AWS, Windows 2019 behaved differently from 4.8.1, while Windows 2012 showed a behavior closer to the "expected" one.
  4. The change is due to some performance change in the agent, which either passes through the pending state extremely quickly or does not pass through it at all.

It may be that, since the state file is deleted and must be regenerated, there is a timing variation that keeps the pending state from being exposed long enough to be captured by the monitor on Windows 2019 on AWS.

The tests can be adapted to any circumstance, but I believe there should be an analysis of why, on certain operating systems, this transition is practically non-existent.

The following issue is opened for analysis.

These variations in the plots should be considered foreseeable stages.

fcaffieri commented 2 weeks ago

We need to analyze why this difference in behavior occurs on Windows 2019 and not on the other Windows versions, and whether the changes made have a different impact on these systems, since the tests show it is specific to Windows 2019. Once the agent and the modifications made have been analyzed with a focus on that particular OS, the conclusions will tell us whether the implementation of the tests needs to be modified. For this reason, the mentioned issue has been created.

LGTM