Closed nico-stefani closed 1 week ago
Analyzing the artifacts, we detected a performance degradation in wazuh-db after a cluster restart.
We assume authd could not fully empty client.keys after the restart, so when an agent DELETE was invoked, the agent-sync tasks that followed the event were significantly bigger.
For example:
2024/02/22 19:09:42 DEBUG: [Worker CLUSTER-Workload_benchmarks_metrics_B448_manager_25] [Agent-info sync] 32/32 chunks updated in wazuh-db in 0.085s.
2024/02/22 19:11:46 INFO: wazuh 172.31.59.143 "DELETE /agents" with parameters {"agents_list": "all", "status": "all", "older_than": "0s"} and body {} done in 6.961s: 200
2024/05/03 19:31:38 DEBUG: [Worker CLUSTER-Workload_benchmarks_metrics_B504_manager_2] [Agent-info sync] 24/24 chunks updated in wazuh-db in 2.258s.
The analyzed logs show that, due to some internal behavior of wazuh-db, the Agent-info sync took less than a second before a large modification such as the deletion of the 50k agents from the test. After that deletion it began to take longer than expected, which increased the duration of the task and degraded the cluster's performance.
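To make the comparison easier to repeat on other artifacts, here is a minimal Python sketch that extracts the Agent-info sync duration from log lines and normalizes it per chunk. It is not part of Wazuh: the regular expression is inferred only from the two DEBUG lines quoted above, and the names `SYNC_RE`, `parse_sync_line` and `seconds_per_chunk` are hypothetical helpers introduced for illustration.

```python
import re
from datetime import datetime

# Pattern inferred from the DEBUG lines quoted in this issue (an assumption,
# not an official log format specification).
SYNC_RE = re.compile(
    r"^(?P<ts>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) DEBUG: "
    r"\[Worker (?P<worker>[^\]]+)\] \[Agent-info sync\] "
    r"(?P<done>\d+)/(?P<total>\d+) chunks updated in wazuh-db in "
    r"(?P<secs>\d+(?:\.\d+)?)s\.$"
)


def parse_sync_line(line):
    """Return a dict for an Agent-info sync line, or None for any other line."""
    m = SYNC_RE.match(line.strip())
    if m is None:
        return None
    return {
        "timestamp": datetime.strptime(m["ts"], "%Y/%m/%d %H:%M:%S"),
        "worker": m["worker"],
        "chunks": int(m["total"]),
        "seconds": float(m["secs"]),
    }


def seconds_per_chunk(entry):
    """Normalize the sync duration by the number of chunks in the task."""
    return entry["seconds"] / entry["chunks"]


# The three log lines quoted above, copied verbatim.
lines = [
    '2024/02/22 19:09:42 DEBUG: [Worker CLUSTER-Workload_benchmarks_metrics_B448_manager_25] '
    '[Agent-info sync] 32/32 chunks updated in wazuh-db in 0.085s.',
    '2024/02/22 19:11:46 INFO: wazuh 172.31.59.143 "DELETE /agents" with parameters '
    '{"agents_list": "all", "status": "all", "older_than": "0s"} and body {} done in 6.961s: 200',
    '2024/05/03 19:31:38 DEBUG: [Worker CLUSTER-Workload_benchmarks_metrics_B504_manager_2] '
    '[Agent-info sync] 24/24 chunks updated in wazuh-db in 2.258s.',
]

# Non-matching lines (such as the API DELETE entry) are skipped.
entries = [e for e in map(parse_sync_line, lines) if e is not None]

for e in entries:
    print(
        f"{e['timestamp']} {e['worker']}: {e['seconds']:.3f}s "
        f"for {e['chunks']} chunks ({seconds_per_chunk(e) * 1000:.1f} ms/chunk)"
    )
```

The per-chunk normalization is used because the two quoted tasks have different chunk counts (32 and 24), so raw durations alone are not directly comparable.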
LGTM
Description
During #23268 we detected that the thresholds of test_cluster_performance were exceeded.
test_cluster_performance.zip
We need to investigate the root cause of this problem before continuing to the next RC.
Checks
The following elements have been updated or reviewed (should also be checked if no modification is required):
api/test/integration/mapping/_test_mapping.py