wazuh / wazuh

Wazuh - The Open Source Security Platform. Unified XDR and SIEM protection for endpoints and cloud workloads.
https://wazuh.com/
Other
9.34k stars 1.48k forks source link

Investigate error in `Agent-info sync` task #23333

Closed nico-stefani closed 5 days ago

nico-stefani commented 1 week ago

Description

During #23268 we detect a recurrent error in the worker node logs:

2024/05/03 19:35:59 ERROR: [Worker CLUSTER-Workload_benchmarks_metrics_B504_manager_8] [Agent-info sync] Finished in 7.679s. Updated 3 chunks. There were 6 chunks with errors: ['Error 2003 - Error in wazuhdb request: Cannot execute Global database query; FOREIGN KEY constraint failed', 'Error 2003 - Error in wazuhdb request: Cannot execute Global database query; FOREIGN KEY constraint failed', 'Error 2003 - Error in wazuhdb request: Cannot execute Global database query; FOREIGN KEY constraint failed', 'Error 2003 - Error in wazuhdb request: Cannot execute Global database query; FOREIGN KEY constraint failed', 'Error 2003 - Error in wazuhdb request: Cannot execute Global database query; FOREIGN KEY constraint failed', 'Error 2003 - Error in wazuhdb request: Cannot execute Global database query; FOREIGN KEY constraint failed']

It is needed to investigate the root cause of the problem before continuing to the next RC.

Checks

The following elements have been updated or reviewed (should also be checked if no modification is required):

fdalmaup commented 1 week ago

Issue Update

The error was reproduced in a v4.8.0-rc1 environment with 2 Wazuh manager nodes and an agent when the latter is deleted via the DELETE /agents request:

$ curl -k -X DELETE "https://localhost:55555/agents?agents_list=001&status=all&older_than=0s" -H "Authorization: Bearer $TOKEN" | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   168  100   168    0     0   5090      0 --:--:-- --:--:-- --:--:--  5090
{
  "data": {
    "affected_items": [
      "001"
    ],
    "total_affected_items": 1,
    "total_failed_items": 0,
    "failed_items": []
  },
  "message": "All selected agents were deleted",
  "error": 0
}

api.log

2024/05/08 16:09:22 INFO: wazuh-wui 172.26.0.1 "DELETE /agents" with parameters {"agents_list": "002", "status": "all", "older_than": "0s"} and body {} done in 0.010s: 200

When the cluster is in the process of completing the Agent-info sync task, the error arises:

Master cluster.log

2024/05/08 16:09:28 DEBUG: [Worker f8a7c0e28eb8] [Main] Command received: b'syn_a_w_m_p'
2024/05/08 16:09:28 DEBUG: [Worker f8a7c0e28eb8] [Main] Command received: b'new_str'
2024/05/08 16:09:28 DEBUG: [Worker f8a7c0e28eb8] [Main] Command received: b'str_upd'
2024/05/08 16:09:28 DEBUG: [Worker f8a7c0e28eb8] [Main] Command received: b'syn_a_w_m'
2024/05/08 16:09:28 INFO: [Worker f8a7c0e28eb8] [Agent-info sync] Starting.
2024/05/08 16:09:28 DEBUG2: [Worker f8a7c0e28eb8] [Agent-info sync] Chunk 1/1: [{"id":2,"name":"f03f6fab78d9","ip":"172.26.0.5","os_name":"Ubuntu","os_version":"20.04.6 LTS","os_major":"20","os_minor":"04","os_codename":"Focal Fossa","os_platform":"ubuntu","os_uname":"Linux |f03f6fab78d9 |5.15.0-105-generic |#115~20.04.1-Ubuntu SMP Mon Apr 15 17:33:04 UTC 2024 |x86_64","os_arch":"x86_64","version":"Wazuh v4.8.0","config_sum":"ab73af41699f13fdd81903b5f23d8d00","merged_sum":"4a8724b20dee0124ff9656783c490c4e","manager_host":"f8a7c0e28eb8","node_name":"f8a7c0e28eb8","last_keepalive":1715177363,"connection_status":"active","disconnection_time":0,"group_config_status":"synced","status_code":0,"labels":[{"id":2,"key":"#\"_agent_ip\"","value":"172.26.0.5"},{"id":2,"key":"#\"_manager_hostname\"","value":"f8a7c0e28eb8"},{"id":2,"key":"#\"_node_name\"","value":"f8a7c0e28eb8"},{"id":2,"key":"#\"_wazuh_version\"","value":"Wazuh v4.8.0"}]}]
2024/05/08 16:09:28 ERROR: [Worker f8a7c0e28eb8] [Agent-info sync] Wazuh-db response for chunk 1/1 was not "ok": Error 2003 - Error in wazuhdb request: Cannot execute Global database query; FOREIGN KEY constraint failed
2024/05/08 16:09:28 DEBUG: [Worker f8a7c0e28eb8] [Agent-info sync] 0/1 chunks updated in wazuh-db in 0.001s.
2024/05/08 16:09:28 INFO: [Worker f8a7c0e28eb8] [Agent-info sync] Finished in 0.006s. Updated 0 chunks.

Given that the test environment has the debug mode enabled, the ossec.log file provides further information regarding what failed:

2024/05/08 16:09:28 wazuh-db[142] wdb_parser.c:859 at wdb_parse(): DEBUG: Global query: sync-agent-groups-get {"condition": "sync_status", "set_synced": true, "get_global_hash": true}
2024/05/08 16:09:28 wazuh-db[142] wdb_integrity.c:759 at wdb_get_global_group_hash(): DEBUG: No group hash was found to calculate the global group hash.
2024/05/08 16:09:28 wazuh-db[142] main.c:335 at run_dealer(): DEBUG: New client connected (19).
2024/05/08 16:09:28 wazuh-db[142] wdb_parser.c:859 at wdb_parse(): DEBUG: Global query: sync-agent-info-set [{"id":2,"name":"f03f6fab78d9","ip":"172.26.0.5","os_name":"Ubuntu","os_version":"20.04.6 LTS","os_major":"20","os_minor":"04","os_codename":"Focal Fossa","os_platform":"ubuntu","os_uname":"Linux |f03f6fab78d9 |5.15.0-105-generic |#115~20.04.1-Ubuntu SMP Mon Apr 15 17:33:04 UTC 2024 |x86_64","os_arch":"x86_64","version":"Wazuh v4.8.0","config_sum":"ab73af41699f13fdd81903b5f23d8d00","merged_sum":"4a8724b20dee0124ff9656783c490c4e","manager_host":"f8a7c0e28eb8","node_name":"f8a7c0e28eb8","last_keepalive":1715177363,"connection_status":"active","disconnection_time":0,"group_config_status":"synced","status_code":0,"labels":[{"id":2,"key":"#\"_agent_ip\"","value":"172.26.0.5"},{"id":2,"key":"#\"_manager_hostname\"","value":"f8a7c0e28eb8"},{"id":2,"key":"#\"_node_name\"","value":"f8a7c0e28eb8"},{"id":2,"key":"#\"_wazuh_version\"","value":"Wazuh v4.8.0"}]}]
2024/05/08 16:09:28 wazuh-db[142] wdb.c:1112 at wdb_exec_stmt_silent(): DEBUG: SQL statement execution failed
2024/05/08 16:09:28 wazuh-db[142] wdb_parser.c:5997 at wdb_parse_global_sync_agent_info_set(): DEBUG: Global DB Cannot execute SQL query; err database queue/db/global.db: FOREIGN KEY constraint failed

Comparing the differences between the previous stage (v4.8.0-beta6) and the one having the error (v4.8.0-rc1), the following was found:

diff --git a/src/wazuh_db/schema_global.sql b/src/wazuh_db/schema_global.sql
index 75fcfb7bb8..4fc5331eca 100644
--- a/src/wazuh_db/schema_global.sql
+++ b/src/wazuh_db/schema_global.sql
@@ -49,7 +49,7 @@ CREATE INDEX IF NOT EXISTS agent_group_hash ON agent (group_hash);
 INSERT INTO agent (id, ip, register_ip, name, date_add, last_keepalive, `group`, connection_status) VALUES (0, '127.0.0.1', '127.0.0.1', 'localhost', strftime('%s','now'), 253402300799, NULL, 'active');

 CREATE TABLE IF NOT EXISTS labels (
-    id INTEGER,
+    id INTEGER REFERENCES agent (id) ON DELETE CASCADE,
     key TEXT NOT NULL,
     value TEXT NOT NULL,
     PRIMARY KEY (id,key)
@@ -84,4 +84,4 @@ CREATE TABLE IF NOT EXISTS metadata (
     value TEXT
 );

-INSERT INTO metadata (key, value) VALUES ('db_version', '5');
+INSERT INTO metadata (key, value) VALUES ('db_version', '6');

This change might generate the error when the agent is deleted and consequently, the associated labels from the table.

The DELETE /agents request does not directly delete the agents from the database, instead, it sends a command to the authd process to make the deletion.

Conclusion

The error does not originate in the cluster itself, it might be related to the new constraint added to the labels table and the id field. The corresponding issue #23349 was opened for the team in charge to analyze what could be the problem.

GGP1 commented 1 week ago

Review

The error was reproduced and the root of the error identified. The failure was caused by a change in the database schema owned by the Core team and whose access from the framework is through wazuh-authd.

Opening an issue so that the corresponding team can solve it is the correct thing to do. LGTM!

Selutario commented 1 week ago

Good catch @fdalmaup. This change was included in v4.7.4 and then merged to 4.8.0.

davidjiglesias commented 1 week ago

Based on the conclusions from https://github.com/wazuh/wazuh/issues/23349#top, we should add a mechanism to ignore this specific exception from the db.

fdalmaup commented 1 week ago

Issue Update

The mechanism to ignore the exception in the send_data_to_wdb method was applied. When forcing the error, the ossec.log file reports it:

2024/05/10 12:39:18 wazuh-db[147] wdb_parser.c:859 at wdb_parse(): DEBUG: Global query: sync-agent-info-set [{"id":1,"name":"f77e92dad2f2","ip":"172.18.0.5","os_name":"Ubuntu","os_version":"20.04.6 LTS","os_major":"20","os_minor":"04","os_codename":"Focal Fossa","os_platform":"ubuntu","os_uname":"Linux |f77e92dad2f2 |5.15.0-105-generic |#115~20.04.1-Ubuntu SMP Mon Apr 15 17:33:04 UTC 2024 |x86_64","os_arch":"x86_64","version":"Wazuh v4.8.0","config_sum":"ab73af41699f13fdd81903b5f23d8d00","merged_sum":"4a8724b20dee0124ff9656783c490c4e","manager_host":"9a9bd3dcf169","node_name":"9a9bd3dcf169","last_keepalive":1715337557,"connection_status":"active","disconnection_time":0,"group_config_status":"synced","status_code":0,"labels":[{"id":1,"key":"#\"_agent_ip\"","value":"172.18.0.5"},{"id":1,"key":"#\"_manager_hostname\"","value":"9a9bd3dcf169"},{"id":1,"key":"#\"_node_name\"","value":"9a9bd3dcf169"},{"id":1,"key":"#\"_wazuh_version\"","value":"Wazuh v4.8.0"}]}]
2024/05/10 12:39:18 wazuh-db[147] wdb.c:1112 at wdb_exec_stmt_silent(): DEBUG: SQL statement execution failed
2024/05/10 12:39:18 wazuh-db[147] wdb_parser.c:5997 at wdb_parse_global_sync_agent_info_set(): DEBUG: Global DB Cannot execute SQL query; err database queue/db/global.db: FOREIGN KEY constraint failed
2024/05/10 12:39:18 wazuh-db[147] main.c:397 at run_worker(): DEBUG: Client 33 disconnected.
2024/05/10 12:39:18 wazuh-db[147] wdb_parser.c:859 at wdb_parse(): DEBUG: Global query: sync-agent-groups-get {"condition": "sync_status", "set_synced": true, "get_global_hash": true}
2024/05/10 12:39:18 wazuh-db[147] wdb_integrity.c:759 at wdb_get_global_group_hash(): DEBUG: No group hash was found to calculate the global group hash.

Nevertheless, it is no longer raised in the cluster.log file:

2024/05/10 12:39:07 DEBUG: [Master] [SendSync] Receiving SendSync request (remoted) from 9a9bd3dcf169 (535832)
2024/05/10 12:39:08 DEBUG: [Worker 9a9bd3dcf169] [Main] Command received: b'syn_a_w_m_p'
2024/05/10 12:39:08 DEBUG: [Worker 9a9bd3dcf169] [Main] Command received: b'new_str'
2024/05/10 12:39:08 DEBUG: [Worker 9a9bd3dcf169] [Main] Command received: b'str_upd'
2024/05/10 12:39:08 DEBUG: [Worker 9a9bd3dcf169] [Main] Command received: b'syn_a_w_m'
2024/05/10 12:39:08 INFO: [Worker 9a9bd3dcf169] [Agent-info sync] Starting.
2024/05/10 12:39:08 DEBUG: [Worker 9a9bd3dcf169] [Agent-info sync] 0/1 chunks updated in wazuh-db in 0.000s.
2024/05/10 12:39:08 INFO: [Worker 9a9bd3dcf169] [Agent-info sync] Finished in 0.003s. Updated 0 chunks.