wazuh / wazuh-qa

Wazuh - Quality Assurance
GNU General Public License v2.0
61 stars 30 forks source link

Full Yellow Local - Fix Remoted tests in 4.2 #1530

Closed juliamagan closed 2 years ago

juliamagan commented 2 years ago

Description

This issue is part of the https://github.com/wazuh/wazuh-qa/issues/1516. After the research #1493, it has been detected fails and warnings in remoted tests for Wazuh manager. In order to sanitize current tests is necessary to fix remoted tests.

Test Execution - Results

Server Local Jenkins Status
Test Execution 1 Results Build 🔴
Test Execution 2 Results Build 🔴
Test Execution 3 Results Build 🔴

Settings

Packages details

Jenkins branch QA branch
4.2 1516-4.2.0-full-green
Type Format Architecture Version Revision Tag File name
Server rpm x86_64 4.2.0 40207 v4.2.0-rc8 wazuh-manager-4.2.0-1.1493.x86_64.rpm

Environment

Provider Box OS CPU Memory
Vagrant centos/8 CentOS 8 2 1024

Setup Local_internal_opcion.conf

remoted.debug=2
wazuh_database.interval=1
wazuh_db.commit_time=2
wazuh_db.commit_time_max=3
monitord.rotate_log=0

Development

This fix should be developed in a branch from 4.2, and it must be merged in the epic issue full-green branch 1516-4.2.0-full-green

Tests Warnings
test_remoted/test_configuration/test_basic_configuration_connection_invalid_protocol.py 8
test_remoted/test_configuration/test_basic_configuration_connection_valid.py 8
test_remoted/test_configuration/test_basic_configuration_ipv6.py 2
test_remoted/test_configuration/test_basic_configuration_local_ip_valid.py 2
test_remoted/test_configuration/test_basic_configuration_queue_size_too_big.py 1
test_remoted/test_configuration/test_basic_configuration_queue_size_valid.py 3
test_remoted/test_configuration/test_basic_configuration_rids_closing_time_valid.py 8
test_remoted/test_configuration/test_basic_configuration_syslog_allowed_denied_ips_valid.py 5
test_remoted/test_configuration/test_basic_configuration_syslog_denied_ips.py 1

In order to finish this issue the following tasks should be fulfilled:

damarisg commented 2 years ago

While the issues were being created, it was possible to see that the same errors message were showing in different tests. But the method involved was different, so when starting to work, you should synchronize with those involved in the different issues that affect the same error message.

I add the list of error messages related to multiple issues.

Error messages Issues
Could not find the remoted key loading log https://github.com/wazuh/wazuh-qa/issues/1572 - https://github.com/wazuh/wazuh-qa/issues/1578 - https://github.com/wazuh/wazuh-qa/issues/1576 - https://github.com/wazuh/wazuh-qa/issues/1574
damarisg commented 2 years ago

General Research: Done. It allowed creating each fail contained on this issue.

DProvinciani commented 2 years ago

I did a full run of the remoted tests on a branch that combines the changes on the https://github.com/wazuh/wazuh-qa/pull/1625 and those on the https://github.com/wazuh/wazuh-qa/pull/1624.

The progress went much further than it was before but there were failures in a test that in fact had no failure cases in https://github.com/wazuh/wazuh-qa/issues/1573. We are going to have to take a closer look at the tests test_remoted/test_configuration as there seem to be some flaky tests as well. For example:

image

DProvinciani commented 2 years ago

Update including changes of #1624 and #1625

I performed full remoted tests executions in order to determine the global status of the tests of this component including the fixes in #1624 and #1625. I got some intermittent failures that should be analyzed.

Test Executions Date By Status
test_remoted-r1.log 2021-07-22 Diego :yellow_circle:
test_remoted-r2.log 2021-07-22 Diego test_syslog_message.py::test_syslog_message :red_circle:
test_remoted-r3.log 2021-07-22 Diego test_agents_switching_protocols.py::test_agents_switching_protocols:red_circle:
mdengra commented 2 years ago

2021-07-27

Used Wazuh-QA branch: 1570-test-socket-communication Test results with the default settings in the ossec.conf:

Test Executions Date By Status
test_remoted_local_r1.log 2021-07-27 Miguel :yellow_circle:
test_remoted_local_r2.log 2021-07-27 Miguel :red_circle:
test_remoted_local_r3.log 2021-07-27 Miguel :yellow_circle:

The following errors happened in the second running:

FAILED test_remoted/test_agent_communication/test_multi_agent_protocols_communication.py::test_multi_agents_protocols_communication[UDP_56000]
FAILED test_remoted/test_agent_communication/test_multi_agent_protocols_communication.py::test_multi_agents_protocols_communication[TCP,UDP_56000]

The origin of both errors is the same:

                        logger.error(f"Results expected: {accum_results}")
>                   raise TimeoutError(error_message)
E                   TimeoutError: Could not find the expected event in queue socket

/usr/local/lib/python3.6/site-packages/wazuh_testing/tools/monitoring.py:456: TimeoutError
DProvinciani commented 2 years ago

2021-07-28 Update including changes of #1662

I performed full remoted tests executions in order to determine the global status of the tests of this component including the fixes in #1662. I got some intermittent failures that should be analyzed.

Test Executions Date By Status
test_remoted-r1.log 2021-07-28 Diego test_agent_communication/test_multi_agent_protocols_communication.py :red_circle:
test_remoted-r2.log 2021-07-28 Diego :yellow_circle:
test_remoted-r3.log 2021-07-28 Diego :yellow_circle:
damarisg commented 2 years ago

Update 28-07-21

After merging #1662 the fail test_agents_switching_protocols.py reported by Diego was fixed.

I show the results:

Test Executions Date By Status
ResultsAgentCommunication.log 2021-07-28 Seyla 🟡
ResultsAgentCommunication2.log 2021-07-28 Seyla 🟡
ResultsAgentCommunication3.log 2021-07-28 Seyla 🟡
mdengra commented 2 years ago

2021-07-28

Used Wazuh-QA branch: 1530-full-yellow-remoted Test results with the default settings in the ossec.conf:

Test Executions Date By Status
test_remoted_local_r1.log 2021-07-28 Miguel :yellow_circle:
test_remoted_local_r2.log 2021-07-28 Miguel :yellow_circle:
test_remoted_local_r3.log 2021-07-28 Miguel :yellow_circle:
damarisg commented 2 years ago

Update 28-07-21

Test Results:

Test Executions Date By Status
FullYellowRemoted.log 2021-07-28 Seyla 🔴 ( https://github.com/wazuh/wazuh-qa/issues/1573)
FullYellowRemoted2.log 2021-07-28 Seyla :yellow_circle:
FullYellowRemoted3.log 2021-07-28 Seyla 🔴 (https://github.com/wazuh/wazuh-qa/issues/1575)

Note: Each test run alone works successfully and therefore is closed, but we can see that they failed in 2 random executions when I execute the Remoted cycle complete.

mdengra commented 2 years ago

2021-07-30

Used Wazuh-QA branch: 1571-fix-socket-error-in-ar-test Test results with the default settings in the ossec.conf:

Test Executions Date By Status
test_remoted_local_r1.log 2021-07-30 Miguel :red_circle:
test_remoted_local_r2.log 2021-07-30 Miguel :red_circle:
test_remoted_local_r3.log 2021-07-30 Miguel :red_circle:

The same test has failed in all rounds:

=========================== short test summary info ============================
FAILED test_remoted/test_configuration/test_basic_configuration_connection_invalid_protocol.py::test_invalid_protocol[Testing,UDP_secure_1514]
FAILED test_remoted/test_configuration/test_basic_configuration_connection_invalid_protocol.py::test_invalid_protocol[TCP,Testing_secure_1514]
FAILED test_remoted/test_configuration/test_basic_configuration_connection_invalid_protocol.py::test_invalid_protocol[TCP,UDP,Testing_secure_1514]
FAILED test_remoted/test_configuration/test_basic_configuration_connection_invalid_protocol.py::test_invalid_protocol[Testing,UDP_syslog_514]
Therefore I have run it standalone using the same branch: Test Executions Date By Status
test_basic_configuration_connection_invalid_protocol.py_local_r1.log 2021-07-30 Miguel :red_circle:
test_basic_configuration_connection_invalid_protocol.py_local_r2.log 2021-07-30 Miguel :red_circle:
test_basic_configuration_connection_invalid_protocol.py_local_r3.log 2021-07-30 Miguel :red_circle:
After confirming that errors also occur I ran it again using the 1530-full-yellow-remoted branch: Test Executions API log Date By Status
test_basic_configuration_connection_invalid_protocol.py_local_r1.log api_r1.log 2021-07-30 Miguel :red_circle:
test_basic_configuration_connection_invalid_protocol.py_local_r2.log api_r2.log 2021-07-30 Miguel :red_circle:
test_basic_configuration_connection_invalid_protocol.py_local_r3.log api_r3.log 2021-07-30 Miguel :red_circle:

The failure is repeated every time. Although this test appears in the issue https://github.com/wazuh/wazuh-qa/issues/1573 (case 1), the error detected is different:

    def get_manager_configuration(section=None, field=None):
        """Get Wazuh manager configuration response from API using GET /manager/configuration
.
.
.   
        response = requests.get(api_query, headers=api_details['auth_headers'], verify=False)

>       assert response.json()['error'] == 0, f"Wazuh API response status different from 0: {response.json()}"
E       AssertionError: Wazuh API response status different from 0: {'title': 'Wazuh Internal Error', 'detail': 'Timeout executing API request', 'dapi_errors': {'node01': {'error': 'Timeout executing API request', 'logfile': 'WAZUH_HOME/logs/api.log'}}, 'error': 3021}

/usr/local/lib/python3.6/site-packages/wazuh_testing/api.py:128: AssertionError

Since the problem seems to be API related I attach these logs.

damarisg commented 2 years ago

Update 30-07-21

It is the last execution with all changes.

Test Executions Date By Status
FullYellowRemoted.log 2021-07-30 Seyla 🟡
FullYellowRemoted2.log 2021-07-28 Seyla 🟡
FullYellowRemoted3.log 2021-07-28 Seyla 🟡
DProvinciani commented 2 years ago

2021-07-30

Used Wazuh-QA branch: 1530-full-yellow-remoted Test results with the default settings in the ossec.conf:

Test Executions Date By Status
test_remoted-full_r1.log 2021-07-30 Diego :red_circle:
test_remoted-full_r2.log 2021-07-30 Diego :yellow_circle:
test_remoted-full_r3.log 2021-07-30 Diego :yellow_circle:

The first round shows failures that should ve analyzed in the next tests:

FAILED wazuh-qa/tests/integration/test_remoted/test_agent_communication/test_multi_agent_protocols_communication.py::test_multi_agents_protocols_communication[UDP_1514]
FAILED wazuh-qa/tests/integration/test_remoted/test_socket_communication/test_syslog_message.py::test_syslog_message[syslog_UDP_514-dummy_bad_formatted_pri_header]
juliamagan commented 2 years ago

I have launched three rounds of test_remoted/test_configuration/test_basic_configuration_connection_invalid_protocol.py and everything seems to work properly.

Tests results

CentOS 8 Manager Reports
R1 🟡 Test results
R2 🟡 Test results
R3 🟡 Test results
damarisg commented 2 years ago

02-08-21

Used Wazuh-QA branch: 1530-full-yellow-remoted

Test Executions By Status
FullRemoted.log Seyla 🔴
FullRemoted2.log Seyla 🔴
FullRemoted3.log Seyla 🔴
pereyra-m commented 2 years ago

2021-08-02

Wazuh-QA branch: 1530-full-yellow-remoted

The test test_remoted/test_socket_communication/test_syslog_message.py was run isolated on a clean environment many times, but all the tests were successful

Test Executions Date By Status
test_syslog_message_1.log 2021-08-02 Matias :yellow_circle:
test_syslog_message_2.log 2021-08-02 Matias :yellow_circle:
test_syslog_message_3.log 2021-08-02 Matias :yellow_circle:
test_syslog_message_4.log 2021-08-02 Matias :yellow_circle:
test_syslog_message_5.log 2021-08-02 Matias :yellow_circle:
test_syslog_message_6.log 2021-08-02 Matias :yellow_circle:
test_syslog_message_7.log 2021-08-02 Matias :yellow_circle:
test_syslog_message_8.log 2021-08-02 Matias :yellow_circle:
test_syslog_message_9.log 2021-08-02 Matias :yellow_circle:
test_syslog_message_10.log 2021-08-02 Matias :yellow_circle:

Update

These are the results of the tests with the whole test_remoted folder.

Test Executions Date By Status
test_remoted_1.log 2021-08-02 Matias :red_circle:
test_remoted_2.log 2021-08-02 Matias :red_circle:
test_remoted_3.log 2021-08-02 Matias :yellow_circle:
test_remoted_4.log 2021-08-02 Matias :yellow_circle:
test_remoted_5.log 2021-08-02 Matias :yellow_circle:
test_remoted_6.log 2021-08-02 Matias :yellow_circle:
test_remoted_7.log 2021-08-02 Matias :yellow_circle:
test_remoted_8.log 2021-08-02 Matias :red_circle:
test_remoted_9.log 2021-08-02 Matias :red_circle:
test_remoted_10.log 2021-08-02 Matias :yellow_circle:

Update

The remoted folder was tested again, but this time without the -x parameter, so the test won't stop if it fails

Test Executions Date By Status
test_remoted_rerun_1.log 2021-08-02 Matias :red_circle:
test_remoted_rerun_2.log 2021-08-02 Matias :yellow_circle:
test_remoted_rerun_3.log 2021-08-02 Matias :yellow_circle:
test_remoted_rerun_4.log 2021-08-02 Matias :yellow_circle:
test_remoted_rerun_5.log 2021-08-02 Matias :yellow_circle:
test_remoted_rerun_6.log 2021-08-02 Matias :yellow_circle:
test_remoted_rerun_7.log 2021-08-02 Matias :yellow_circle:
test_remoted_rerun_8.log 2021-08-02 Matias :yellow_circle:
test_remoted_rerun_9.log 2021-08-02 Matias :yellow_circle:
test_remoted_rerun_10.log 2021-08-02 Matias :yellow_circle:
Rebits commented 2 years ago

03-08-21

Used Wazuh-QA branch: 1530-full-yellow-remoted Test Executions Status
R1 🔴
DProvinciani commented 2 years ago

I spent some time analyzing the issues in test_multi_agent_protocols_communication.py reported in the next comments:

In all the executions in which we observe failures, they were caused because the Man In The Middle didn't find the event that it was looking for.

Having said that, I analyzed the MITM implementation and how it is used in these tests. Essentially for these tests the MITM will monitor the UDP socket queue/sockets/queue. The way in which it will monitor the queue will be:

  1. Rename the queue to queue.original.
  2. Read the queue socket from the MITM.

Considering the fact that analysisd also reads from that queue, the tests also stop the execution of that daemon durin the test execucion.

From this analysis we can make the next conclusions:

We are analyzing the second item to see whether or not is the cause of the issue.

pereyra-m commented 2 years ago

03/08/21

The intermittent failures of test_multi_agents_protocols_communication may be related to a socket saturation, so the test is unable to read the required message.

Many different approaches are being carried out to read the queue socket (strace, kernel probes, etc), so we can probe the message is present but the test failed. Also, even more modules are being removed from the default configuration, to reduce the messages in the socket.

Update

The whole test_remoted folder was run with this new_ossec.conf file. Basically, all the modules are disabled and all the localfile entries were removed. The purpose of this approach is to reduce the socket usage and verify the tests behavior in this situation. Here are the results

Test Executions Date By Status
test_remoted_no_logcollector_1.log 2021-08-03 Matias :yellow_circle:
test_remoted_no_logcollector_2.log 2021-08-03 Matias :yellow_circle:
test_remoted_no_logcollector_3.log 2021-08-03 Matias :yellow_circle:
test_remoted_no_logcollector_4.log 2021-08-03 Matias :yellow_circle:
test_remoted_no_logcollector_5.log 2021-08-03 Matias :yellow_circle:
test_remoted_no_logcollector_6.log 2021-08-03 Matias :yellow_circle:
test_remoted_no_logcollector_7.log 2021-08-03 Matias :yellow_circle:
test_remoted_no_logcollector_8.log 2021-08-03 Matias :yellow_circle:
test_remoted_no_logcollector_9.log 2021-08-03 Matias :yellow_circle:
test_remoted_no_logcollector_10.log 2021-08-03 Matias :yellow_circle:

The tests results suggest a more consistent result, but more testing is needed to be sure.

DProvinciani commented 2 years ago

03/08/21

Continuing with the analysis made in https://github.com/wazuh/wazuh-qa/issues/1530#issuecomment-891932778, I modified the Wazuh testing framework to print int the console the events received by the Man In The Middle from the queue socket. The changes are implemented in the next branch:

https://github.com/wazuh/wazuh-qa/tree/1618-poc-mitm-monitor

After this, I performed multiples executions of the tests in order to reproduce the issue and confirm the thinking that we have related to the QueueMonitor not analyzing the events as fast as possible to avoid the timeout. We didn't manage to reproduce the issue so, the next step will be to run the tests in multiples environments in paralell.

juliamagan commented 2 years ago

04-08-21

Used Wazuh-QA branch: 1530-full-yellow-remoted Test Executions Status
R1 🔴
R2 🔴
R3 🔴
pereyra-m commented 2 years ago

04/08/21

These are the results of the whole remoted tests folder but disabling all the modules including logcollector. This is a re-run of the tests in this comment.

Test Executions Date By Status
test_remoted_no_logcollector_rerun_1.log 2021-08-04 Matias :yellow_circle:
test_remoted_no_logcollector_rerun_2.log 2021-08-04 Matias :yellow_circle:
test_remoted_no_logcollector_rerun_3.log 2021-08-04 Matias :yellow_circle:
test_remoted_no_logcollector_rerun_4.log 2021-08-04 Matias :yellow_circle:
test_remoted_no_logcollector_rerun_5.log 2021-08-04 Matias :red_circle:
test_remoted_no_logcollector_rerun_6.log 2021-08-04 Matias :yellow_circle:
test_remoted_no_logcollector_rerun_7.log 2021-08-04 Matias :yellow_circle:
test_remoted_no_logcollector_rerun_8.log 2021-08-04 Matias :yellow_circle:
test_remoted_no_logcollector_rerun_9.log 2021-08-04 Matias :yellow_circle:
test_remoted_no_logcollector_rerun_10.log 2021-08-04 Matias :yellow_circle:

These results may suggest that the socket saturation is a possible cause, but still we can't be sure.

Update

These tests were repeated 18 times, with the same configuration than before minus the authd daemon. It failed 2 times, here are the results: test_remoted_no_logcollector-auth.zip

We are not seeing an improvement in the overall results, we can discard the hypothesis of a socket saturation.

In the other hand, some extra debugging messages were added to the QA framework to analyze the messages that enter the queue and the messages that are read by the monitor.

The test test_multi_agent_protocols_communication.py was run 252 times, and it failed 39 times. Here we have the logs queue_test.zip. We can conclude that

These results may indicate an issue in the message inyector and not in the QueueMonitor, because the queue is clearly not flooded.

Update

New debug messages were added to check if the sender threads were really sending the messages. The test test_multi_agent_protocols_communication.py was run 41 times, it failed 6 times. Here are the results multi_agent_debug.zip.

We can conclude that every time the test fails, both agent's threads send the corresponding messages. But they get lost in the way and they don't get to the socket. These results discard a failure in the sending threads.

DProvinciani commented 2 years ago

04/08/21

I performed full executions of the remoted tests with some messages printed in order to have more information in case of failures. Although the issue in the test_multi_agent_protocols_communication.py was not reproduced, the executions showed issues in the test test_syslog_message.py that should be analyzed.

Test Executions Date By Status
remoted-full-r1.log 2021-08-04 Diego :red_circle:
remoted-full-r2.log 2021-08-04 Diego :red_circle:
remoted-full-r3.log 2021-08-04 Diego :red_circle:
remoted-full-r4.log 2021-08-04 Diego :red_circle:
damarisg commented 2 years ago

04-08-21

Test Executions Date By Status
FullRemoted.log 2021-08-04 Seyla :red_circle:
FullRemoted2.log 2021-08-04 Seyla :red_circle:
FullRemoted3.log 2021-08-04 Seyla 🟡
damarisg commented 2 years ago

Full Execution - JENKINS

Results Full Remoted

Test Executions Date By Status
Results1 05-08-21 Seyla 🟡
Results2 05-08-21 Seyla 🟡
Results3 05-08-21 Seyla 🟡
pereyra-m commented 2 years ago

05/08/21

Some debugging messages were added to remoted to verify the correct reception and sending of the test messages. Here are the results:

It was possible to see in the failed test that remoted receives both messages, but fails while trying to send one of them to the socket. That is the reason why the QA framework is unable to receive the message.

More tests are being carried out to determine if the ManInTheMiddle approach of the test is responsible for the socket communication error.

Details

These are the messages generated by the test. The threads send the message, but only one of them gets in the queue

wazuh-qa/tests/integration/test_remoted/test_agent_communication/test_multi_agent_protocols_communication.py::test_multi_agents_protocols_communication[UDP_1514] 2021-08-05T13:49:28.1628171368
2021/08/05 13:49:28 wazuh-remoted[547749] debug_op.c:70 at _log(): DEBUG: Logging module auto-initialized
2021/08/05 13:49:28 wazuh-remoted[547749] main.c:130 at main(): DEBUG: Wazuh home directory: /var/ossec
2021/08/05 13:49:28 wazuh-remoted[547749] main.c:148 at main(): DEBUG: This is not a worker
2021-08-05 13:49:59,001 - wazuh_testing - ERROR - Could not find the expected event in queue socket
2021-08-05 13:49:59,002 - wazuh_testing - ERROR - Results accumulated: 0
2021-08-05 13:49:59,002 - wazuh_testing - ERROR - Results expected: 1
Agent simulator sending msg: b'!11075!#AES:\x18^\xe2\xdee\xc9S\x9f\xe2F\xafp\x1b\x9dw\x8c+\xdf@\xc9>\x10iv\xd5\xa4\xffR2~\x9f\x18Y\\\xe3\xd6\xf4\x9b\xad \r\xbd-\xdcD-\xc1\x07/\xd7\x0eg\xeb\x1fJ\xfd\x80\x95\xbe\xcf\x0f\xe2h\xb5\xe6\xf7\x92R\xad\xca.\x0c\x10\xb1!\x98\xb6\r\xba\xce\x02\n\x81\xcf\xa39\x91\xac\xf8P\xc8\x96\xb9\x93l\xe4\xbe*t\xcbU\x04\\6\xb4\xa3F\x06G\x88\xd2^\xc7\x96i\x89K\x81\xf1\xd1\xc1caS\xad1\xe8\x7f'
Agent simulator sending msg: b"!11076!#AES:\xbc\xa0\xd0\xb3^\xeaZl\x10el\xa1-\x058\xf3\xd0\x9b-\xc7\x83\x9b\\\xc2\xf2GC_\xf6\xbaB1\xb6\xd0x~`\xc9\xcc\xea\xcb\xf1\x04\x85\xa9{<\xfa\xa5\xdb\xa3\xbb\xed\x05{\xde\x16\xdf\xe3D\x10\xcf\x89H\xc2\xa4\xb8\x1f\xce\n\t\x8abBUo\x9d\r'9K\x91\xcdp{&\xa3\xc7\xd3\xa6\xe5\xc6\x03\xdf@\xa3\xfaW\xc2\x0ft,\x89\xae^\xb8\xda\xee\x8d0\x89\x93\x06\xd4(3\xdc-\xddT\xd7\xabI\xbe\xaf\xc2'\x96"
Inserting item in queue: b'1:[11076] (6-2948d7f5-debian7) any->/test.log:Feb 23 17:18:20 manager sshd[40657]: test message from agent 11076\x00'
Got msg from queue: b'1:[11076] (6-2948d7f5-debian7) any->/test.log:Feb 23 17:18:20 manager sshd[40657]: test message from agent 11076\x00'
FAILED

Were we have the messages printed by remoted related to the agents 11075 and 11076

2021/08/05 13:49:38 wazuh-remoted[547749] secure.c:547 at HandleSecureMessage(): DEBUG: Remoted received: (1:/test.log:Feb 23 17:18:20 manager sshd[40657]: test message from agent 11075) ([11075] (5-6400c8ef-debian7) any)
2021/08/05 13:49:38 wazuh-remoted[547749] secure.c:547 at HandleSecureMessage(): DEBUG: Remoted received: (1:/test.log:Feb 23 17:18:20 manager sshd[40657]: test message from agent 11076) ([11076] (6-2948d7f5-debian7) any)
2021/08/05 13:49:38 wazuh-remoted[547749] os_net.c:506 at OS_SendUnix(): DEBUG: Sending event to queue: (1:[11076] (6-2948d7f5-debian7) any->/test.log:Feb 23 17:18:20 manager sshd[40657]: test message from agent 11076) (113) 

The message for the agent 11075 is never sent to the queue.

Update

The tests were run again, but some extra debugging messages were added in mq_op.c, os_net.c and secure.c ( modified files.zip ). The ossec.log file was saved for every test.

It can be seen that the connection with the socket fails regularly, even in tests that pass. Here is an example: ossec_queue_2.log queue_test_2.log

The messages from the agent 15556 are received and sent right away

2021/08/05 23:45:11 wazuh-remoted[791292] secure.c:547 at HandleSecureMessage(): DEBUG: Remoted received: (1:/test.log:Feb 23 17:18:20 manager sshd[40657]: test message from agent 15556) ([15556] (2-6c8107e9-debian7) any)

2021/08/05 23:45:11 wazuh-remoted[791292] os_net.c:508 at OS_SendUnix(): DEBUG: Sending event to queue: (1:[15556] (2-6c8107e9-debian7) any->/test.log:Feb 23 17:18:20 manager sshd[40657]: test message from agent 15556) (113) 

2021/08/05 23:45:12 wazuh-remoted[791381] mq_op.c:47 at StartMQ(): DEBUG: Connected succesfully to 'queue/sockets/queue' after 0 attempts

But the agent 15555 needs a re-connection to queue to send it successfully

2021/08/05 23:45:11 wazuh-remoted[791292] secure.c:547 at HandleSecureMessage(): DEBUG: Remoted received: (1:/test.log:Feb 23 17:18:20 manager sshd[40657]: test message from agent 15555) ([15555] (1-39ae3654-debian7) any)

2021/08/05 23:45:11 wazuh-remoted[791292] os_net.c:504 at OS_SendUnix(): DEBUG: Failed sending event to queue (OS_SOCKTERR): (1:[15555] (1-39ae3654-debian7) any->/test.log:Feb 23 17:18:20 manager sshd[40657]: test message from agent 15555) (113) 

2021/08/05 23:45:11 wazuh-remoted[791292] mq_op.c:93 at SendMSG(): ERROR: socketerr (not available).

2021/08/05 23:45:11 wazuh-remoted[791292] secure.c:556 at HandleSecureMessage(): ERROR: (1210): Queue 'queue/sockets/queue' not accessible: 'Connection refused'

2021/08/05 23:45:11 wazuh-remoted[791292] mq_op.c:47 at StartMQ(): DEBUG: Connected succesfully to 'queue/sockets/queue' after 0 attempts

2021/08/05 23:45:11 wazuh-remoted[791292] secure.c:561 at HandleSecureMessage(): INFO: Successfully reconnected to 'queue/sockets/queue'

2021/08/05 23:45:11 wazuh-remoted[791292] os_net.c:508 at OS_SendUnix(): DEBUG: Sending event to queue: (1:[15555] (1-39ae3654-debian7) any->/test.log:Feb 23 17:18:20 manager sshd[40657]: test message from agent 15555) (113) 

This behavior is found in every test, and this explains why some messages were found duplicated: it seems that the messages arrives regardless the errors

wazuh-qa/tests/integration/test_remoted/test_agent_communication/test_multi_agent_protocols_communication.py::test_multi_agents_protocols_communication[TCP_1514] 2021/08/05 23:45:01 wazuh-remoted[791292] debug_op.c:70 at _log(): DEBUG: Logging module auto-initialized
2021/08/05 23:45:01 wazuh-remoted[791292] main.c:130 at main(): DEBUG: Wazuh home directory: /var/ossec
2021/08/05 23:45:01 wazuh-remoted[791292] main.c:148 at main(): DEBUG: This is not a worker
Agent simulator sending msg: b'!15555!#AES:Z\x01\xaa\xda\xe1Y\x8bE\x0455\xa7y\x14\x87\xbeq\xdce\xffp\x9e\x7f\xc9\xcd\xb9<\x80\xb5\xbez\xc7\xcah\xf6\xcd\x04\xbc\xe6\xa8\x81\xf0\x9a\xec%Y\xfap\x9e\xf8\xee\x1ei\xbd\xfa L\x06Aqj;2l\x0ew*\xd4\x10\x1b/\xf4\xa5.\x8c6\xce\xdd"\x86@\x7fbZ]\x03\xac`\x9f\xd6\x08A\xf2j\x18\x9e\xbfS\xf8\x8f\x91\x9d\x07\\H\x06\xcf@\x8ak\t\xaa\xf17\xb1\xab\x86)]\xf13\xdd\xd8&\'|`\x15'
Inserting item in queue: b'1:[15555] (1-39ae3654-debian7) any->/test.log:Feb 23 17:18:20 manager sshd[40657]: test message from agent 15555\x00'
Agent simulator sending msg: b'!15556!#AES:\xe7\xe0\xe2\t\xccN\xfdg\x81\x1e\xdd\xb4)|\x1eOn\x94_\x13\x1a\xde\xd8(hk\x14\xd0\xcd\xac#\xa2.>\xbai\x90\xef\xd3>\xa6\xb2\xe9l\x16\x88\x96\xabHP\xaa\xe0\x06\xaa&\x851\x99\x1d\x1e\xb90\xd8\xc0\xab\x1b\xd4\x1cs)\x11\x93\xd0\xea,\xccY\xac=\xe7n\xb6G\xafv\x90\xecM\xb5VO\x9e\x84\x19\x10#\x01\xbe\xe9>\xecx~\xbc\x9a\x1f\\\x18j\xae\xf6\xbda\\\x08\xe8\xda\xd7"$a\xa3=\x90\xa1\xe3\x00\x16'
Inserting item in queue: b'1:[15556] (2-6c8107e9-debian7) any->/test.log:Feb 23 17:18:20 manager sshd[40657]: test message from agent 15556\x00'
Got msg from queue: b'1:[15555] (1-39ae3654-debian7) any->/test.log:Feb 23 17:18:20 manager sshd[40657]: test message from agent 15555\x00'
Got msg from queue: b'1:[15555] (1-39ae3654-debian7) any->/test.log:Feb 23 17:18:20 manager sshd[40657]: test message from agent 15555\x00'
Got msg from queue: b'1:[15556] (2-6c8107e9-debian7) any->/test.log:Feb 23 17:18:20 manager sshd[40657]: test message from agent 15556\x00'
PASSED

Failed test case analysis

Analyzing a failed case (ossec_queue_87.log, queue_test_87.log) we can see that the message from the agent 16583 is found but the one sent by 16584 is not

wazuh-qa/tests/integration/test_remoted/test_agent_communication/test_multi_agent_protocols_communication.py::test_multi_agents_protocols_communication[TCP,UDP_1514] 2021-08-06T01:21:58.1628212918
2021/08/06 01:21:58 wazuh-remoted[841391] debug_op.c:70 at _log(): DEBUG: Logging module auto-initialized
2021/08/06 01:21:58 wazuh-remoted[841391] main.c:130 at main(): DEBUG: Wazuh home directory: /var/ossec
2021/08/06 01:21:58 wazuh-remoted[841391] main.c:148 at main(): DEBUG: This is not a worker
2021-08-06 01:22:38,684 - wazuh_testing - ERROR - Could not find the expected event in queue socket
2021-08-06 01:22:38,685 - wazuh_testing - ERROR - Results accumulated: 0
2021-08-06 01:22:38,685 - wazuh_testing - ERROR - Results expected: 1
Agent simulator sending msg: b'!16583!#AES:\x84\xfc\xfa\xa9Y\xf4\xa5EC\x80\x90U0\x05\xcaA\xb3\xafO\x96\xcd\xb9\xcf\xef\xc2\x998x7\x8d\x88\x97=\xc1\x04\xd8\xb2\x91C&\xd2\x87\x86c\xabOBH\xc1\x05\x18\x9f\x80\xaeq\xa5 \x0e\xd1XN\xd0\xf2^F\x07\x8d\xc7\xf7B*\xfc\\K\xc2f\xe7F\x17\xd1I\x9a\x12m\x04\x8c<\xea\xfa\x1f\xb4G\x18\x0c\xdc\xb2\rTZ\xfaL\xd0\xc7I\xdb\x83\'\x19\x9ek\x82~G\x80"\xb8\xdb#v\x19\xf3\x16^x)v\xaa\xbe'
Agent simulator sending msg: b'!16584!#AES:X\xd4\xb3\x1cA`\xd3t\xeb+\xeb\x9a\xee=!\xc0\x84\xaf\xd2xj!\xcaL\xbb\xf7\xc9G=\x84\x17\x016\x8f,\xd4\xe2]~1\xf1^\x9e \xd9\xab.\xd0C-\x834aF\xfa\x8b\xf3\xe8\xc5\x0fgk>\xea\x96\x1a\xb2A-\xc1\xc2*\x03\xfa\xa4\xdc\xac\x90U@m\x1f\xd2\xb4\xf1\xec\xdd\x16I\xca\x86%\x17\x0b\xcd,\x01\xe7\xa7#S\xd6\t\xd7\x136\x01\xe8\xb7Y~i\xb4\xc3\xb3d:\xd7GO\x00\xad=\xda>\xc8\x9c\x0e'
Inserting item in queue: b'1:[16583] (9-27f5b5fb-debian7) any->/test.log:Feb 23 17:18:20 manager sshd[40657]: test message from agent 16583\x00'
Got msg from queue: b'1:[16583] (9-27f5b5fb-debian7) any->/test.log:Feb 23 17:18:20 manager sshd[40657]: test message from agent 16583\x00'
Got msg from queue: b'1:[16583] (9-27f5b5fb-debian7) any->/test.log:Feb 23 17:18:20 manager sshd[40657]: test message from agent 16583\x00'
FAILED

But the message to queue is sent by remoted

2021/08/06 01:22:08 wazuh-remoted[841391] secure.c:547 at HandleSecureMessage(): DEBUG: Remoted received: (1:/test.log:Feb 23 17:18:20 manager sshd[40657]: test message from agent 16583) ([16583] (9-27f5b5fb-debian7) any)
2021/08/06 01:22:08 wazuh-remoted[841391] os_net.c:504 at OS_SendUnix(): DEBUG: Failed sending event to queue (OS_SOCKTERR): (1:[16583] (9-27f5b5fb-debian7) any->/test.log:Feb 23 17:18:20 manager sshd[40657]: test message from agent 16583) (113) 
2021/08/06 01:22:08 wazuh-remoted[841391] mq_op.c:93 at SendMSG(): ERROR: socketerr (not available).
2021/08/06 01:22:08 wazuh-remoted[841391] secure.c:547 at HandleSecureMessage(): DEBUG: Remoted received: (1:/test.log:Feb 23 17:18:20 manager sshd[40657]: test message from agent 16584) ([16584] (10-5079b3c4-debian7) any)
2021/08/06 01:22:08 wazuh-remoted[841391] secure.c:556 at HandleSecureMessage(): ERROR: (1210): Queue 'queue/sockets/queue' not accessible: 'Connection refused'
2021/08/06 01:22:08 wazuh-remoted[841391] os_net.c:508 at OS_SendUnix(): DEBUG: Sending event to queue: (1:[16584] (10-5079b3c4-debian7) any->/test.log:Feb 23 17:18:20 manager sshd[40657]: test message from agent 16584) (114) 
2021/08/06 01:22:08 wazuh-remoted[841391] mq_op.c:47 at StartMQ(): DEBUG: Connected succesfully to 'queue/sockets/queue' after 0 attempts
2021/08/06 01:22:08 wazuh-remoted[841391] secure.c:561 at HandleSecureMessage(): INFO: Successfully reconnected to 'queue/sockets/queue'
2021/08/06 01:22:08 wazuh-remoted[841391] os_net.c:508 at OS_SendUnix(): DEBUG: Sending event to queue: (1:[16583] (9-27f5b5fb-debian7) any->/test.log:Feb 23 17:18:20 manager sshd[40657]: test message from agent 16583) (113) 
2021/08/06 01:22:39 wazuh-remoted[841560] mq_op.c:47 at StartMQ(): DEBUG: Connected succesfully to 'queue/sockets/queue' after 0 attempts

Conclusion

The queue interception is unreliable: sometimes the manager fails to send a message but it arrives anyway, and sometimes it successfully sends a message but it never arrives to the socket.

The best option is to change the test structure, and validate the message received by remoted with another method: the agent's status update, an alert generation, etc.

pereyra-m commented 2 years ago

A new issue was created with enhancement proposal #1690.

The debugging process of the failed tests would be much faster if the required information were easily available.

damarisg commented 2 years ago

TESTS EXECUTION

11-08-21

Test Executions Date By Status Fails
ResultPedroRemoted1.2.log 11-08-21 Seyla 🟡
ResultPedroRemoted1.1.log 11-08-21 Seyla 🟡
ResultPedroRemoted.log 11-08-21 Seyla 🟡
ResultPedroRemoted2.log 11-08-21 Seyla 🔴 test_agent_communication/test_agent_version_shared_configuration_startup_message
ResultPedroRemoted3.log 11-08-21 Seyla 🔴 test_agent_communication/test_request_agent_info.py (related to https://github.com/wazuh/wazuh-qa/issues/1698) test_agent_communication/test_agent_version_shared_configuration_startup_message
ResultPedroRemoted4.log 11-08-21 Seyla 🟡
FullRemoted.log 11-08-21 Seyla 🟢
Remoted2.log 11-08-21 Seyla 🔴 test_socket_communication/test_syslog_message
Remoted3.log 11-08-21 Seyla 🔴 test_agent_communication/test_agent_version_shared_configuration_startup_message

Note: Comments deleted was added bellow in a common table.

10-08-21

Test Executions Date By Status
ResultFullRemoted.log 09-08-21 Seyla 🟡
ResultFullRemoted2.log 09-08-21 Seyla 🟡
ResultFullRemoted3.log 09-08-21 Seyla 🟡
ResultFullRemoted4.log 09-08-21 Diego 🟡
ResultFullRemoted5.log 09-08-21 Diego 🟡
ResultFullRemoted6.log 09-08-21 Diego 🟡
ResultFullRemoted7.log 09-08-21 Matias 🟡
ResultFullRemoted8.log 09-08-21 Matias 🟡
ResultFullRemoted9.log 09-08-21 Matias 🟡
ResultFullRemoted10.log 10-08-21 Seyla 🟡
ResultFullRemoted11.log 10-08-21 Seyla 🟡
ResultFullRemoted12.log 10-08-21 Seyla 🟡

08-09-21

All the remoted tests were run after merging the fix in #1692. All of them passed.

Test Executions Date By Status
test_remoted-r1.log 09-08-21 Diego :yellow_circle:
test_remoted-r2.log 09-08-21 Diego :yellow_circle:
test_remoted-r3.log 09-08-21 Diego :yellow_circle:
ResultFullRemoted.log 09-08-21 Seyla 🟡
ResultFullRemoted2.log 09-08-21 Seyla 🟡
ResultFullRemoted3.log 09-08-21 Seyla 🔴 test_active_response_send_ar.py::test_active_response_ar_sending
ResultFullRemoted4.log 09-08-21 Seyla 🔴 test_multi_agent_status.py::test_protocols_communication test_basic_configuration_queue_size_valid.py::test_queue_size_valid
ResultFullRemoted5.log 09-08-21 Seyla 🟡
ResultFullRemoted7.log 09-08-21 Seyla 🟡
ResultFullRemoted6.log 09-08-21 Seyla 🔴 test_multi_agent_status.py::test_protocols_communication
ResultFullRemoted8.log 09-08-21 Seyla 🔴 test_active_response_send_ar.py::test_active_response_ar_sending
ResultFullRemoted9.log 09-08-21 Seyla 🔴 test_multi_agent_status.py::test_protocols_communication

06-08-21

All the remoted tests were run after the merging the fix in #1692. All of them passed.

Test Executions Date By Status
test_full_remoted_after_fix_1.log 2021-08-06 Matias :yellow_circle:
test_full_remoted_after_fix_2.log 2021-08-06 Matias :yellow_circle:
test_full_remoted_after_fix_3.log 2021-08-06 Matias :yellow_circle:
test_full_remoted_after_fix_4.log 2021-08-06 Matias :yellow_circle:
test_full_remoted_after_fix_5.log 2021-08-06 Matias :yellow_circle:
snaow commented 2 years ago

Closed by https://github.com/wazuh/wazuh-qa/pull/1717