juliamagan commented 2 years ago

Description

This issue is part of the https://github.com/wazuh/wazuh-qa/issues/1516. After the research #1493, it has been detected fails and warnings in remoted tests for Wazuh manager. In order to sanitize current tests is necessary to fix remoted tests.

Test Execution - Results

Server	Local	Jenkins	Status
Test Execution 1	Results	Build	🔴
Test Execution 2	Results	Build	🔴
Test Execution 3	Results	Build	🔴

Settings

Packages details

Jenkins branch	QA branch
4.2	1516-4.2.0-full-green

Type	Format	Architecture	Version	Revision	Tag	File name
Server	rpm	x86_64	4.2.0	40207	v4.2.0-rc8	wazuh-manager-4.2.0-1.1493.x86_64.rpm

Environment

Provider	Box	OS	CPU	Memory
Vagrant	centos/8	CentOS 8	2	1024

Setup Local_internal_opcion.conf

remoted.debug=2
wazuh_database.interval=1
wazuh_db.commit_time=2
wazuh_db.commit_time_max=3
monitord.rotate_log=0

Development

This fix should be developed in a branch from 4.2, and it must be merged in the epic issue full-green branch 1516-4.2.0-full-green

Tests	Warnings
test_remoted/test_configuration/test_basic_configuration_connection_invalid_protocol.py	8
test_remoted/test_configuration/test_basic_configuration_connection_valid.py	8
test_remoted/test_configuration/test_basic_configuration_ipv6.py	2
test_remoted/test_configuration/test_basic_configuration_local_ip_valid.py	2
test_remoted/test_configuration/test_basic_configuration_queue_size_too_big.py	1
test_remoted/test_configuration/test_basic_configuration_queue_size_valid.py	3
test_remoted/test_configuration/test_basic_configuration_rids_closing_time_valid.py	8
test_remoted/test_configuration/test_basic_configuration_syslog_allowed_denied_ips_valid.py	5
test_remoted/test_configuration/test_basic_configuration_syslog_denied_ips.py	1

In order to finish this issue the following tasks should be fulfilled:

[x] Fix remoted integration tests. :
[x] Full Green/ Full Yellow in remoted tests for every supported OS.
[x] Merge 1530-full-yellow-remoted to 1516-4.2.0-full-green

damarisg commented 2 years ago

While the issues were being created, it was possible to see that the same errors message were showing in different tests. But the method involved was different, so when starting to work, you should synchronize with those involved in the different issues that affect the same error message.

I add the list of error messages related to multiple issues.

Error messages	Issues
`Could not find the remoted key loading log`	https://github.com/wazuh/wazuh-qa/issues/1572 - https://github.com/wazuh/wazuh-qa/issues/1578 - https://github.com/wazuh/wazuh-qa/issues/1576 - https://github.com/wazuh/wazuh-qa/issues/1574

damarisg commented 2 years ago

General Research: Done. It allowed creating each fail contained on this issue.

DProvinciani commented 2 years ago

I did a full run of the remoted tests on a branch that combines the changes on the https://github.com/wazuh/wazuh-qa/pull/1625 and those on the https://github.com/wazuh/wazuh-qa/pull/1624.

The progress went much further than it was before but there were failures in a test that in fact had no failure cases in https://github.com/wazuh/wazuh-qa/issues/1573. We are going to have to take a closer look at the tests test_remoted/test_configuration as there seem to be some flaky tests as well. For example:

test_basic_configuration_connection_invalid_protocol pass in my local environment but in issue https://github.com/wazuh/wazuh-qa/issues/1573 it is reported as failing.
test_basic_configuration_connection_valid pass in my local environment but in issue https://github.com/wazuh/wazuh-qa/issues/1573 it is reported as failing.
test_basic_configuration_ipv6 fails in my local environment but the issue https://github.com/wazuh/wazuh-qa/issues/1573 does not report any failures.

DProvinciani commented 2 years ago

Update including changes of #1624 and #1625

I performed full remoted tests executions in order to determine the global status of the tests of this component including the fixes in #1624 and #1625. I got some intermittent failures that should be analyzed.

Test Executions	Date	By	Status
test_remoted-r1.log	2021-07-22	Diego	:yellow_circle:
test_remoted-r2.log	2021-07-22	Diego	`test_syslog_message.py::test_syslog_message` :red_circle:
test_remoted-r3.log	2021-07-22	Diego	`test_agents_switching_protocols.py::test_agents_switching_protocols`:red_circle:

mdengra commented 2 years ago

2021-07-27

Used Wazuh-QA branch: 1570-test-socket-communication Test results with the default settings in the ossec.conf:

Test Executions	Date	By	Status
test_remoted_local_r1.log	2021-07-27	Miguel	:yellow_circle:
test_remoted_local_r2.log	2021-07-27	Miguel	:red_circle:
test_remoted_local_r3.log	2021-07-27	Miguel	:yellow_circle:

The following errors happened in the second running:

FAILED test_remoted/test_agent_communication/test_multi_agent_protocols_communication.py::test_multi_agents_protocols_communication[UDP_56000]
FAILED test_remoted/test_agent_communication/test_multi_agent_protocols_communication.py::test_multi_agents_protocols_communication[TCP,UDP_56000]

The origin of both errors is the same:

                        logger.error(f"Results expected: {accum_results}")
>                   raise TimeoutError(error_message)
E                   TimeoutError: Could not find the expected event in queue socket

/usr/local/lib/python3.6/site-packages/wazuh_testing/tools/monitoring.py:456: TimeoutError

DProvinciani commented 2 years ago

2021-07-28 Update including changes of #1662

I performed full remoted tests executions in order to determine the global status of the tests of this component including the fixes in #1662. I got some intermittent failures that should be analyzed.

Test Executions	Date	By	Status
test_remoted-r1.log	2021-07-28	Diego	`test_agent_communication/test_multi_agent_protocols_communication.py` :red_circle:
test_remoted-r2.log	2021-07-28	Diego	:yellow_circle:
test_remoted-r3.log	2021-07-28	Diego	:yellow_circle:

damarisg commented 2 years ago

Update 28-07-21

After merging #1662 the fail test_agents_switching_protocols.py reported by Diego was fixed.

I show the results:

Test Executions	Date	By	Status
ResultsAgentCommunication.log	2021-07-28	Seyla	🟡
ResultsAgentCommunication2.log	2021-07-28	Seyla	🟡
ResultsAgentCommunication3.log	2021-07-28	Seyla	🟡

mdengra commented 2 years ago

2021-07-28

Used Wazuh-QA branch: 1530-full-yellow-remoted Test results with the default settings in the ossec.conf:

Test Executions	Date	By	Status
test_remoted_local_r1.log	2021-07-28	Miguel	:yellow_circle:
test_remoted_local_r2.log	2021-07-28	Miguel	:yellow_circle:
test_remoted_local_r3.log	2021-07-28	Miguel	:yellow_circle:

damarisg commented 2 years ago

Update 28-07-21

Test Results:

Test Executions	Date	By	Status
FullYellowRemoted.log	2021-07-28	Seyla	🔴 ( https://github.com/wazuh/wazuh-qa/issues/1573)
FullYellowRemoted2.log	2021-07-28	Seyla	:yellow_circle:
FullYellowRemoted3.log	2021-07-28	Seyla	🔴 (https://github.com/wazuh/wazuh-qa/issues/1575)

Note: Each test run alone works successfully and therefore is closed, but we can see that they failed in 2 random executions when I execute the Remoted cycle complete.

mdengra commented 2 years ago

2021-07-30

Used Wazuh-QA branch: 1571-fix-socket-error-in-ar-test Test results with the default settings in the ossec.conf:

Test Executions	Date	By	Status
test_remoted_local_r1.log	2021-07-30	Miguel	:red_circle:
test_remoted_local_r2.log	2021-07-30	Miguel	:red_circle:
test_remoted_local_r3.log	2021-07-30	Miguel	:red_circle:

The same test has failed in all rounds:

=========================== short test summary info ============================
FAILED test_remoted/test_configuration/test_basic_configuration_connection_invalid_protocol.py::test_invalid_protocol[Testing,UDP_secure_1514]
FAILED test_remoted/test_configuration/test_basic_configuration_connection_invalid_protocol.py::test_invalid_protocol[TCP,Testing_secure_1514]
FAILED test_remoted/test_configuration/test_basic_configuration_connection_invalid_protocol.py::test_invalid_protocol[TCP,UDP,Testing_secure_1514]
FAILED test_remoted/test_configuration/test_basic_configuration_connection_invalid_protocol.py::test_invalid_protocol[Testing,UDP_syslog_514]

Therefore I have run it standalone using the same branch:	Test Executions	Date	By
test_basic_configuration_connection_invalid_protocol.py_local_r1.log	2021-07-30	Miguel	:red_circle:
test_basic_configuration_connection_invalid_protocol.py_local_r2.log	2021-07-30	Miguel	:red_circle:
test_basic_configuration_connection_invalid_protocol.py_local_r3.log	2021-07-30	Miguel	:red_circle:

After confirming that errors also occur I ran it again using the `1530-full-yellow-remoted` branch:	Test Executions	API log	Date	By
test_basic_configuration_connection_invalid_protocol.py_local_r1.log	api_r1.log	2021-07-30	Miguel	:red_circle:
test_basic_configuration_connection_invalid_protocol.py_local_r2.log	api_r2.log	2021-07-30	Miguel	:red_circle:
test_basic_configuration_connection_invalid_protocol.py_local_r3.log	api_r3.log	2021-07-30	Miguel	:red_circle:

The failure is repeated every time. Although this test appears in the issue https://github.com/wazuh/wazuh-qa/issues/1573 (case 1), the error detected is different:

    def get_manager_configuration(section=None, field=None):
        """Get Wazuh manager configuration response from API using GET /manager/configuration
.
.
.   
        response = requests.get(api_query, headers=api_details['auth_headers'], verify=False)

>       assert response.json()['error'] == 0, f"Wazuh API response status different from 0: {response.json()}"
E       AssertionError: Wazuh API response status different from 0: {'title': 'Wazuh Internal Error', 'detail': 'Timeout executing API request', 'dapi_errors': {'node01': {'error': 'Timeout executing API request', 'logfile': 'WAZUH_HOME/logs/api.log'}}, 'error': 3021}

/usr/local/lib/python3.6/site-packages/wazuh_testing/api.py:128: AssertionError

Since the problem seems to be API related I attach these logs.

damarisg commented 2 years ago

Update 30-07-21

It is the last execution with all changes.

Test Executions	Date	By	Status
FullYellowRemoted.log	2021-07-30	Seyla	🟡
FullYellowRemoted2.log	2021-07-28	Seyla	🟡
FullYellowRemoted3.log	2021-07-28	Seyla	🟡

DProvinciani commented 2 years ago

2021-07-30

Used Wazuh-QA branch: 1530-full-yellow-remoted Test results with the default settings in the ossec.conf:

Test Executions	Date	By	Status
test_remoted-full_r1.log	2021-07-30	Diego	:red_circle:
test_remoted-full_r2.log	2021-07-30	Diego	:yellow_circle:
test_remoted-full_r3.log	2021-07-30	Diego	:yellow_circle:

The first round shows failures that should ve analyzed in the next tests:

FAILED wazuh-qa/tests/integration/test_remoted/test_agent_communication/test_multi_agent_protocols_communication.py::test_multi_agents_protocols_communication[UDP_1514]
FAILED wazuh-qa/tests/integration/test_remoted/test_socket_communication/test_syslog_message.py::test_syslog_message[syslog_UDP_514-dummy_bad_formatted_pri_header]

juliamagan commented 2 years ago

I have launched three rounds of test_remoted/test_configuration/test_basic_configuration_connection_invalid_protocol.py and everything seems to work properly.

Tests results

CentOS 8 Manager	Reports
R1	🟡	Test results
R2	🟡	Test results
R3	🟡	Test results

damarisg commented 2 years ago

02-08-21

Used Wazuh-QA branch: 1530-full-yellow-remoted

Test Executions	By	Status
FullRemoted.log	Seyla	🔴
FullRemoted2.log	Seyla	🔴
FullRemoted3.log	Seyla	🔴

pereyra-m commented 2 years ago

2021-08-02

Wazuh-QA branch: 1530-full-yellow-remoted

The test test_remoted/test_socket_communication/test_syslog_message.py was run isolated on a clean environment many times, but all the tests were successful

Test Executions	Date	By	Status
test_syslog_message_1.log	2021-08-02	Matias	:yellow_circle:
test_syslog_message_2.log	2021-08-02	Matias	:yellow_circle:
test_syslog_message_3.log	2021-08-02	Matias	:yellow_circle:
test_syslog_message_4.log	2021-08-02	Matias	:yellow_circle:
test_syslog_message_5.log	2021-08-02	Matias	:yellow_circle:
test_syslog_message_6.log	2021-08-02	Matias	:yellow_circle:
test_syslog_message_7.log	2021-08-02	Matias	:yellow_circle:
test_syslog_message_8.log	2021-08-02	Matias	:yellow_circle:
test_syslog_message_9.log	2021-08-02	Matias	:yellow_circle:
test_syslog_message_10.log	2021-08-02	Matias	:yellow_circle:

Update

These are the results of the tests with the whole test_remoted folder.

Test Executions	Date	By	Status
test_remoted_1.log	2021-08-02	Matias	:red_circle:
test_remoted_2.log	2021-08-02	Matias	:red_circle:
test_remoted_3.log	2021-08-02	Matias	:yellow_circle:
test_remoted_4.log	2021-08-02	Matias	:yellow_circle:
test_remoted_5.log	2021-08-02	Matias	:yellow_circle:
test_remoted_6.log	2021-08-02	Matias	:yellow_circle:
test_remoted_7.log	2021-08-02	Matias	:yellow_circle:
test_remoted_8.log	2021-08-02	Matias	:red_circle:
test_remoted_9.log	2021-08-02	Matias	:red_circle:
test_remoted_10.log	2021-08-02	Matias	:yellow_circle:

test_remoted_1 failed in test_multi_agent_protocols_communication.py::test_multi_agents_protocols_communication[UDP_1514]
test_remoted_2 failed in test_multi_agent_protocols_communication.py::test_multi_agents_protocols_communication[TCP_1514]
test_remoted_3 failed in test_multi_agent_protocols_communication.py::test_multi_agents_protocols_communication[TCP_1514]
test_remoted_9 failed in test_multi_agent_protocols_communication.py::test_multi_agents_protocols_communication[TCP,UDP_56000]

Update

The remoted folder was tested again, but this time without the -x parameter, so the test won't stop if it fails

Test Executions	Date	By	Status
test_remoted_rerun_1.log	2021-08-02	Matias	:red_circle:
test_remoted_rerun_2.log	2021-08-02	Matias	:yellow_circle:
test_remoted_rerun_3.log	2021-08-02	Matias	:yellow_circle:
test_remoted_rerun_4.log	2021-08-02	Matias	:yellow_circle:
test_remoted_rerun_5.log	2021-08-02	Matias	:yellow_circle:
test_remoted_rerun_6.log	2021-08-02	Matias	:yellow_circle:
test_remoted_rerun_7.log	2021-08-02	Matias	:yellow_circle:
test_remoted_rerun_8.log	2021-08-02	Matias	:yellow_circle:
test_remoted_rerun_9.log	2021-08-02	Matias	:yellow_circle:
test_remoted_rerun_10.log	2021-08-02	Matias	:yellow_circle:

test_remoted_rerun_1.log failed in test_multi_agent_protocols_communication.py::test_multi_agents_protocols_communication[UDP_1514]

Rebits commented 2 years ago

03-08-21

Used Wazuh-QA branch: `1530-full-yellow-remoted`	Test Executions	Status
R1	🔴

DProvinciani commented 2 years ago

I spent some time analyzing the issues in test_multi_agent_protocols_communication.py reported in the next comments:

In all the executions in which we observe failures, they were caused because the Man In The Middle didn't find the event that it was looking for.

Having said that, I analyzed the MITM implementation and how it is used in these tests. Essentially for these tests the MITM will monitor the UDP socket queue/sockets/queue. The way in which it will monitor the queue will be:

Rename the queue to queue.original.
Read the queue socket from the MITM.

Considering the fact that analysisd also reads from that queue, the tests also stop the execution of that daemon durin the test execucion.

From this analysis we can make the next conclusions:

The MITM implementation doesn't work as a MITM in the UDP connections. It is just a change of the receiver.
If the queue receives many events, it could run into the same issue tan the FileMonitor in with it doesn't find the event before achieving the timeout.

We are analyzing the second item to see whether or not is the cause of the issue.

pereyra-m commented 2 years ago

03/08/21

The intermittent failures of test_multi_agents_protocols_communication may be related to a socket saturation, so the test is unable to read the required message.

Many different approaches are being carried out to read the queue socket (strace, kernel probes, etc), so we can probe the message is present but the test failed. Also, even more modules are being removed from the default configuration, to reduce the messages in the socket.

Update

The whole test_remoted folder was run with this new_ossec.conf file. Basically, all the modules are disabled and all the localfile entries were removed. The purpose of this approach is to reduce the socket usage and verify the tests behavior in this situation. Here are the results

Test Executions	Date	By	Status
test_remoted_no_logcollector_1.log	2021-08-03	Matias	:yellow_circle:
test_remoted_no_logcollector_2.log	2021-08-03	Matias	:yellow_circle:
test_remoted_no_logcollector_3.log	2021-08-03	Matias	:yellow_circle:
test_remoted_no_logcollector_4.log	2021-08-03	Matias	:yellow_circle:
test_remoted_no_logcollector_5.log	2021-08-03	Matias	:yellow_circle:
test_remoted_no_logcollector_6.log	2021-08-03	Matias	:yellow_circle:
test_remoted_no_logcollector_7.log	2021-08-03	Matias	:yellow_circle:
test_remoted_no_logcollector_8.log	2021-08-03	Matias	:yellow_circle:
test_remoted_no_logcollector_9.log	2021-08-03	Matias	:yellow_circle:
test_remoted_no_logcollector_10.log	2021-08-03	Matias	:yellow_circle:

The tests results suggest a more consistent result, but more testing is needed to be sure.

DProvinciani commented 2 years ago

03/08/21

Continuing with the analysis made in https://github.com/wazuh/wazuh-qa/issues/1530#issuecomment-891932778, I modified the Wazuh testing framework to print int the console the events received by the Man In The Middle from the queue socket. The changes are implemented in the next branch:

https://github.com/wazuh/wazuh-qa/tree/1618-poc-mitm-monitor

After this, I performed multiples executions of the tests in order to reproduce the issue and confirm the thinking that we have related to the QueueMonitor not analyzing the events as fast as possible to avoid the timeout. We didn't manage to reproduce the issue so, the next step will be to run the tests in multiples environments in paralell.

juliamagan commented 2 years ago

04-08-21

Used Wazuh-QA branch: `1530-full-yellow-remoted`	Test Executions	Status
R1	🔴
R2	🔴
R3	🔴

pereyra-m commented 2 years ago

04/08/21

These are the results of the whole remoted tests folder but disabling all the modules including logcollector. This is a re-run of the tests in this comment.

Test Executions	Date	By	Status
test_remoted_no_logcollector_rerun_1.log	2021-08-04	Matias	:yellow_circle:
test_remoted_no_logcollector_rerun_2.log	2021-08-04	Matias	:yellow_circle:
test_remoted_no_logcollector_rerun_3.log	2021-08-04	Matias	:yellow_circle:
test_remoted_no_logcollector_rerun_4.log	2021-08-04	Matias	:yellow_circle:
test_remoted_no_logcollector_rerun_5.log	2021-08-04	Matias	:red_circle:
test_remoted_no_logcollector_rerun_6.log	2021-08-04	Matias	:yellow_circle:
test_remoted_no_logcollector_rerun_7.log	2021-08-04	Matias	:yellow_circle:
test_remoted_no_logcollector_rerun_8.log	2021-08-04	Matias	:yellow_circle:
test_remoted_no_logcollector_rerun_9.log	2021-08-04	Matias	:yellow_circle:
test_remoted_no_logcollector_rerun_10.log	2021-08-04	Matias	:yellow_circle:

These results may suggest that the socket saturation is a possible cause, but still we can't be sure.

Update

These tests were repeated 18 times, with the same configuration than before minus the authd daemon. It failed 2 times, here are the results: test_remoted_no_logcollector-auth.zip

We are not seeing an improvement in the overall results, we can discard the hypothesis of a socket saturation.

In the other hand, some extra debugging messages were added to the QA framework to analyze the messages that enter the queue and the messages that are read by the monitor.

The test test_multi_agent_protocols_communication.py was run 252 times, and it failed 39 times. Here we have the logs queue_test.zip. We can conclude that

Every time the test passes, both agents send the corresponding logs and both messages are extracted from the queue. Sometimes, one of them is repeated (this may be due to the sender, the agent simulator)
Every time the test fails, the missing message didn't enter the queue.

These results may indicate an issue in the message inyector and not in the QueueMonitor, because the queue is clearly not flooded.

Update

New debug messages were added to check if the sender threads were really sending the messages. The test test_multi_agent_protocols_communication.py was run 41 times, it failed 6 times. Here are the results multi_agent_debug.zip.

We can conclude that every time the test fails, both agent's threads send the corresponding messages. But they get lost in the way and they don't get to the socket. These results discard a failure in the sending threads.

DProvinciani commented 2 years ago

04/08/21

I performed full executions of the remoted tests with some messages printed in order to have more information in case of failures. Although the issue in the test_multi_agent_protocols_communication.py was not reproduced, the executions showed issues in the test test_syslog_message.py that should be analyzed.

Test Executions	Date	By	Status
remoted-full-r1.log	2021-08-04	Diego	:red_circle:
remoted-full-r2.log	2021-08-04	Diego	:red_circle:
remoted-full-r3.log	2021-08-04	Diego	:red_circle:
remoted-full-r4.log	2021-08-04	Diego	:red_circle:

damarisg commented 2 years ago

04-08-21

Test Executions	Date	By	Status
FullRemoted.log	2021-08-04	Seyla	:red_circle:
FullRemoted2.log	2021-08-04	Seyla	:red_circle:
FullRemoted3.log	2021-08-04	Seyla	🟡

damarisg commented 2 years ago

Full Execution - JENKINS

Results Full Remoted

Test Executions	Date	By	Status
Results1	05-08-21	Seyla	🟡
Results2	05-08-21	Seyla	🟡
Results3	05-08-21	Seyla	🟡

pereyra-m commented 2 years ago

05/08/21

Some debugging messages were added to remoted to verify the correct reception and sending of the test messages. Here are the results:

It was possible to see in the failed test that remoted receives both messages, but fails while trying to send one of them to the socket. That is the reason why the QA framework is unable to receive the message.

More tests are being carried out to determine if the ManInTheMiddle approach of the test is responsible for the socket communication error.

Details

These are the messages generated by the test. The threads send the message, but only one of them gets in the queue

wazuh-qa/tests/integration/test_remoted/test_agent_communication/test_multi_agent_protocols_communication.py::test_multi_agents_protocols_communication[UDP_1514] 2021-08-05T13:49:28.1628171368
2021/08/05 13:49:28 wazuh-remoted[547749] debug_op.c:70 at _log(): DEBUG: Logging module auto-initialized
2021/08/05 13:49:28 wazuh-remoted[547749] main.c:130 at main(): DEBUG: Wazuh home directory: /var/ossec
2021/08/05 13:49:28 wazuh-remoted[547749] main.c:148 at main(): DEBUG: This is not a worker
2021-08-05 13:49:59,001 - wazuh_testing - ERROR - Could not find the expected event in queue socket
2021-08-05 13:49:59,002 - wazuh_testing - ERROR - Results accumulated: 0
2021-08-05 13:49:59,002 - wazuh_testing - ERROR - Results expected: 1
Agent simulator sending msg: b'!11075!#AES:\x18^\xe2\xdee\xc9S\x9f\xe2F\xafp\x1b\x9dw\x8c+\xdf@\xc9>\x10iv\xd5\xa4\xffR2~\x9f\x18Y\\\xe3\xd6\xf4\x9b\xad \r\xbd-\xdcD-\xc1\x07/\xd7\x0eg\xeb\x1fJ\xfd\x80\x95\xbe\xcf\x0f\xe2h\xb5\xe6\xf7\x92R\xad\xca.\x0c\x10\xb1!\x98\xb6\r\xba\xce\x02\n\x81\xcf\xa39\x91\xac\xf8P\xc8\x96\xb9\x93l\xe4\xbe*t\xcbU\x04\\6\xb4\xa3F\x06G\x88\xd2^\xc7\x96i\x89K\x81\xf1\xd1\xc1caS\xad1\xe8\x7f'
Agent simulator sending msg: b"!11076!#AES:\xbc\xa0\xd0\xb3^\xeaZl\x10el\xa1-\x058\xf3\xd0\x9b-\xc7\x83\x9b\\\xc2\xf2GC_\xf6\xbaB1\xb6\xd0x~`\xc9\xcc\xea\xcb\xf1\x04\x85\xa9{<\xfa\xa5\xdb\xa3\xbb\xed\x05{\xde\x16\xdf\xe3D\x10\xcf\x89H\xc2\xa4\xb8\x1f\xce\n\t\x8abBUo\x9d\r'9K\x91\xcdp{&\xa3\xc7\xd3\xa6\xe5\xc6\x03\xdf@\xa3\xfaW\xc2\x0ft,\x89\xae^\xb8\xda\xee\x8d0\x89\x93\x06\xd4(3\xdc-\xddT\xd7\xabI\xbe\xaf\xc2'\x96"
Inserting item in queue: b'1:[11076] (6-2948d7f5-debian7) any->/test.log:Feb 23 17:18:20 manager sshd[40657]: test message from agent 11076\x00'
Got msg from queue: b'1:[11076] (6-2948d7f5-debian7) any->/test.log:Feb 23 17:18:20 manager sshd[40657]: test message from agent 11076\x00'
FAILED

Were we have the messages printed by remoted related to the agents 11075 and 11076

2021/08/05 13:49:38 wazuh-remoted[547749] secure.c:547 at HandleSecureMessage(): DEBUG: Remoted received: (1:/test.log:Feb 23 17:18:20 manager sshd[40657]: test message from agent 11075) ([11075] (5-6400c8ef-debian7) any)
2021/08/05 13:49:38 wazuh-remoted[547749] secure.c:547 at HandleSecureMessage(): DEBUG: Remoted received: (1:/test.log:Feb 23 17:18:20 manager sshd[40657]: test message from agent 11076) ([11076] (6-2948d7f5-debian7) any)
2021/08/05 13:49:38 wazuh-remoted[547749] os_net.c:506 at OS_SendUnix(): DEBUG: Sending event to queue: (1:[11076] (6-2948d7f5-debian7) any->/test.log:Feb 23 17:18:20 manager sshd[40657]: test message from agent 11076) (113)

The message for the agent 11075 is never sent to the queue.

Update

The tests were run again, but some extra debugging messages were added in mq_op.c, os_net.c and secure.c ( modified files.zip ). The ossec.log file was saved for every test.

It can be seen that the connection with the socket fails regularly, even in tests that pass. Here is an example: ossec_queue_2.log queue_test_2.log

The messages from the agent 15556 are received and sent right away

2021/08/05 23:45:11 wazuh-remoted[791292] secure.c:547 at HandleSecureMessage(): DEBUG: Remoted received: (1:/test.log:Feb 23 17:18:20 manager sshd[40657]: test message from agent 15556) ([15556] (2-6c8107e9-debian7) any)

2021/08/05 23:45:11 wazuh-remoted[791292] os_net.c:508 at OS_SendUnix(): DEBUG: Sending event to queue: (1:[15556] (2-6c8107e9-debian7) any->/test.log:Feb 23 17:18:20 manager sshd[40657]: test message from agent 15556) (113) 

2021/08/05 23:45:12 wazuh-remoted[791381] mq_op.c:47 at StartMQ(): DEBUG: Connected succesfully to 'queue/sockets/queue' after 0 attempts

But the agent 15555 needs a re-connection to queue to send it successfully

2021/08/05 23:45:11 wazuh-remoted[791292] secure.c:547 at HandleSecureMessage(): DEBUG: Remoted received: (1:/test.log:Feb 23 17:18:20 manager sshd[40657]: test message from agent 15555) ([15555] (1-39ae3654-debian7) any)

2021/08/05 23:45:11 wazuh-remoted[791292] os_net.c:504 at OS_SendUnix(): DEBUG: Failed sending event to queue (OS_SOCKTERR): (1:[15555] (1-39ae3654-debian7) any->/test.log:Feb 23 17:18:20 manager sshd[40657]: test message from agent 15555) (113) 

2021/08/05 23:45:11 wazuh-remoted[791292] mq_op.c:93 at SendMSG(): ERROR: socketerr (not available).

2021/08/05 23:45:11 wazuh-remoted[791292] secure.c:556 at HandleSecureMessage(): ERROR: (1210): Queue 'queue/sockets/queue' not accessible: 'Connection refused'

2021/08/05 23:45:11 wazuh-remoted[791292] mq_op.c:47 at StartMQ(): DEBUG: Connected succesfully to 'queue/sockets/queue' after 0 attempts

2021/08/05 23:45:11 wazuh-remoted[791292] secure.c:561 at HandleSecureMessage(): INFO: Successfully reconnected to 'queue/sockets/queue'

2021/08/05 23:45:11 wazuh-remoted[791292] os_net.c:508 at OS_SendUnix(): DEBUG: Sending event to queue: (1:[15555] (1-39ae3654-debian7) any->/test.log:Feb 23 17:18:20 manager sshd[40657]: test message from agent 15555) (113)

This behavior is found in every test, and this explains why some messages were found duplicated: it seems that the messages arrives regardless the errors

wazuh-qa/tests/integration/test_remoted/test_agent_communication/test_multi_agent_protocols_communication.py::test_multi_agents_protocols_communication[TCP_1514] 2021/08/05 23:45:01 wazuh-remoted[791292] debug_op.c:70 at _log(): DEBUG: Logging module auto-initialized
2021/08/05 23:45:01 wazuh-remoted[791292] main.c:130 at main(): DEBUG: Wazuh home directory: /var/ossec
2021/08/05 23:45:01 wazuh-remoted[791292] main.c:148 at main(): DEBUG: This is not a worker
Agent simulator sending msg: b'!15555!#AES:Z\x01\xaa\xda\xe1Y\x8bE\x0455\xa7y\x14\x87\xbeq\xdce\xffp\x9e\x7f\xc9\xcd\xb9<\x80\xb5\xbez\xc7\xcah\xf6\xcd\x04\xbc\xe6\xa8\x81\xf0\x9a\xec%Y\xfap\x9e\xf8\xee\x1ei\xbd\xfa L\x06Aqj;2l\x0ew*\xd4\x10\x1b/\xf4\xa5.\x8c6\xce\xdd"\x86@\x7fbZ]\x03\xac`\x9f\xd6\x08A\xf2j\x18\x9e\xbfS\xf8\x8f\x91\x9d\x07\\H\x06\xcf@\x8ak\t\xaa\xf17\xb1\xab\x86)]\xf13\xdd\xd8&\'|`\x15'
Inserting item in queue: b'1:[15555] (1-39ae3654-debian7) any->/test.log:Feb 23 17:18:20 manager sshd[40657]: test message from agent 15555\x00'
Agent simulator sending msg: b'!15556!#AES:\xe7\xe0\xe2\t\xccN\xfdg\x81\x1e\xdd\xb4)|\x1eOn\x94_\x13\x1a\xde\xd8(hk\x14\xd0\xcd\xac#\xa2.>\xbai\x90\xef\xd3>\xa6\xb2\xe9l\x16\x88\x96\xabHP\xaa\xe0\x06\xaa&\x851\x99\x1d\x1e\xb90\xd8\xc0\xab\x1b\xd4\x1cs)\x11\x93\xd0\xea,\xccY\xac=\xe7n\xb6G\xafv\x90\xecM\xb5VO\x9e\x84\x19\x10#\x01\xbe\xe9>\xecx~\xbc\x9a\x1f\\\x18j\xae\xf6\xbda\\\x08\xe8\xda\xd7"$a\xa3=\x90\xa1\xe3\x00\x16'
Inserting item in queue: b'1:[15556] (2-6c8107e9-debian7) any->/test.log:Feb 23 17:18:20 manager sshd[40657]: test message from agent 15556\x00'
Got msg from queue: b'1:[15555] (1-39ae3654-debian7) any->/test.log:Feb 23 17:18:20 manager sshd[40657]: test message from agent 15555\x00'
Got msg from queue: b'1:[15555] (1-39ae3654-debian7) any->/test.log:Feb 23 17:18:20 manager sshd[40657]: test message from agent 15555\x00'
Got msg from queue: b'1:[15556] (2-6c8107e9-debian7) any->/test.log:Feb 23 17:18:20 manager sshd[40657]: test message from agent 15556\x00'
PASSED

Failed test case analysis

Analyzing a failed case (ossec_queue_87.log, queue_test_87.log) we can see that the message from the agent 16583 is found but the one sent by 16584 is not

wazuh-qa/tests/integration/test_remoted/test_agent_communication/test_multi_agent_protocols_communication.py::test_multi_agents_protocols_communication[TCP,UDP_1514] 2021-08-06T01:21:58.1628212918
2021/08/06 01:21:58 wazuh-remoted[841391] debug_op.c:70 at _log(): DEBUG: Logging module auto-initialized
2021/08/06 01:21:58 wazuh-remoted[841391] main.c:130 at main(): DEBUG: Wazuh home directory: /var/ossec
2021/08/06 01:21:58 wazuh-remoted[841391] main.c:148 at main(): DEBUG: This is not a worker
2021-08-06 01:22:38,684 - wazuh_testing - ERROR - Could not find the expected event in queue socket
2021-08-06 01:22:38,685 - wazuh_testing - ERROR - Results accumulated: 0
2021-08-06 01:22:38,685 - wazuh_testing - ERROR - Results expected: 1
Agent simulator sending msg: b'!16583!#AES:\x84\xfc\xfa\xa9Y\xf4\xa5EC\x80\x90U0\x05\xcaA\xb3\xafO\x96\xcd\xb9\xcf\xef\xc2\x998x7\x8d\x88\x97=\xc1\x04\xd8\xb2\x91C&\xd2\x87\x86c\xabOBH\xc1\x05\x18\x9f\x80\xaeq\xa5 \x0e\xd1XN\xd0\xf2^F\x07\x8d\xc7\xf7B*\xfc\\K\xc2f\xe7F\x17\xd1I\x9a\x12m\x04\x8c<\xea\xfa\x1f\xb4G\x18\x0c\xdc\xb2\rTZ\xfaL\xd0\xc7I\xdb\x83\'\x19\x9ek\x82~G\x80"\xb8\xdb#v\x19\xf3\x16^x)v\xaa\xbe'
Agent simulator sending msg: b'!16584!#AES:X\xd4\xb3\x1cA`\xd3t\xeb+\xeb\x9a\xee=!\xc0\x84\xaf\xd2xj!\xcaL\xbb\xf7\xc9G=\x84\x17\x016\x8f,\xd4\xe2]~1\xf1^\x9e \xd9\xab.\xd0C-\x834aF\xfa\x8b\xf3\xe8\xc5\x0fgk>\xea\x96\x1a\xb2A-\xc1\xc2*\x03\xfa\xa4\xdc\xac\x90U@m\x1f\xd2\xb4\xf1\xec\xdd\x16I\xca\x86%\x17\x0b\xcd,\x01\xe7\xa7#S\xd6\t\xd7\x136\x01\xe8\xb7Y~i\xb4\xc3\xb3d:\xd7GO\x00\xad=\xda>\xc8\x9c\x0e'
Inserting item in queue: b'1:[16583] (9-27f5b5fb-debian7) any->/test.log:Feb 23 17:18:20 manager sshd[40657]: test message from agent 16583\x00'
Got msg from queue: b'1:[16583] (9-27f5b5fb-debian7) any->/test.log:Feb 23 17:18:20 manager sshd[40657]: test message from agent 16583\x00'
Got msg from queue: b'1:[16583] (9-27f5b5fb-debian7) any->/test.log:Feb 23 17:18:20 manager sshd[40657]: test message from agent 16583\x00'
FAILED

But the message to queue is sent by remoted

2021/08/06 01:22:08 wazuh-remoted[841391] secure.c:547 at HandleSecureMessage(): DEBUG: Remoted received: (1:/test.log:Feb 23 17:18:20 manager sshd[40657]: test message from agent 16583) ([16583] (9-27f5b5fb-debian7) any)
2021/08/06 01:22:08 wazuh-remoted[841391] os_net.c:504 at OS_SendUnix(): DEBUG: Failed sending event to queue (OS_SOCKTERR): (1:[16583] (9-27f5b5fb-debian7) any->/test.log:Feb 23 17:18:20 manager sshd[40657]: test message from agent 16583) (113) 
2021/08/06 01:22:08 wazuh-remoted[841391] mq_op.c:93 at SendMSG(): ERROR: socketerr (not available).
2021/08/06 01:22:08 wazuh-remoted[841391] secure.c:547 at HandleSecureMessage(): DEBUG: Remoted received: (1:/test.log:Feb 23 17:18:20 manager sshd[40657]: test message from agent 16584) ([16584] (10-5079b3c4-debian7) any)
2021/08/06 01:22:08 wazuh-remoted[841391] secure.c:556 at HandleSecureMessage(): ERROR: (1210): Queue 'queue/sockets/queue' not accessible: 'Connection refused'
2021/08/06 01:22:08 wazuh-remoted[841391] os_net.c:508 at OS_SendUnix(): DEBUG: Sending event to queue: (1:[16584] (10-5079b3c4-debian7) any->/test.log:Feb 23 17:18:20 manager sshd[40657]: test message from agent 16584) (114) 
2021/08/06 01:22:08 wazuh-remoted[841391] mq_op.c:47 at StartMQ(): DEBUG: Connected succesfully to 'queue/sockets/queue' after 0 attempts
2021/08/06 01:22:08 wazuh-remoted[841391] secure.c:561 at HandleSecureMessage(): INFO: Successfully reconnected to 'queue/sockets/queue'
2021/08/06 01:22:08 wazuh-remoted[841391] os_net.c:508 at OS_SendUnix(): DEBUG: Sending event to queue: (1:[16583] (9-27f5b5fb-debian7) any->/test.log:Feb 23 17:18:20 manager sshd[40657]: test message from agent 16583) (113) 
2021/08/06 01:22:39 wazuh-remoted[841560] mq_op.c:47 at StartMQ(): DEBUG: Connected succesfully to 'queue/sockets/queue' after 0 attempts

Conclusion

The queue interception is unreliable: sometimes the manager fails to send a message but it arrives anyway, and sometimes it successfully sends a message but it never arrives to the socket.

The best option is to change the test structure, and validate the message received by remoted with another method: the agent's status update, an alert generation, etc.

pereyra-m commented 2 years ago

A new issue was created with enhancement proposal #1690.

The debugging process of the failed tests would be much faster if the required information were easily available.

damarisg commented 2 years ago

TESTS EXECUTION

11-08-21

Test Executions	Date	By	Status	Fails
ResultPedroRemoted1.2.log	11-08-21	Seyla	🟡
ResultPedroRemoted1.1.log	11-08-21	Seyla	🟡
ResultPedroRemoted.log	11-08-21	Seyla	🟡
ResultPedroRemoted2.log	11-08-21	Seyla	🔴	`test_agent_communication/test_agent_version_shared_configuration_startup_message`
ResultPedroRemoted3.log	11-08-21	Seyla	🔴	`test_agent_communication/test_request_agent_info.py` (related to https://github.com/wazuh/wazuh-qa/issues/1698) `test_agent_communication/test_agent_version_shared_configuration_startup_message`
ResultPedroRemoted4.log	11-08-21	Seyla	🟡
FullRemoted.log	11-08-21	Seyla	🟢
Remoted2.log	11-08-21	Seyla	🔴	`test_socket_communication/test_syslog_message`
Remoted3.log	11-08-21	Seyla	🔴	`test_agent_communication/test_agent_version_shared_configuration_startup_message`

Note: Comments deleted was added bellow in a common table.

10-08-21

Test Executions	Date	By	Status
ResultFullRemoted.log	09-08-21	Seyla	🟡
ResultFullRemoted2.log	09-08-21	Seyla	🟡
ResultFullRemoted3.log	09-08-21	Seyla	🟡
ResultFullRemoted4.log	09-08-21	Diego	🟡
ResultFullRemoted5.log	09-08-21	Diego	🟡
ResultFullRemoted6.log	09-08-21	Diego	🟡
ResultFullRemoted7.log	09-08-21	Matias	🟡
ResultFullRemoted8.log	09-08-21	Matias	🟡
ResultFullRemoted9.log	09-08-21	Matias	🟡
ResultFullRemoted10.log	10-08-21	Seyla	🟡
ResultFullRemoted11.log	10-08-21	Seyla	🟡
ResultFullRemoted12.log	10-08-21	Seyla	🟡

08-09-21

All the remoted tests were run after merging the fix in #1692. All of them passed.

Test Executions	Date	By	Status
test_remoted-r1.log	09-08-21	Diego	:yellow_circle:
test_remoted-r2.log	09-08-21	Diego	:yellow_circle:
test_remoted-r3.log	09-08-21	Diego	:yellow_circle:
ResultFullRemoted.log	09-08-21	Seyla	🟡
ResultFullRemoted2.log	09-08-21	Seyla	🟡
ResultFullRemoted3.log	09-08-21	Seyla	🔴 `test_active_response_send_ar.py::test_active_response_ar_sending`
ResultFullRemoted4.log	09-08-21	Seyla	🔴 `test_multi_agent_status.py::test_protocols_communication` `test_basic_configuration_queue_size_valid.py::test_queue_size_valid`
ResultFullRemoted5.log	09-08-21	Seyla	🟡
ResultFullRemoted7.log	09-08-21	Seyla	🟡
ResultFullRemoted6.log	09-08-21	Seyla	🔴 `test_multi_agent_status.py::test_protocols_communication`
ResultFullRemoted8.log	09-08-21	Seyla	🔴 `test_active_response_send_ar.py::test_active_response_ar_sending`
ResultFullRemoted9.log	09-08-21	Seyla	🔴 `test_multi_agent_status.py::test_protocols_communication`

06-08-21

All the remoted tests were run after the merging the fix in #1692. All of them passed.

Test Executions	Date	By	Status
test_full_remoted_after_fix_1.log	2021-08-06	Matias	:yellow_circle:
test_full_remoted_after_fix_2.log	2021-08-06	Matias	:yellow_circle:
test_full_remoted_after_fix_3.log	2021-08-06	Matias	:yellow_circle:
test_full_remoted_after_fix_4.log	2021-08-06	Matias	:yellow_circle:
test_full_remoted_after_fix_5.log	2021-08-06	Matias	:yellow_circle:

snaow commented 2 years ago

Closed by https://github.com/wazuh/wazuh-qa/pull/1717

wazuh / wazuh-qa

Full Yellow Local - Fix Remoted tests in 4.2 #1530

Description

Test Execution - Results

Settings

Packages details

Environment

Setup Local_internal_opcion.conf

Development

Update including changes of #1624 and #1625

2021-07-27

2021-07-28 Update including changes of #1662

Update 28-07-21

2021-07-28

Update 28-07-21

2021-07-30

Update 30-07-21

2021-07-30

Tests results

02-08-21

2021-08-02

Update

Update

03-08-21

03/08/21

Update

03/08/21

04-08-21

04/08/21

Update

Update

04/08/21

04-08-21

Full Execution - JENKINS

Results Full Remoted

05/08/21

Details

Update

Failed test case analysis

Conclusion

TESTS EXECUTION

11-08-21

10-08-21

08-09-21

06-08-21