Closed roronoasins closed 1 year ago
Moved to framework team.
This issue could be a duplicate of https://github.com/wazuh/wazuh/issues/18123.
There is one extra test failing in the artifacts uploaded here (test_jwt_invalidation/test_revoke_endpoint.py), so we should wait for the other issue to be resolved and double-check before closing this one.
After running the test multiple times, it was found that the error is consistent and the message is always the same:
def check_result(self):
    """Check if a TimeoutError occurred."""
    logger.debug(f'Checking results...')
    while not self._queue.empty():
        result = self._queue.get(block=True)
        for host, msg in result.items():
            if isinstance(msg, TimeoutError):
>               raise msg
E   TimeoutError: Did not found the expected callback in wazuh-master: .*Command received: b'cancel_task'.*
The error occurs in the _test_zip_sizelimit test.
After reviewing the cluster.log of all components, no error messages were found. After debugging the test, it was found that the error occurs in the HostMonitor class when executing the run method.
Further debugging of the run method showed that the timeout occurs because no available handler is found for the file monitor; the loop gets stuck there until the timeout expires.
The next step would be to replicate the behavior of the test locally to determine whether the error is in the test or in the cluster.
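The waiting behavior described above (a monitor polling a log file for an expected callback and raising TimeoutError when it never appears) can be sketched roughly as follows. This is a minimal illustration, not the actual HostMonitor implementation; the function name, polling interval, and timeout values are assumptions:

```python
import re
import time

def wait_for_callback(log_path, pattern, timeout=60, poll=1):
    """Tail a log file until a line matches `pattern` or `timeout` expires.

    Mimics the behavior seen while debugging: if no matching line is ever
    written, the loop spins until the deadline and raises TimeoutError,
    which is what check_result() later re-raises.
    """
    regex = re.compile(pattern)
    deadline = time.time() + timeout
    position = 0  # remember how far into the file we have read
    while time.time() < deadline:
        with open(log_path) as log:
            log.seek(position)
            for line in log:
                if regex.search(line):
                    return line
            position = log.tell()
        time.sleep(poll)
    raise TimeoutError(f'Did not find the expected callback: {pattern}')
```

With this sketch, the failing test corresponds to calling `wait_for_callback(cluster_log, r".*Command received: b'cancel_task'.*")` against a log that never receives that command.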
After further debugging the tests, it was confirmed that cluster.json changes when running the tests, as expected.
Default config found inside cluster.json before the tests:
"communication": {
"timeout_cluster_request": 20,
"timeout_dapi_request": 200,
"timeout_receiving_file": 120,
"max_zip_size": 1073741824,
"min_zip_size": 31457280,
"compress_level": 1,
"zip_limit_tolerance": 0.2
}
During the tests:
"communication": {
"timeout_cluster_request": 20,
"timeout_dapi_request": 200,
"timeout_receiving_file": 1,
"max_zip_size": 52428800,
"min_zip_size": 15728640,
"compress_level": 0,
"zip_limit_tolerance": 0.2
}
As defined in the test config:
---
timeout_receiving_file: 1
max_zip_size: 52428800 # 50 MB
min_zip_size: 15728640 # 15 MB
compress_level: 0
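The test YAML above only lists the keys it wants to change; the effective "communication" block during the tests is the default one with those keys replaced. A quick sketch of that merge (a plain dict update for illustration; the actual mechanism lives in the QA framework, not here):

```python
import json

# Default "communication" block from cluster.json (values from this issue).
default_communication = {
    "timeout_cluster_request": 20,
    "timeout_dapi_request": 200,
    "timeout_receiving_file": 120,
    "max_zip_size": 1073741824,
    "min_zip_size": 31457280,
    "compress_level": 1,
    "zip_limit_tolerance": 0.2,
}

# Overrides declared in the test's configuration file.
test_overrides = {
    "timeout_receiving_file": 1,
    "max_zip_size": 52428800,   # 50 MB
    "min_zip_size": 15728640,   # 15 MB
    "compress_level": 0,
}

# Overrides win; untouched keys keep their defaults.
effective = {**default_communication, **test_overrides}
print(json.dumps(effective, indent=4))
```

The printed result matches the "During the tests" block shown above, which confirms the config swap itself is working.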
Trying to find the expected cancel_task message in the manager cluster.log only returns information from another module.
root@wazuh-master:/var/ossec/logs# grep 'cancel_task' cluster.log
2023/08/07 19:07:30 DEBUG: [Worker wazuh-worker1] [SendSync] Received request: b'wazuh-worker1*494570 {"daemon_name":"task-manager","message":{"origin":{"name":"wazuh-worker1","module":"upgrade_module"},"command":"upgrade_cancel_tasks","parameters":{}}}'
2023/08/07 19:07:52 DEBUG: [Worker wazuh-worker2] [SendSync] Received request: b'wazuh-worker2*104217 {"daemon_name":"task-manager","message":{"origin":{"name":"wazuh-worker2","module":"upgrade_module"},"command":"upgrade_cancel_tasks","parameters":{}}}'
2023/08/07 19:12:07 DEBUG: [Worker wazuh-worker1] [SendSync] Received request: b'wazuh-worker1*523620 {"daemon_name":"task-manager","message":{"origin":{"name":"wazuh-worker1","module":"upgrade_module"},"command":"upgrade_cancel_tasks","parameters":{}}}'
2023/08/07 19:12:28 DEBUG: [Worker wazuh-worker2] [SendSync] Received request: b'wazuh-worker2*926534 {"daemon_name":"task-manager","message":{"origin":{"name":"wazuh-worker2","module":"upgrade_module"},"command":"upgrade_cancel_tasks","parameters":{}}}'
The cancel_task command is nowhere to be found, meaning the signal is not being properly sent.
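To be precise about why those grep hits don't count: the test waits for the literal pattern `.*Command received: b'cancel_task'.*`, and the SendSync lines above only contain `upgrade_cancel_tasks` (an unrelated upgrade_module request). A quick check, using one abridged line from the log above and the line that appears later once the limits are raised:

```python
import re

# Pattern the test's file monitor is waiting for (from the TimeoutError message).
pattern = re.compile(r".*Command received: b'cancel_task'.*")

# Abridged SendSync line actually present in cluster.log at this point.
sendsync_line = ('2023/08/07 19:07:30 DEBUG: [Worker wazuh-worker1] [SendSync] '
                 'Received request: b\'..."command":"upgrade_cancel_tasks"...\'')

# The line the test really needs (seen later in this issue).
expected_line = ("2023/08/08 12:58:21 DEBUG: [Worker wazuh-worker2] [Main] "
                 "Command received: b'cancel_task'")

print(bool(pattern.search(sendsync_line)))   # the SendSync line does not match
print(bool(pattern.search(expected_line)))   # the real callback line does
```

So the grep results are a false lead: they match the substring `cancel_task` but not the callback the monitor is waiting for.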
After further testing and debugging, it was found that the test fails because the sync succeeds instead of failing as the test expects; this can happen because of the resources allocated when running locally.
After increasing the configured max size limit from 50 MB to 100 MB and the number of files created from 5 to 10, the timeout occurs and the cancel_task command is properly sent from the workers and received by the master.
configuration.yaml:
---
timeout_receiving_file: 1
max_zip_size: 104857600 # 100 MB
min_zip_size: 15728640 # 15 MB
compress_level: 0
test_integrity_sync, number of files increased to 10:
big_filenames = {file_prefix + str(i) for i in range(10)}
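For reference, that comprehension simply produces ten distinct file names. A quick demonstration (the real `file_prefix` is defined in the test; `'big_file_'` here is an assumption for illustration):

```python
# 'big_file_' is a hypothetical prefix; the test defines its own file_prefix.
file_prefix = 'big_file_'

# Same comprehension as in test_integrity_sync, now spanning range(10)
# instead of range(5), doubling the synced payload.
big_filenames = {file_prefix + str(i) for i in range(10)}
print(sorted(big_filenames))
```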
cluster.log
2023/08/08 12:58:21 DEBUG: [Worker wazuh-worker2] [Main] Command received: b'cancel_task'
2023/08/08 12:58:25 DEBUG: [Worker wazuh-worker1] [Main] Command received: b'cancel_task'
(system-test-env) eduardoleon@pop-os:~/git/wazuh-qa/tests/system/test_cluster$ pytest test_integrity_sync/
==================================== test session starts =====================================
platform linux -- Python 3.9.16, pytest-6.2.2, py-1.10.0, pluggy-0.13.1
rootdir: /home/eduardoleon/git/wazuh-qa/tests/system, configfile: pytest.ini
plugins: testinfra-5.0.0, metadata-2.0.4, html-3.1.1
collected 1 item
test_integrity_sync/test_integrity_sync.py . [100%]
=============================== 1 passed in 244.11s (0:04:04) ================================
(system-test-env) eduardoleon@pop-os:~/git/wazuh-qa/tests/system/test_cluster$ pytest test_integrity_sync/
==================================== test session starts =====================================
platform linux -- Python 3.9.16, pytest-6.2.2, py-1.10.0, pluggy-0.13.1
rootdir: /home/eduardoleon/git/wazuh-qa/tests/system, configfile: pytest.ini
plugins: testinfra-5.0.0, metadata-2.0.4, html-3.1.1
collected 1 item
test_integrity_sync/test_integrity_sync.py . [100%]
=============================== 1 passed in 260.82s (0:04:20) ================================
(system-test-env) eduardoleon@pop-os:~/git/wazuh-qa/tests/system/test_cluster$ pytest test_integrity_sync/
==================================== test session starts =====================================
platform linux -- Python 3.9.16, pytest-6.2.2, py-1.10.0, pluggy-0.13.1
rootdir: /home/eduardoleon/git/wazuh-qa/tests/system, configfile: pytest.ini
plugins: testinfra-5.0.0, metadata-2.0.4, html-3.1.1
collected 1 item
test_integrity_sync/test_integrity_sync.py . [100%]
=============================== 1 passed in 250.99s (0:04:10) ================================
Description
During the Pre-Alpha 1 system tests, it was found that the test_cluster/test_integrity_sync/test_integrity_sync.py system test for the agentless_cluster environment has flaky behavior. Sometimes a timeout error appears, and in other cases, a file that should be created is missing.
Evidences
integrity_sync_reports.zip