Closed fcaffieri closed 4 weeks ago
I've tested the workflow's threads parameter with two workflows. I've used Visual Studio Code to launch the workflow_engine.
To run this workflow, I've used the branch fix/5197-logger-config to get a cleaner log.
Managers | Agents | Allocation (Time) | Provision (Time) | Testing (Time) | Cleanup (Time) | Total (Time) |
---|---|---|---|---|---|---|
1 | 12 | 00:46 | 05:43 | 03:20 | 01:41 | 11:11 |
Sometimes, when I run workflows, whether I use the threads parameter or not, the allocation and clean-up tasks are executed OK, but the workflow does not perform the provision and the testing tasks. Here's a log file that shows the execution:
I could not find a pattern to reproduce the behavior.
@mhamra you could run an execution with the following input and parameters:
I've run the workflow file provided by @fcaffieri with the option --threads and without it.
Here are the results:
file | threads 1 | threads 12 |
---|---|---|
command console output | not saved | console-log-threads-12.txt |
workflow log file | workflow-threads-1.log | workflow-threads-12.log |
dry-run log | workflow-dry-run-1.log | workflow-dry-run-12.log |
The dry-run log file comparison shows that the dependencies are equivalent. All testing tasks have these dependencies:
test agent -> allocate agent -> provision manager -> allocate manager. The files only differ in the order of execution of the test tasks.
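The dependency chain above can be sketched with Python's standard `graphlib` module. The task names below are shortened, hypothetical stand-ins for the ones in the workflow file. The point is only that any valid execution order keeps each test after its agent allocation and after the manager provision, while the relative order of the test tasks is free to vary.

```python
from graphlib import TopologicalSorter

# Hypothetical, shortened task names; the real workflow uses names such as
# "provision-manager-linux-ubuntu-22.04-amd64".
AGENTS = ["debian-10", "ubuntu-18.04", "oracle-9"]


def build_dependencies(agents):
    """Map each task to the set of tasks it depends on."""
    deps = {
        "allocate-manager": set(),
        "provision-manager": {"allocate-manager"},
    }
    for agent in agents:
        # test agent -> allocate agent -> provision manager -> allocate manager
        deps[f"allocate-agent-{agent}"] = {"provision-manager"}
        deps[f"test-agent-{agent}"] = {f"allocate-agent-{agent}"}
    return deps


def execution_order(deps):
    """Return one valid execution order (dependencies come first)."""
    return list(TopologicalSorter(deps).static_order())


if __name__ == "__main__":
    for task in execution_order(build_dependencies(AGENTS)):
        print(task)
```

Two runs that share this dependency map are equivalent in the sense used above, even if the test tasks are listed in a different order.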
run-agent-linux-debian-10-amd64-tests started at 20:10:42,948, 4:42 minutes after the manager provision task. All the other run-agent tasks began almost at the same time.
run-agent-linux-oracle-9-amd64-tests does not appear in the 12-thread workflow log file, but it does appear in the one-thread workflow log file, because the Oracle Linux agent allocation failed.
The manager provisioning task runs the initialization script and exits without waiting for the manager installation to finish. While this approach is good because it allows other tasks to continue in parallel, the agent tests should start only after the manager installation finishes. To check that the manager installation has finished and the manager is up and running, I think that we can implement one of these options:
I've modified the workflow, adding a task that waits 5 minutes after provisioning the manager. All the tests depend on that task and the agent allocation.
file | threads 12 with wait |
---|---|
command console output | console-log-threads-12-with-wait.txt |
workflow log file | workflow-threads-12-with-wait.log |
dry-run log | workflow-dry-run-12-with-wait.log |
After adding a delay task, some agent tests finished well, but others failed. The delay may have been shorter than needed to finish the manager installation.
Analyzing the executions and the logs, I noticed that the problem occurs because the provision cannot connect to the VM created by the allocation, which throws the error:
TASK [Gathering Facts] *********************************************************
fatal: [ec2-3-80-51-50.compute-1.amazonaws.com]: UNREACHABLE! => changed=false
  msg: 'Failed to connect to the host via ssh: ssh: connect to host ec2-3-80-51-50.compute-1.amazonaws.com port 2200: Connection refused'
  unreachable: true
PLAY RECAP *********************************************************************
ec2-3-80-51-50.compute-1.amazonaws.com : ok=0 changed=0 unreachable=1 failed=0 skipped=0 rescued=0 ignored=0
[2024-04-18 20:06:00] [DEBUG] PROVISIONER: Playbook {'hosts': 'ec2-3-80-51-50.compute-1.amazonaws.com', 'become': True, 'gather_facts': True, 'tasks': [{'name': 'Install the required packages', 'shell': ''}, {'name': 'Download the Wazuh installation assistant', 'shell': 'curl -sO https://packages.wazuh.com/4.7/wazuh-install.sh'}, {'name': 'Install wazuh-manager with assistant', 'shell': 'bash ./wazuh-install.sh -a -i'}]} finished with status {'skipped': {}, 'ok': {}, 'dark': {'ec2-3-80-51-50.compute-1.amazonaws.com': 1}, 'failures': {}, 'ignored': {}, 'rescued': {}, 'processed': {'ec2-3-80-51-50.compute-1.amazonaws.com': 1}, 'changed': {}}
[2024-04-18 20:06:00] [INFO] PROVISIONER: Provision of "wazuh-manager" complete successfully.
[2024-04-18 20:06:00] [INFO] PROVISIONER: All components provisioned successfully.
[2024-04-18 20:06:00] [DEBUG] PROVISIONER: Provision summary: {'skipped': {}, 'ok': {}, 'dark': {'ec2-3-80-51-50.compute-1.amazonaws.com': 1}, 'failures': {}, 'ignored': {}, 'rescued': {}, 'processed': {'ec2-3-80-51-50.compute-1.amazonaws.com': 1}, 'changed': {}}
[2024-04-18 20:06:00,730] [INFO] [3958] [ThreadPoolExecutor-0_0] [workflow_engine]: [provision-manager-linux-ubuntu-22.04-amd64] Finished task in 2.15 seconds.
[2024-04-18 20:06:00,740] [INFO] [3958] [ThreadPoolExecutor-0_0] [workflow_engine]: [allocate-agent-linux-ubuntu-18.04-amd64] Starting task.
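In the log above, the provision summary lists the host under 'dark' (the key that holds the unreachable host), while the PROVISIONER lines still report success. A minimal sketch of how a summary with this shape could be checked before reporting success follows. `provision_succeeded` is a hypothetical helper and is not part of the provisioner module.

```python
def provision_succeeded(summary):
    """Return True only if no host is listed as unreachable or failed."""
    # 'dark' holds unreachable hosts and 'failures' holds failed hosts,
    # following the shape of the summary printed in the log.
    return not summary.get("dark") and not summary.get("failures")


# Summary copied from the log: the host is unreachable ('dark').
summary = {
    "skipped": {}, "ok": {},
    "dark": {"ec2-3-80-51-50.compute-1.amazonaws.com": 1},
    "failures": {}, "ignored": {}, "rescued": {},
    "processed": {"ec2-3-80-51-50.compute-1.amazonaws.com": 1},
    "changed": {},
}

print(provision_succeeded(summary))  # False
```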
The problem is that the VM, even though the allocator has already created it, is not yet reachable in AWS. This was fixed in the allocation module in this issue: https://github.com/wazuh/wazuh-qa/issues/5198. A connection check against the newly created VM was added to guarantee that the VM is available before the allocation completes.
With that fix in place, tests were run both with and without threads, and the results were satisfactory:
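A minimal sketch of such a connection check, assuming a plain TCP probe of the SSH port is enough to decide that the VM is reachable. `wait_for_port`, its parameters, and the default values are hypothetical and are not taken from the allocation module.

```python
import socket
import time


def wait_for_port(host, port, timeout=300.0, interval=5.0):
    """Poll host:port until a TCP connection succeeds or the timeout expires.

    Returns True if the port accepted a connection, False otherwise.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused, timed out, or host not resolvable yet.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

For the host in the log above, the call would look like `wait_for_port("ec2-3-80-51-50.compute-1.amazonaws.com", 2200)`, with the allocation reported as complete only when it returns True.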
file | 3 threads |
---|---|
log | workflow.log |
input | test_agent.yaml.txt |
file | 1 thread |
---|---|
log | workflow.log |
input | test_agent.yaml.txt |
LGTM!
LGTM
Description
The objective of this issue is to fix the bug that appears when the workflow is executed with threads. Currently, when running an execution with several VMs, random errors occur if threads are used. If the same execution is performed without the threads parameter, it works correctly.
Tasks