wazuh / wazuh-qa

Wazuh - Quality Assurance
GNU General Public License v2.0

Not all agents get disconnected in `test_shutdown_message` #5199

Closed (juliamagan closed 1 week ago)

juliamagan commented 1 month ago

Description

During the system tests launched for 4.8.0 Beta 5 at https://github.com/wazuh/wazuh/issues/22824, we found that not all agents go offline:

E       AssertionError: assert 33 == 40
E        +  where 33 = len(['Disconnected', 'Disconnected', 'Disconnected', 'Disconnected', 'Disconnected', 'Disconnected', ...])

This test was modified recently, so we should check whether the waiting time before the check is as expected. If it is, then even though no errors appear in the managers, the failure could indicate a performance problem, since the error is consistent across several executions.

juliamagan commented 1 month ago

After talking to @TomasTurina, we found that when an agent stops, it sends HC_SHUTDOWN to the manager, which immediately shows the agent as Disconnected. However, the logs show that the manager receives 50 to 52 shutdown messages when there are only 40 agents. We need to check whether some of these are old messages or whether some messages are being duplicated. Also, the test uses thread.join() to wait for all the agents to be stopped, so all the agents should appear as Disconnected.
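One way to tell old messages from duplicated ones is to count shutdown messages per agent. The following is only a minimal sketch: it assumes the agent IDs have already been extracted from the manager log (one entry per shutdown message), and `find_duplicate_shutdowns` is a hypothetical helper, not part of wazuh-qa.

```python
from collections import Counter


def find_duplicate_shutdowns(agent_ids):
    """Return {agent_id: count} for agents seen in more than one shutdown message."""
    counts = Counter(agent_ids)
    return {agent_id: n for agent_id, n in counts.items() if n > 1}


# Hypothetical input: one agent ID per shutdown message found in the log.
seen = ["001", "002", "002", "003", "001", "001"]
print(find_duplicate_shutdowns(seen))  # {'001': 3, '002': 2}
```

If the result is empty but the total still exceeds the number of agents, the extra messages would have to come from agents outside the current test, which points to leftovers from an earlier run.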

juliamagan commented 4 weeks ago

By monitoring the logs and the agent statuses, we saw that the test started while some agents were not yet Active, which could affect the results. Logic has been added to prevent this, and we are now testing how much time all the agents need to become Active.
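The kind of guard described above could look roughly like the polling helper below. This is an illustrative sketch, not the code added to the test: `wait_for_all_active` and the `get_statuses` callable are hypothetical names, and the timeout and interval defaults are placeholders, since the real waiting time is still being measured.

```python
import time


def wait_for_all_active(get_statuses, expected_count, timeout=120.0, interval=2.0):
    """Poll until expected_count agents report 'Active' or the timeout expires.

    get_statuses is a hypothetical callable that returns a list of status
    strings, one per agent. Returns True on success, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        statuses = get_statuses()
        active = sum(1 for status in statuses if status.lower() == "active")
        if active >= expected_count:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

The test would call this before stopping the agents and fail early with a clear message if it returns False, rather than reporting a misleading count of Disconnected agents later.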

juliamagan commented 3 weeks ago

On hold due to Beta 6 testing

juliamagan commented 2 weeks ago

With the proposed solution, the test passes without problems when launched individually, but it fails when all the tests in the environment are launched. We are checking whether the environment is left dirty by the previous tests, but the full run takes 1 h 40 min, which makes debugging very slow.

juliamagan commented 2 weeks ago

Finally, we found that the environment was dirty and the agents were not registered in the expected manager. What remains is to upload the results of the complete test set to confirm that it does not fail.
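A precondition check along these lines could catch this kind of dirty environment before the test runs. It is a sketch under assumptions: the mapping of agent name to manager name is taken as already collected, and `agents_on_wrong_manager` and the host names are hypothetical.

```python
def agents_on_wrong_manager(agent_managers, expected_manager):
    """Return the sorted names of agents not registered in expected_manager."""
    return sorted(
        name
        for name, manager in agent_managers.items()
        if manager != expected_manager
    )


# Hypothetical snapshot of where each agent is currently registered.
registered = {"agent-1": "manager-a", "agent-2": "manager-b", "agent-3": "manager-a"}
print(agents_on_wrong_manager(registered, "manager-a"))  # ['agent-2']
```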