opensearch-project / opensearch-build

🧰 OpenSearch / OpenSearch-Dashboards Build Systems
Apache License 2.0

[Bug]: Flaky alerting dashboards cypress tests during Jenkins executions #4791

Open AWSHurneyt opened 2 weeks ago

AWSHurneyt commented 2 weeks ago

Describe the bug

When the Jenkins job runs the Cypress tests in the functional test repo, 2 alerting dashboards Cypress tests are flaky when executed against a security-enabled cluster. As described in this comment, each of the tests has passed in previous test runs: https://github.com/opensearch-project/alerting-dashboards-plugin/issues/975#issuecomment-2174388518

To reproduce

We have had difficulty reproducing the flakiness locally. The 2 tests called out as failing in the issue above pass reliably when executed locally against a security-enabled cluster.

  1. Use Docker to create a security-enabled domain with the frontend running
  2. Follow the steps in the functional test repo developer guide to execute the tests against the Docker cluster: https://github.com/opensearch-project/opensearch-dashboards-functional-test/blob/main/DEVELOPER_GUIDE.md
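
Since a single local run passes, one way to probe for intermittent failures is to repeat the same test command many times and count the non-zero exits. The following is a minimal sketch only, not something taken from the functional test repo: the helper names `run_repeatedly` and `failure_rate` are hypothetical, and the command to repeat is whatever the developer guide prescribes for the spec in question.

```python
import subprocess
import sys
from typing import List, Sequence


def run_repeatedly(command: Sequence[str], runs: int) -> List[int]:
    """Run `command` `runs` times and return the exit code of each run.

    `command` is an assumption: pass the test invocation from the
    developer guide, split into arguments.
    """
    exit_codes: List[int] = []
    for _ in range(runs):
        completed = subprocess.run(
            command,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        exit_codes.append(completed.returncode)
    return exit_codes


def failure_rate(exit_codes: Sequence[int]) -> float:
    """Return the fraction of runs that exited non-zero (0.0 for no runs)."""
    if not exit_codes:
        return 0.0
    failures = sum(1 for code in exit_codes if code != 0)
    return failures / len(exit_codes)
```

For example, `failure_rate(run_repeatedly(cmd, 20))` gives the share of 20 runs that failed, where `cmd` is the argument list for the test command. A rate above zero locally would suggest a test-level problem, while a consistent zero would point more toward the Jenkins environment.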

Expected behavior

No response

Screenshots

This test failed in this instance because the "select your tenant" window wasn't closed. Screenshot 2024-06-18 at 1 17 41 PM

This test failed because it timed out waiting for the UI to load. https://ci.opensearch.org/ci/dbc/integ-test-opensearch-dashboards/2.15.0/7742/linux/x64/deb/test-results/5966/integ-test/alertingDashboards/with-security/cypress-screenshots/plugins/alerting-dashboards-plugin/document_level_monitor_spec.js/DocumentLevelMonitor+--+can+be+created+--+by+extraction+query+editor+--+before+each+hook+%28failed%29.png

This test failed because it timed out waiting for the UI to load. Screenshot 2024-06-18 at 4 29 53 PM

Host / Environment

No response

Additional context

No response

Relevant log output

No response

getsaurabh02 commented 2 weeks ago

Thanks @AWSHurneyt for creating the issue. When we say these tests pass reliably when executed locally against a security-enabled cluster, are we running the individual tests or the full suite? Since the failures are timeouts, where the test timed out waiting for the UI to load, I am wondering whether this is a cleanup issue that only appears when all tests run together.

Can you also point out any resource contention issues or limitations that these tests might be running into, based on the OS and OSD logs from the points of failure?

gaiksaya commented 1 week ago

[Triage] Couple of questions:

  1. Are these tests passing in the GitHub Actions workflow?
  2. Do they pass in a single run, or do they need to be run multiple times on GHA/local as well?
  3. Can you mention whether any additional configs (JVM, memory settings, etc.) are being added to this test cluster while running the tests?

Please work with us before the next release to get this fixed. Thanks!