red-hat-storage / ocs-ci

https://ocs-ci.readthedocs.io/en/latest/
MIT License

`test_rbd_capacity_workload_alerts` fails in external deployments and leaves the cluster in an unhealthy state #9499

Open sagihirshfeld opened 3 months ago

sagihirshfeld commented 3 months ago

Tier2 tests on external deployments consistently fail after `test_rbd_capacity_workload_alerts`, which runs for multiple hours.

For example, in https://url.corp.redhat.com/2b73552 over 90 tests were skipped with the "Ceph health check failed at setup" message, and almost all, if not all, of the MCG tests failed because the NooBaa token couldn't be retrieved:

NB RPC token was not retrieved successfully within the time limit.

This might be a bug and not an automation issue, but further investigation by the test/feature owner is required.

RP links for reference:

github-actions[bot] commented 3 weeks ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 30 days if no further activity occurs.

abdulkandathil commented 1 week ago

The following tier1 test is also failing on IBMZ with the same error.

11:28:48 - MainThread - ocs_ci.utility.utils - INFO  - Executing command: /home/automation/ocs-ci/data/mcg-cli api auth_api create_auth '{"role": "admin", "system": "noobaa", "email": "*****", "password": "*****"}' -ojson -n openshift-storage
11:38:48 - MainThread - ocs_ci.utility.utils - ERROR  - Exception raised during iteration: Command '['/home/automation/ocs-ci/data/mcg-cli', 'api', 'auth_api', 'create_auth', '{"role": "admin", "system": "noobaa", "email": "admin@noobaa.io", "password": "lq3qKdUgSqiImwLDNWbtzQ=="}', '-ojson', '-n', 'openshift-storage']' timed out after 600 seconds
Traceback (most recent call last):
  File "/home/automation/ocs-ci/ocs_ci/utility/utils.py", line 1446, in __iter__
    yield self.func(*self.func_args, **self.func_kwargs)
  File "/home/automation/ocs-ci/ocs_ci/ocs/resources/mcg.py", line 160, in internal_retrieval_logic
    rpc_response = self.send_rpc_query(
  File "/home/automation/ocs-ci/ocs_ci/ocs/resources/mcg.py", line 322, in send_rpc_query
    cli_output = self.exec_mcg_cmd(
  File "/home/automation/ocs-ci/ocs_ci/ocs/resources/mcg.py", line 877, in exec_mcg_cmd
    result = exec_cmd(
  File "/home/automation/ocs-ci/ocs_ci/utility/utils.py", line 674, in exec_cmd
    completed_process = subprocess.run(
  File "/usr/lib/python3.10/subprocess.py", line 505, in run
    stdout, stderr = process.communicate(input, timeout=timeout)
  File "/usr/lib/python3.10/subprocess.py", line 1154, in communicate
    stdout, stderr = self._communicate(input, endtime, timeout)
  File "/usr/lib/python3.10/subprocess.py", line 2022, in _communicate
    self._check_timeout(endtime, orig_timeout, stdout, stderr)
  File "/usr/lib/python3.10/subprocess.py", line 1198, in _check_timeout
    raise TimeoutExpired(
subprocess.TimeoutExpired: Command '['/home/automation/ocs-ci/data/mcg-cli', 'api', 'auth_api', 'create_auth', '{"role": "admin", "system": "noobaa", "email": "admin@noobaa.io", "password": "lq3qKdUgSqiImwLDNWbtzQ=="}', '-ojson', '-n', 'openshift-storage']' timed out after 600 seconds
11:38:48 - MainThread - /home/automation/ocs-ci/ocs_ci/ocs/resources/mcg.py - ERROR  - NB RPC token was not retrieved successfully within the time limit.
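
For anyone reading the traceback above: the failure path is a `subprocess.run(..., timeout=...)` call that raises `TimeoutExpired` inside a retry loop, and the loop then gives up once its overall time limit has passed. In the log, the command starts at 11:28:48 and both the timeout and the "not retrieved" error land at 11:38:48, so a single hung call appears to have used up the whole budget. Below is a minimal sketch of that pattern, not the ocs-ci implementation. The names `run_with_timeout` and `retrieve_with_deadline` are hypothetical, and a sleeping Python child process stands in for the hung `mcg-cli` call.

```python
import subprocess
import sys
import time


def run_with_timeout(cmd, timeout):
    """Run cmd; subprocess.run raises TimeoutExpired if it exceeds `timeout` seconds."""
    return subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)


def retrieve_with_deadline(cmd, per_call_timeout, overall_timeout, sleep=0.1):
    """Retry cmd until it exits with 0 or the overall deadline passes.

    Hypothetical helper for illustration only.
    Returns (stdout, attempts) on success, or (None, attempts) on giving up.
    """
    deadline = time.monotonic() + overall_timeout
    attempts = 0
    while time.monotonic() < deadline:
        attempts += 1
        try:
            result = run_with_timeout(cmd, per_call_timeout)
            if result.returncode == 0:
                return result.stdout, attempts
        except subprocess.TimeoutExpired as exc:
            # Same exception type as in the traceback above.
            print(f"attempt {attempts}: {exc}")
        time.sleep(sleep)
    return None, attempts


# Stand-in for a hung CLI call: the child sleeps longer than the per-call timeout.
hanging_cmd = [sys.executable, "-c", "import time; time.sleep(5)"]
token, attempts = retrieve_with_deadline(
    hanging_cmd, per_call_timeout=0.5, overall_timeout=1.2
)

# Stand-in for a healthy endpoint: the child returns immediately.
quick_cmd = [sys.executable, "-c", "print('ok')"]
ok_output, ok_attempts = retrieve_with_deadline(
    quick_cmd, per_call_timeout=10, overall_timeout=10
)
```

When the per-call timeout is as long as the overall limit, as the timestamps in the log suggest, one unresponsive call is enough to exhaust the retries. Whether a shorter per-call timeout would help here depends on why the endpoint stops responding, which still needs investigation.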