scylladb / scylla-cluster-tests


[K8S] `test_functional.py::test_mgmt_repair[3.2.5]` functional test fails #6972

Open vponomaryov opened 9 months ago

vponomaryov commented 9 months ago

Issue description

Using the latest scylla-operator version, the `test_functional.py::test_mgmt_repair[3.2.5]` K8S functional test fails:

```
2023-12-17 01:29:31,683 f:wait.py         l:79   c:sdcm.wait            p:ERROR > Wait for: <lambda>: timeout - 300 seconds - expired
2023-12-17 01:29:31,684 f:wait.py         l:83   c:sdcm.wait            p:ERROR > last error: RetryError(<Future at 0x7f32e3016cb0 state=finished returned NoneType>)
FAILED
```
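For context, the timeout above comes from SCT's generic polling helper. A minimal sketch of that pattern, with simplified names (the real `sdcm.wait.wait_for` retries the callable via `tenacity`, which is why an exhausted wait surfaces as a `RetryError`):

```python
import time

def wait_for(predicate, timeout=300, step=10, text="condition"):
    """Poll `predicate` until it returns a truthy value or `timeout` expires.

    Simplified stand-in for sdcm.wait.wait_for; the real helper retries via
    tenacity, so an exhausted wait surfaces as a RetryError like the one above.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = predicate()
        if result:
            return result
        time.sleep(step)
    raise TimeoutError(f"Wait for: {text}: timeout - {timeout} seconds - expired")
```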

Meanwhile, the scylla-manager shows the following output:

```
2023-12-17 01:24:18,641 f:remote_base.py  l:521  c:KubernetesCmdRunner  p:DEBUG > Running command "sctool tasks -c c21743b1-7e0c-4712-8875-3e709fe6b733"...
2023-12-17 01:24:18,736 f:base.py         l:222  c:KubernetesCmdRunner  p:DEBUG > +------------------------+-------------+--------+----------+---------+-------+------------------------+------------+---------+------------------------+
2023-12-17 01:24:18,736 f:base.py         l:222  c:KubernetesCmdRunner  p:DEBUG > | Task                   | Schedule    | Window | Timezone | Success | Error | Last Success           | Last Error | Status  | Next                   |
2023-12-17 01:24:18,737 f:base.py         l:222  c:KubernetesCmdRunner  p:DEBUG > +------------------------+-------------+--------+----------+---------+-------+------------------------+------------+---------+------------------------+
2023-12-17 01:24:18,737 f:base.py         l:222  c:KubernetesCmdRunner  p:DEBUG > | healthcheck/cql        | @every 15s  |        | UTC      | 138     | 0     | 17 Dec 23 01:24:14 UTC |            | DONE    | 17 Dec 23 01:24:29 UTC |
2023-12-17 01:24:18,737 f:base.py         l:222  c:KubernetesCmdRunner  p:DEBUG > | healthcheck/rest       | @every 1m0s |        | UTC      | 35      | 0     | 17 Dec 23 01:23:39 UTC |            | DONE    | 17 Dec 23 01:24:39 UTC |
2023-12-17 01:24:18,737 f:base.py         l:222  c:KubernetesCmdRunner  p:DEBUG > | healthcheck/alternator | @every 15s  |        | UTC      | 133     | 0     | 17 Dec 23 01:23:43 UTC |            | RUNNING |                        |
2023-12-17 01:24:18,737 f:base.py         l:222  c:KubernetesCmdRunner  p:DEBUG > +------------------------+-------------+--------+----------+---------+-------+------------------------+------------+---------+------------------------+
2023-12-17 01:24:18,751 f:base.py         l:142  c:KubernetesCmdRunner  p:DEBUG > Command "sctool tasks -c c21743b1-7e0c-4712-8875-3e709fe6b733" finished with status 0
```
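The expired wait is presumably polling this table for a `repair/...` row that never appears; only the three healthcheck tasks are listed. A hypothetical check along those lines (the helper name and the column parsing are illustrative, not actual SCT code):

```python
import subprocess

def repair_task_names(cluster_id: str) -> list[str]:
    """Return the names of repair tasks listed by `sctool tasks`.

    Against the output above this keeps returning an empty list,
    so a wait built on it would time out exactly as the test did.
    """
    out = subprocess.check_output(["sctool", "tasks", "-c", cluster_id], text=True)
    return [
        line.split("|")[1].strip()  # first data column is "Task"
        for line in out.splitlines()
        if line.startswith("|") and "repair/" in line
    ]
```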

Impact

The mgmt repair task doesn't get created.

How frequently does it reproduce?

Reproduces 100% of the time when running on the EKS backend.

Installation details

Kernel Version: 5.10.199-190.747.amzn2.x86_64
Scylla version (or git commit hash): 5.5.0~dev-20231215.10a11c2886cf with build-id 0856cf64f66d65625e67fa033b9a47dee1d49a54

Operator Image: scylladb/scylla-operator:latest
Operator Helm Version: v1.12.0-alpha.0-144-g60f7824
Operator Helm Repository: https://storage.googleapis.com/scylla-operator-charts/latest
Cluster size: 4 nodes (i4i.xlarge)

Scylla Nodes used in this run: No resources left at the end of the run

OS / Image: `` (k8s-eks: undefined_region)

Test: functional-eks
Test id: 58be8dee-4c43-48f5-85ee-9433752f3d99
Test name: scylla-operator/operator-master/functional/functional-eks
Test config file(s):

Logs and commands

- Restore Monitor Stack command: `$ hydra investigate show-monitor 58be8dee-4c43-48f5-85ee-9433752f3d99`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=58be8dee-4c43-48f5-85ee-9433752f3d99)
- Show all stored logs command: `$ hydra investigate show-logs 58be8dee-4c43-48f5-85ee-9433752f3d99`

Logs:

- **kubernetes-58be8dee.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/58be8dee-4c43-48f5-85ee-9433752f3d99/20231217_024128/kubernetes-58be8dee.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/58be8dee-4c43-48f5-85ee-9433752f3d99/20231217_024128/kubernetes-58be8dee.tar.gz)
- **kubernetes-must-gather-58be8dee.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/58be8dee-4c43-48f5-85ee-9433752f3d99/20231217_024128/kubernetes-must-gather-58be8dee.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/58be8dee-4c43-48f5-85ee-9433752f3d99/20231217_024128/kubernetes-must-gather-58be8dee.tar.gz)
- **db-cluster-58be8dee.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/58be8dee-4c43-48f5-85ee-9433752f3d99/20231217_024128/db-cluster-58be8dee.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/58be8dee-4c43-48f5-85ee-9433752f3d99/20231217_024128/db-cluster-58be8dee.tar.gz)
- **sct-runner-events-58be8dee.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/58be8dee-4c43-48f5-85ee-9433752f3d99/20231217_024128/sct-runner-events-58be8dee.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/58be8dee-4c43-48f5-85ee-9433752f3d99/20231217_024128/sct-runner-events-58be8dee.tar.gz)
- **sct-58be8dee.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/58be8dee-4c43-48f5-85ee-9433752f3d99/20231217_024128/sct-58be8dee.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/58be8dee-4c43-48f5-85ee-9433752f3d99/20231217_024128/sct-58be8dee.log.tar.gz)
- **loader-set-58be8dee.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/58be8dee-4c43-48f5-85ee-9433752f3d99/20231217_024128/loader-set-58be8dee.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/58be8dee-4c43-48f5-85ee-9433752f3d99/20231217_024128/loader-set-58be8dee.tar.gz)
- **parallel-timelines-report-58be8dee.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/58be8dee-4c43-48f5-85ee-9433752f3d99/20231217_024128/parallel-timelines-report-58be8dee.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/58be8dee-4c43-48f5-85ee-9433752f3d99/20231217_024128/parallel-timelines-report-58be8dee.tar.gz)

[Jenkins job URL](https://jenkins.scylladb.com/job/scylla-operator/job/operator-master/job/functional/job/functional-eks/141/)
[Argus](https://argus.scylladb.com/test/89fdaa2f-5255-4696-a268-90cb7090482e/runs?additionalRuns[]=58be8dee-4c43-48f5-85ee-9433752f3d99)
fruch commented 9 months ago

According to the cluster CRD there is a task, but it has failed.

My guess is that we start the task before the update of the manager agents has fully completed.

Previously the task was triggered again and again by the operator; now it isn't anymore.

But again, this is just a guess.
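If that guess holds, one mitigation would be to gate repair-task creation on agent reachability. A rough sketch of such a guard (hypothetical helper; treating the absence of `DOWN` in `sctool status` output as readiness is an assumption):

```python
import subprocess
import time

def wait_for_agents_up(cluster_id: str, timeout: int = 300, step: int = 15) -> None:
    """Block until `sctool status` stops reporting DOWN nodes (hypothetical guard).

    The idea: only schedule the repair task once every manager agent is
    reachable, so the task cannot be created against a half-updated cluster.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        out = subprocess.check_output(["sctool", "status", "-c", cluster_id], text=True)
        if "DOWN" not in out:
            return
        time.sleep(step)
    raise TimeoutError("manager agents did not become reachable in time")
```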

vponomaryov commented 9 months ago

> According to the cluster CRD there is a task, but it has failed.
>
> My guess is that we start the task before the update of the manager agents has fully completed.
>
> Previously the task was triggered again and again by the operator; now it isn't anymore.
>
> But again, this is just a guess.

There is no agent or server update in this case. Everything is at the required version, 3.2.5.

fruch commented 9 months ago

> > According to the cluster CRD there is a task, but it has failed.
> >
> > My guess is that we start the task before the update of the manager agents has fully completed.
> >
> > Previously the task was triggered again and again by the operator; now it isn't anymore.
> >
> > But again, this is just a guess.
>
> There is no agent or server update in this case. Everything is at the required version, 3.2.5.

So why are we failing to communicate with the Scylla node?
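One way to answer that directly is to run `sctool status` from inside the manager pod and see which endpoint (CQL or REST) is unreachable. A quick diagnostic sketch (the namespace and deployment name assume the default scylla-operator layout and are not taken from this run's logs):

```python
import subprocess

# Assumed defaults for a scylla-operator deployment; adjust to the actual run.
MANAGER_NAMESPACE = "scylla-manager"
CLUSTER_ID = "c21743b1-7e0c-4712-8875-3e709fe6b733"

# Run sctool inside the manager pod and print per-node CQL/REST statuses.
result = subprocess.run(
    ["kubectl", "-n", MANAGER_NAMESPACE, "exec", "deploy/scylla-manager", "--",
     "sctool", "status", "-c", CLUSTER_ID],
    capture_output=True, text=True, check=False)
print(result.stdout or result.stderr)
```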