Open vponomaryov opened 1 year ago
another reproduction:
Kernel Version: 5.10.178-162.673.amzn2.x86_64
Scylla version (or git commit hash): 5.3.0~dev-20230512.7fcc4031229b
with build-id d6f9b433d295cf0420d28abedc89ff756eb0b75e
Operator Image: scylladb/scylla-operator:latest
Operator Helm Version: v1.9.0-alpha.3-5-g34369da
Operator Helm Repository: https://storage.googleapis.com/scylla-operator-charts/latest
Cluster size: 4 nodes (i3.4xlarge)
Scylla Nodes used in this run: No resources left at the end of the run
OS / Image: `` (k8s-eks: eu-north-1)
Test: longevity-scylla-operator-3h-eks
Test id: be583031-c65c-41ae-9bc9-359c7d6739d6
Test name: scylla-operator/operator-master/eks/longevity-scylla-operator-3h-eks
Test config file(s):
Duplicated logs aren't a real issue.
The root cause is that Scylla doesn't immediately switch the gossip status of a node once we trigger decommission, so there may be multiple calls in between the UN -> Decommissioning change.
But as far as I know, it doesn't interrupt service in any way.
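Not the operator's actual code, but a minimal Go sketch of that window, assuming a node reachable at a hypothetical address on the Scylla REST API port and the `/storage_service/operation_mode` and `/storage_service/decommission` endpoints: until gossip stops reporting the node as NORMAL, every retry re-issues the same decommission request, and those repeats are what show up as "duplicated" log lines.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

// node is a hypothetical Scylla REST API address used only for illustration.
const node = "http://10.0.0.1:10000"

// client uses a short timeout, so a long-running decommission call returns
// early and looks like a failure that has to be retried.
var client = &http.Client{Timeout: 10 * time.Second}

// mode reads the node's current operation mode (e.g. "NORMAL", "LEAVING");
// the endpoint path is an assumption based on Scylla's management API.
func mode() (string, error) {
	resp, err := client.Get(node + "/storage_service/operation_mode")
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	b, err := io.ReadAll(resp.Body)
	return string(b), err
}

func main() {
	backoff := time.Second
	for attempt := 1; ; attempt++ {
		if m, err := mode(); err == nil && m != `"NORMAL"` {
			// Gossip finally reflects the decommission; no further calls are made.
			fmt.Printf("node left NORMAL after %d decommission call(s)\n", attempt-1)
			return
		}
		// The node still reports itself as a healthy member, so the same
		// decommission request is sent again - these repeats are the
		// "duplicated" log lines.
		resp, err := client.Post(node+"/storage_service/decommission", "application/json", nil)
		if err == nil {
			resp.Body.Close()
		}
		fmt.Printf("attempt %d: decommission call issued (err=%v)\n", attempt, err)
		time.Sleep(backoff)
		backoff *= 2
	}
}
```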
Technically, these are not duplicated logs, they are logs of the API calls that were made.
The issue is that for the scylla-operator these are ordinary API call retries, but for us they are false negatives.
How should we distinguish a real decommission problem from a retry?
The operator constantly retries actions until they succeed, so you may observe multiple retries everywhere.

> How should we distinguish a real decommission problem from a retry?

Treat it as a black box: observe actions and status, but don't look inside.
If the decommission doesn't happen in a reasonable time, consider it failed. The number of retries doesn't matter as long as the required actions happen.
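On the test side, one way to apply that advice is to check only the observable outcome against a deadline and ignore the number of retries entirely. A sketch under the same assumptions as above (hypothetical node address, REST port 10000, the `/storage_service/operation_mode` endpoint, and an arbitrary 30-minute deadline):

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"net/http"
	"time"
)

// nodeMode reads the node's operation mode from the (assumed) Scylla REST API.
func nodeMode(client *http.Client, host string) (string, error) {
	resp, err := client.Get("http://" + host + ":10000/storage_service/operation_mode")
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	b, err := io.ReadAll(resp.Body)
	return string(b), err
}

// waitForDecommission treats the operator as a black box: it never counts
// retries, it only checks whether the node actually left within the deadline.
func waitForDecommission(host string, deadline time.Duration) error {
	client := &http.Client{Timeout: 10 * time.Second}
	end := time.Now().Add(deadline)
	for time.Now().Before(end) {
		mode, err := nodeMode(client, host)
		if err != nil {
			// The pod/API is gone - in this sketch that is taken as the end
			// state of a successful decommission.
			return nil
		}
		if mode == `"DECOMMISSIONED"` {
			return nil
		}
		time.Sleep(15 * time.Second)
	}
	return errors.New("decommission did not complete within the deadline")
}

func main() {
	if err := waitForDecommission("10.0.0.1", 30*time.Minute); err != nil {
		fmt.Println("FAIL:", err)
		return
	}
	fmt.Println("decommission completed")
}
```

A check of this kind is independent of how the operator paces its retries, so repeated API calls stop registering as false negatives.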
The Scylla Operator project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:
- After a period of inactivity, lifecycle/stale is applied
- After a further period of inactivity once lifecycle/stale was applied, lifecycle/rotten is applied
- After a further period of inactivity once lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close

/lifecycle stale
The Scylla Operator project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:
- After a period of inactivity, lifecycle/stale is applied
- After a further period of inactivity once lifecycle/stale was applied, lifecycle/rotten is applied
- After a further period of inactivity once lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close

/lifecycle rotten
Describe the bug
Running the decommission DB pod/node operation, we see the following log messages in the target pod logs:

There are 13 redundant API calls to Scylla in total, 7 of which are made in less than 1s. All of them look like runs of the same API query, attempting to decommission the same node again and again with exponential back-off, until the decommission operation started by the very first API call succeeds.

So, the problems caused by this behavior:
To Reproduce
Steps to reproduce the behavior:
Expected behavior
The scylla-operator must not make redundant API calls and, moreover, must not log false errors like the ones shown here.
Logs
CI job: https://jenkins.scylladb.com/job/scylla-operator/job/operator-master/job/eks/job/longevity-scylla-operator-3h-eks/76
db-cluster: https://cloudius-jenkins-test.s3.amazonaws.com/9c3e6d95-056c-4bae-be3a-9140da5e4d52/20230430_044725/db-cluster-9c3e6d95.tar.gz
sct-runner-events: https://cloudius-jenkins-test.s3.amazonaws.com/9c3e6d95-056c-4bae-be3a-9140da5e4d52/20230430_044725/sct-runner-events-9c3e6d95.tar.gz
sct-runner-log: https://cloudius-jenkins-test.s3.amazonaws.com/9c3e6d95-056c-4bae-be3a-9140da5e4d52/20230430_044725/sct-9c3e6d95.log.tar.gz
kubernetes-log: https://cloudius-jenkins-test.s3.amazonaws.com/9c3e6d95-056c-4bae-be3a-9140da5e4d52/20230430_044725/kubernetes-9c3e6d95.tar.gz
Environment: