scylladb / scylla-operator

The Kubernetes Operator for ScyllaDB
https://operator.docs.scylladb.com/
Apache License 2.0

Flake - ScyllaCluster evictions [It] should allow one disruption #2146

Closed: zimnx closed this issue 1 month ago

zimnx commented 1 month ago

Link to the job that flaked.

https://prow.scylla-operator.scylladb.com/view/gs/scylla-operator-prow/pr-logs/pull/scylladb_scylla-operator/2137/pull-scylla-operator-master-e2e-gke-parallel-clusterip/1843658776314384384

Snippet of what failed.

 [FAILED] [530.082 seconds]
ScyllaCluster evictions [It] should allow one disruption
github.com/scylladb/scylla-operator/test/e2e/set/scyllacluster/scyllacluster_evictions.go:20
  Timeline >>
  STEP: Creating a new namespace @ 10/08/24 14:39:08.599
  Oct  8 14:39:08.680: INFO: Created namespace "e2e-test-scyllacluster-sngm2-0-cnxbt".
  STEP: Waiting for service account token Secret "e2e-user-token" in namespace "e2e-test-scyllacluster-sngm2-0-cnxbt". @ 10/08/24 14:39:09.361
  STEP: Waiting for default ServiceAccount in namespace "e2e-test-scyllacluster-sngm2-0-cnxbt". @ 10/08/24 14:39:09.615
  STEP: Waiting for kube-root-ca.crt in namespace "e2e-test-scyllacluster-sngm2-0-cnxbt". @ 10/08/24 14:39:09.736
  STEP: Creating a ScyllaCluster @ 10/08/24 14:39:09.897
  STEP: Waiting for the ScyllaCluster to roll out (RV=5727) @ 10/08/24 14:39:10.077
  Oct  8 14:43:59.373: INFO: ScyllaCluster e2e-test-scyllacluster-sngm2-0-cnxbt/basic-glp79 (RV=12958) is rolled out
  STEP: Verifying the ScyllaCluster @ 10/08/24 14:43:59.373
  Oct  8 14:43:59.476: INFO: Found 2 pvc(s) in namespace "e2e-test-scyllacluster-sngm2-0-cnxbt"
  Oct  8 14:43:59.476: INFO: Found 2 pvc(s) for ScyllaCluster "e2e-test-scyllacluster-sngm2-0-cnxbt/basic-glp79"
  STEP: Waiting for the ScyllaCluster(s) to reach consistency ALL @ 10/08/24 14:43:59.546
  Oct  8 14:43:59.727: INFO: ScyllaDB nodes have reached status consistency.
  STEP: Inserting data @ 10/08/24 14:43:59.769
  Oct  8 14:43:59.769: INFO: Creating CQL session (hosts="10.23.210.234, 10.23.211.185")
  STEP: Inserting data @ 10/08/24 14:43:59.774
  Oct  8 14:43:59.774: INFO: Creating keyspace "8xqqfsfh" with RF "'replication_factor': 2"
  Oct  8 14:44:00.470: INFO: Creating table "8xqqfsfh"."test"
  Oct  8 14:44:01.489: INFO: Inserting data into table "8xqqfsfh"."test"
  Oct  8 14:44:01.506: INFO: Awaiting schema agreement
  Oct  8 14:44:01.507: INFO: Schema agreement reached
  STEP: Verifying the data @ 10/08/24 14:44:01.507
  Oct  8 14:44:01.507: INFO: Reading data from table "8xqqfsfh"."test"
  STEP: Allowing the first pod to be evicted @ 10/08/24 14:44:01.51
  [FAILED] in [It] - github.com/scylladb/scylla-operator/test/e2e/set/scyllacluster/scyllacluster_evictions.go:54 @ 10/08/24 14:44:01.529
  STEP: Collecting events from namespace "e2e-test-scyllacluster-sngm2-0-cnxbt". @ 10/08/24 14:44:01.53 
   [FAILED] Unexpected error:
      <*errors.StatusError | 0xc00036a460>: 
      Cannot evict pod as it would violate the pod's disruption budget.
      {
          ErrStatus: {
              TypeMeta: {Kind: "", APIVersion: ""},
              ListMeta: {
                  SelfLink: "",
                  ResourceVersion: "",
                  Continue: "",
                  RemainingItemCount: nil,
              },
              Status: "Failure",
              Message: "Cannot evict pod as it would violate the pod's disruption budget.",
              Reason: "TooManyRequests",
              Details: {
                  Name: "",
                  Group: "",
                  Kind: "",
                  UID: "",
                  Causes: [
                      {
                          Type: "DisruptionBudget",
                          Message: "The disruption budget basic-glp79 needs 1 healthy pods and has 2 currently",
                          Field: "",
                      },
                  ],
                  RetryAfterSeconds: 0,
              },
              Code: 429,
          },
      }
  occurred
  In [It] at: github.com/scylladb/scylla-operator/test/e2e/set/scyllacluster/scyllacluster_evictions.go:54 @ 10/08/24 14:44:01.529
  Full Stack Trace
    github.com/scylladb/scylla-operator/test/e2e/set/scyllacluster.init.func5.1()
        github.com/scylladb/scylla-operator/test/e2e/set/scyllacluster/scyllacluster_evictions.go:54 +0x7a5 

zimnx commented 1 month ago

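It looks like the cleanup Job's pods match the PDB selector. The PDB then fails to sync (`jobs.batch does not implement the scale subresource`), `disruptionsAllowed` stays at 0, and evictions are rejected even when all nodes are up: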

apiVersion: policy/v1
kind: PodDisruptionBudget
[...]
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: scylla
      app.kubernetes.io/managed-by: scylla-operator
      app.kubernetes.io/name: scylla
      scylla/cluster: basic-glp79
status:
  conditions:
  - lastTransitionTime: "2024-10-08T14:44:00Z"
    message: jobs.batch does not implement the scale subresource
    observedGeneration: 1
    reason: SyncFailed
    status: "False"
    type: DisruptionAllowed
  currentHealthy: 2
  desiredHealthy: 1
  disruptionsAllowed: 0
  expectedPods: 2
  observedGeneration: 1
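
To illustrate the diagnosis above, here is a minimal Go sketch of how a `matchLabels` selector behaves and what the budget would allow if the sync succeeded. It is not the operator's or the disruption controller's code. The `matchesSelector` and `disruptionsAllowed` helpers are hypothetical, the job name `cleanup-example` is made up, and the assumption that the cleanup Job's pods carry the same cluster labels is the hypothesis from this issue, not something confirmed here.

```go
package main

import "fmt"

// matchesSelector reports whether every key/value pair in matchLabels is
// present in podLabels. Extra pod labels are ignored, which is how a
// matchLabels selector behaves.
func matchesSelector(matchLabels, podLabels map[string]string) bool {
	for key, want := range matchLabels {
		if got, ok := podLabels[key]; !ok || got != want {
			return false
		}
	}
	return true
}

// disruptionsAllowed is the budget a successfully synced PDB would report:
// currentHealthy - desiredHealthy, never below zero.
func disruptionsAllowed(currentHealthy, desiredHealthy int) int {
	if d := currentHealthy - desiredHealthy; d > 0 {
		return d
	}
	return 0
}

func main() {
	// Selector copied from the PDB in this issue.
	selector := map[string]string{
		"app":                          "scylla",
		"app.kubernetes.io/managed-by": "scylla-operator",
		"app.kubernetes.io/name":       "scylla",
		"scylla/cluster":               "basic-glp79",
	}

	// Assumption: the cleanup Job's pod inherits the cluster labels and
	// additionally carries the job-name label that Kubernetes sets on Job pods.
	cleanupPod := map[string]string{
		"app":                          "scylla",
		"app.kubernetes.io/managed-by": "scylla-operator",
		"app.kubernetes.io/name":       "scylla",
		"scylla/cluster":               "basic-glp79",
		"batch.kubernetes.io/job-name": "cleanup-example", // hypothetical name
	}

	fmt.Println("cleanup pod selected by PDB:", matchesSelector(selector, cleanupPod))

	// Values from the PDB status: currentHealthy=2, desiredHealthy=1.
	fmt.Println("disruptions a synced PDB would allow:", disruptionsAllowed(2, 1))
}
```

With `currentHealthy: 2` and `desiredHealthy: 1`, a synced budget would leave room for one eviction. The status instead shows `disruptionsAllowed: 0` together with the `SyncFailed` condition, which is consistent with a pod owned by a Job being selected.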