k8ssandra / k8ssandra-operator

The Kubernetes operator for K8ssandra
https://k8ssandra.io/
Apache License 2.0

Rolling out a multi-DC cluster with Reaper + SG fails to deploy Reaper, SG #937

Closed: Miles-Garnsey closed this issue 1 year ago

Miles-Garnsey commented 1 year ago

What happened?

I applied the manifest below to a two-DC cluster managed by k8ssandra-operator. Stargate and Reaper failed to roll out, even though both DCs present as healthy according to their pod statuses:

kubectl --context gke_k8ssandra_australia-southeast1-a_nz-cncf-1 get pods -n k8ssandra-operator
W0330 11:39:36.926982   77857 gcp.go:119] WARNING: the gcp auth plugin is deprecated in v1.22+, unavailable in v1.26+; use gcloud instead.
To learn more, consult https://cloud.google.com/blog/products/containers-kubernetes/kubectl-auth-changes-in-gke
NAME                                               READY   STATUS    RESTARTS        AGE
cass-operator-controller-manager-df55c7f45-mtjf8   1/1     Running   6 (3d ago)      3d1h
k8ssandra-operator-64fc9b86b4-77d6b                1/1     Running   1 (2d22h ago)   3d1h
test-dc1-default-sts-0                             2/2     Running   0               14m
test-dc1-default-sts-1                             2/2     Running   0               14m
test-dc1-default-sts-2                             2/2     Running   0               14m
kubectl --context gke_k8ssandra_australia-southeast1-a_nz-cncf-2 get pods -n k8ssandra-operator
W0330 11:40:10.005583   77903 gcp.go:119] WARNING: the gcp auth plugin is deprecated in v1.22+, unavailable in v1.26+; use gcloud instead.
To learn more, consult https://cloud.google.com/blog/products/containers-kubernetes/kubectl-auth-changes-in-gke
NAME                                               READY   STATUS    RESTARTS        AGE
cass-operator-controller-manager-df55c7f45-bdl44   1/1     Running   1 (7h29m ago)   2d22h
k8ssandra-operator-6fc6bd6b4c-k9mxm                1/1     Running   0               2d22h
test-dc2-default-sts-0                             2/2     Running   0               9m25s
test-dc2-default-sts-1                             2/2     Running   0               9m25s
test-dc2-default-sts-2                             2/2     Running   0               9m25s
kubectl get k8ssandraclusters.k8ssandra.io -n k8ssandra-operator
W0330 11:40:33.170967   77969 gcp.go:119] WARNING: the gcp auth plugin is deprecated in v1.22+, unavailable in v1.26+; use gcloud instead.
To learn more, consult https://cloud.google.com/blog/products/containers-kubernetes/kubectl-auth-changes-in-gke
NAME   ERROR
test   CALL list keyspaces system_traces failed on all datacenter dc2 pods

Did you expect to see something different?

Damn right I did.

How to reproduce it (as minimally and precisely as possible):

Apply this manifest to a two DC cluster:

apiVersion: k8ssandra.io/v1alpha1
kind: K8ssandraCluster
metadata:
  name: test
  namespace: k8ssandra-operator
spec:
  stargate:
    size: 1
  reaper: {}
  auth: false
  cassandra:
    serverVersion: 4.0.4
    serverType: cassandra
    networking:
      hostNetwork: true
    datacenters:
      - metadata:
          name: dc1
        size: 3
        telemetry:
          prometheus:
            enabled: true
        cdc:
          pulsarServiceUrl: pulsar://pulsar-proxy.pulsar.svc.cluster.local:6650
          topicPrefix: persistent://public/default/events-
          cdcWorkingDir: /var/lib/cassandra/cdc
        storageConfig:
          cassandraDataVolumeClaimSpec:
            storageClassName: standard
            accessModes:
              - ReadWriteOnce
            resources:
              requests:
                storage: 5Gi
      - metadata:
          name: dc2
        size: 3
        k8sContext: gke_k8ssandra_australia-southeast1-a_nz-cncf-2
        storageConfig:
          cassandraDataVolumeClaimSpec:
            storageClassName: standard
            accessModes:
              - ReadWriteOnce
            resources:
              requests:
                storage: 5Gi

Environment

k8ssandra/k8ssandra-operator:v1.6.1

Logs

2023-03-29T22:36:52.549Z    ERROR Failed to CALL list keyspaces system_traces on pod test-dc2-default-sts-2 {"controller": "k8ssandracluster", "controllerGroup": "k8ssandra.io", "controllerKind": "K8ssandraCluster", "K8ssandraCluster": {"name":"test","namespace":"k8ssandra-operator"}, "namespace": "k8ssandra-operator", "name": "test", "reconcileID": "4c8e013a-cc8c-4a2d-b732-6425ce7510db", "K8ssandraCluster": "k8ssandra-operator/test", "CassandraDatacenter": "k8ssandra-operator/dc2", "K8SContext": "gke_k8ssandra_australia-southeast1-a_nz-cncf-2", "error": "Get \"http://10.152.15.209:8080/api/v0/ops/keyspace?keyspaceName=system_traces\": context deadline exceeded"}
github.com/k8ssandra/k8ssandra-operator/pkg/cassandra.(*defaultManagementApiFacade).ListKeyspaces
    /workspace/pkg/cassandra/management.go:195
github.com/k8ssandra/k8ssandra-operator/pkg/cassandra.(*defaultManagementApiFacade).EnsureKeyspaceReplication
    /workspace/pkg/cassandra/management.go:289
github.com/k8ssandra/k8ssandra-operator/controllers/k8ssandra.(*K8ssandraClusterReconciler).updateReplicationOfSystemKeyspaces
    /workspace/controllers/k8ssandra/schemas.go:157
github.com/k8ssandra/k8ssandra-operator/controllers/k8ssandra.(*K8ssandraClusterReconciler).checkSchemas
    /workspace/controllers/k8ssandra/schemas.go:43
github.com/k8ssandra/k8ssandra-operator/controllers/k8ssandra.(*K8ssandraClusterReconciler).reconcileDatacenters
    /workspace/controllers/k8ssandra/datacenters.go:199
github.com/k8ssandra/k8ssandra-operator/controllers/k8ssandra.(*K8ssandraClusterReconciler).reconcile
    /workspace/controllers/k8ssandra/k8ssandracluster_controller.go:143
github.com/k8ssandra/k8ssandra-operator/controllers/k8ssandra.(*K8ssandraClusterReconciler).Reconcile
    /workspace/controllers/k8ssandra/k8ssandracluster_controller.go:91
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.13.0/pkg/internal/controller/controller.go:121
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.13.0/pkg/internal/controller/controller.go:320
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.13.0/pkg/internal/controller/controller.go:273
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.13.0/pkg/internal/controller/controller.go:234
2023-03-29T22:36:52.549Z    INFO client::callNodeMgmtEndpoint {"controller": "k8ssandracluster", "controllerGroup": "k8ssandra.io", "controllerKind": "K8ssandraCluster", "K8ssandraCluster": {"name":"test","namespace":"k8ssandra-operator"}, "namespace": "k8ssandra-operator", "name": "test", "reconcileID": "4c8e013a-cc8c-4a2d-b732-6425ce7510db", "K8ssandraCluster": "k8ssandra-operator/test", "CassandraDatacenter": "k8ssandra-operator/dc2", "K8SContext": "gke_k8ssandra_australia-southeast1-a_nz-cncf-2"}
Miles-Garnsey commented 1 year ago

Further diagnostics:

The additional-seeds service reports the following:

No target 10.152.15.210
No target 10.152.15.231
No target 10.152.15.209

Attempting to connect via CQL to these nodes in DC2 from DC1

This appears to work:

cassandra@gke-nz-cncf-1-default-pool-73f3d76d-bd8s:/$ cqlsh 10.152.15.209
Connected to test at 10.152.15.209:9042
[cqlsh 6.0.0 | Cassandra 4.0.4 | CQL spec 3.4.5 | Native protocol v5]
Use HELP for help.

Attempting to contact the DC2 management API from DC1


curl 10.152.15.210:8080/api/v0/metadata/endpoints
<A bunch of binary data here that I won't reproduce, but I can access the endpoint no problem>

My conclusion is that this is not a cloud environment or firewalling issue. It appears to be a legitimate bug in either the management API or the way the operator attempts to access the nodes.
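For the record, the operator's failing request can be approximated from a DC1 node with a sketch like the following. POD_IP is a placeholder for a DC2 pod's host IP (hostNetwork is enabled), and the endpoint path is taken verbatim from the error log above:

```shell
#!/bin/sh
# Sketch only: POD_IP is a placeholder; substitute a DC2 node's IP.
POD_IP="10.152.15.210"
URL="http://${POD_IP}:8080/api/v0/ops/keyspace?keyspaceName=system_traces"

# A short --max-time mimics the operator's context deadline, so a blocked
# port surfaces as a timeout rather than an indefinite hang.
curl --silent --show-error --max-time 3 "$URL" \
  || echo "management API unreachable (firewall suspect)"
```

If this times out while cqlsh on 9042 succeeds, only the management API port (8080) is being dropped.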
adejanovski commented 1 year ago

Creating a firewall rule to allow ingress calls on port 8080 apparently solved the issue. Could you confirm, @Miles-Garnsey?

Miles-Garnsey commented 1 year ago

We still don't really understand what is wrong with these firewall rules and have decided to move on with our lives, attributing any strange behaviour to ghosts (although we are also open to the possibility that poltergeists are responsible).

In any event, it is a cloud environment issue, not an operator issue, so I'm closing.

5olitude commented 1 year ago

Hey, I'm facing the same issue while deploying k8ssandra in GKE.

Miles-Garnsey commented 1 year ago

@5olitude, when we encountered this we determined that it was due to firewall rules in GKE not allowing traffic on port 8080. We solved this by allowing ingress to the nodes on port 8080. Please let us know if that solves your issue.
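For reference, the rule we applied looked roughly like this (a sketch only: RULE_NAME, NETWORK, and SOURCE_RANGES are placeholders you should replace with your VPC network name and the peer cluster's node/pod CIDR):

```shell
#!/bin/sh
# Placeholders: adjust to your GCP project and VPC layout.
RULE_NAME="allow-k8ssandra-mgmt-api"
NETWORK="default"
SOURCE_RANGES="10.152.0.0/16"

# Allow ingress to the management API (port 8080) from the other cluster.
gcloud compute firewall-rules create "$RULE_NAME" \
  --network="$NETWORK" \
  --direction=INGRESS \
  --allow=tcp:8080 \
  --source-ranges="$SOURCE_RANGES"
```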

5olitude commented 1 year ago

@Miles-Garnsey Yes Miles, I resolved the issue by adding firewall rules allowing traffic on ports 8080 and 7777. Thanks for your reply!