k8ssandra / cass-operator

The DataStax Kubernetes Operator for Apache Cassandra
https://docs.datastax.com/en/cass-operator/doc/cass-operator/cassOperatorGettingStarted.html
Apache License 2.0

Does the operator handle CASSANDRA-17883? #539

Open rhuffy opened 1 year ago

rhuffy commented 1 year ago

What happened?

While reading through open Cassandra issues, I came across CASSANDRA-17883. The issue is that, when a Cassandra node is removed, its IP address gets added to a list of ignoredEndpoints in MigrationCoordinator. In the Cassandra source, there is a TODO comment that describes the issue:

        // TODO The endpoint address is now ignored but when a node with the same address is added again later,
        //  there will be no way to include it in schema synchronization other than restarting each other node
        //  see https://issues.apache.org/jira/browse/CASSANDRA-17883 for details

When a pod bounces and comes up with a different IP, the old IP is removed from gossip, and I believe it's also added to ignoredEndpoints. If another pod bounces and gets that original IP, my concern is that any schema changes on that node will be ignored by the rest of the cluster.

Does the operator do anything to handle this situation?

What did you expect to happen?

No response

How can we reproduce it (as minimally and precisely as possible)?

I don't have a repro on a test k8s cluster since I'm not sure how to force pods to come up with particular IPs.

You can, however, reproduce it in Cassandra dtests with these steps:

  1. Create a 3 node cluster (127.0.0.1, 127.0.0.2, 127.0.0.3)
  2. Stop node1
  3. Stop node2, change its IP to 127.0.0.1 and start
  4. Create a keyspace on node2.
  5. Assert that node3 receives that schema change

Note that if node1 is restarted with some new IP, it will receive the schema change from node2, and pass it along to node3.
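
To make the steps above concrete, here is a minimal toy model in Python. It is not Cassandra code and not a dtest: the `Node` class, its methods, and the 127.0.0.4 address are hypothetical, and it assumes (as suspected above) that node3 adds 127.0.0.1 to its ignored set once that address leaves gossip and never removes it.

```python
class Node:
    """Toy stand-in for a Cassandra node; not the real MigrationCoordinator."""

    def __init__(self, address):
        self.address = address
        self.ignored_endpoints = set()  # in-memory only, cleared by a restart
        self.keyspaces = set()

    def on_endpoint_removed(self, address):
        # Mirrors the TODO: the address is ignored and never un-ignored.
        self.ignored_endpoints.add(address)

    def on_schema_push(self, sender_address, keyspace):
        # Schema changes from an ignored address are dropped.
        if sender_address in self.ignored_endpoints:
            return False
        self.keyspaces.add(keyspace)
        return True


def run_scenario():
    node3 = Node("127.0.0.3")

    # Steps 2-3: 127.0.0.1 goes away and node3 starts ignoring it (assumption).
    node3.on_endpoint_removed("127.0.0.1")

    # Step 3: node2 restarts with the address 127.0.0.1 (fresh in-memory state).
    node2 = Node("127.0.0.1")

    # Step 4: create a keyspace on node2.
    node2.keyspaces.add("ks")

    # Step 5: node3 drops the change because the sender address is ignored.
    direct = node3.on_schema_push(node2.address, "ks")

    # Note: node1 returns with a new address, accepts the change from node2,
    # and relays it to node3, which does not ignore the new address.
    node1 = Node("127.0.0.4")
    node1.on_schema_push(node2.address, "ks")
    relayed = node3.on_schema_push(node1.address, "ks")

    return direct, relayed
```

In this model the direct push from node2 is dropped, while the relay through the restarted node1 succeeds, which matches the note above.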

cass-operator version

1.15.0

Kubernetes version

1.24

Method of installation

No response

Anything else we need to know?

No response

┆Issue is synchronized with this Jira Story by Unito ┆Issue Number: CASS-22

burmanm commented 1 year ago

I assume this is the same as https://github.com/k8ssandra/cass-operator/issues/130 ?

adejanovski commented 1 year ago

@burmanm, it seems like a different (although somewhat related) issue. Here the nodes won't refuse to start, which is apparently what's described in #130. I'm not sure how the operator could detect that 🤔 The other nodes are the ones ignoring the node that inherited an old IP, so that node cannot tell (or can it?) that it's getting ignored. Unless we can detect some schema update failures in the mgmt-api and bounce the node so that it gets a new IP?
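
One possible reading of "detect some schema update failures" is a schema agreement check: compare the schema version each node reports (Cassandra exposes it as `schema_version` in `system.local` and `system.peers`) and flag nodes that differ from the majority. The Python sketch below only shows the comparison. How the mgmt-api would collect the versions is not shown, the function names are hypothetical, and whether this check would actually catch the CASSANDRA-17883 case is untested.

```python
from collections import defaultdict


def group_by_schema_version(versions):
    """Group endpoints by the schema version they report.

    `versions` maps endpoint address -> schema version string.
    """
    groups = defaultdict(list)
    for endpoint, version in versions.items():
        groups[version].append(endpoint)
    return {version: sorted(endpoints) for version, endpoints in groups.items()}


def minority_endpoints(versions):
    """Return endpoints whose schema version differs from the largest group."""
    groups = group_by_schema_version(versions)
    if len(groups) <= 1:
        return []  # all nodes agree (or no data)
    majority_version = max(groups, key=lambda v: len(groups[v]))
    return sorted(
        endpoint
        for version, endpoints in groups.items()
        if version != majority_version
        for endpoint in endpoints
    )
```

For example, if two nodes report version `"a"` and one reports `"b"`, `minority_endpoints` returns the address of the node on `"b"`, which would be the candidate to bounce.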