k8ssandra / cass-operator

The DataStax Kubernetes Operator for Apache Cassandra
https://docs.datastax.com/en/cass-operator/doc/cass-operator/cassOperatorGettingStarted.html
Apache License 2.0
179 stars 63 forks

Starting process should ignore Pending state pods and start the next one in the rack #696

Open kos-team opened 2 weeks ago

kos-team commented 2 weeks ago

What happened?

We are trying to resume a stopped Cassandra cluster. The cluster may not have enough resources to schedule all the Pods, and we found that in some cases the seed nodes in racks may be stuck in the Pending state while other nodes are scheduled but cannot start correctly because they lack a seed node.

What did you expect to happen?

The operator should prioritize seed nodes when creating the Cassandra pods. If the seed node is not scheduled, the entire rack is defunct.

How can we reproduce it (as minimally and precisely as possible)?

  1. Deploy the cass-operator
  2. Deploy a CassandraDatacenter with the following CR YAML in a resource-limited cluster, for example a Kind cluster with only 4 nodes.
    apiVersion: cassandra.datastax.com/v1beta1
    kind: CassandraDatacenter
    metadata:
      name: test-cluster
    spec:
      clusterName: development
      serverType: cassandra
      serverVersion: "4.1.2"
      managementApiAuth:
        insecure: {}
      size: 6
      storageConfig:
        cassandraDataVolumeClaimSpec:
          storageClassName: standard
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 10Gi
      racks:
        - name: rack1
      config:
        jvm-server-options:
          initial_heap_size: "1G"
          max_heap_size: "1G"
        cassandra-yaml:
          num_tokens: 16
          authenticator: PasswordAuthenticator
          authorizer: CassandraAuthorizer
          role_manager: CassandraRoleManager
  3. In some cases the seed node of the rack is not scheduled while other nodes are being scheduled.

cass-operator version

1.22.0

Kubernetes version

1.29.1

Method of installation

Helm

Anything else we need to know?

No response


burmanm commented 2 weeks ago

There are no separate seed nodes in our system; the seed labels are assigned by an algorithm to available nodes, such as the first one to start. The system should always have a minimum of 3 seed nodes (if there are 3 or more pods available), and they are set once the pod has reached the Started state.
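To make the labeling behavior concrete, here is a minimal sketch of that selection rule: seeds are not fixed nodes but labels applied, in start order, to pods that have already reached the Started state. The `Pod` struct and `pickSeeds` function are illustrative names, not cass-operator's actual types.

```go
package main

import "fmt"

// Pod loosely mirrors the state the operator tracks; illustrative only.
type Pod struct {
	Name    string
	Started bool
}

// pickSeeds labels up to `want` pods as seeds, considering only pods
// that have reached the Started state, in order. Pods that never start
// (e.g. stuck in Pending) are never labeled.
func pickSeeds(pods []Pod, want int) []string {
	var seeds []string
	for _, p := range pods {
		if len(seeds) == want {
			break
		}
		if p.Started {
			seeds = append(seeds, p.Name)
		}
	}
	return seeds
}

func main() {
	pods := []Pod{
		{Name: "rack1-sts-0", Started: true},
		{Name: "rack2-sts-0", Started: false}, // Pending: never labeled
		{Name: "rack3-sts-0", Started: true},
		{Name: "rack1-sts-1", Started: true},
	}
	fmt.Println(pickSeeds(pods, 3)) // → [rack1-sts-0 rack3-sts-0 rack1-sts-1]
}
```

This matches the observation in the issue: a cluster where only one pod ever reaches Started ends up with exactly one seed-labeled pod.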

How did you resume the stopped system so that seed labels ended up on nodes that are in the Pending state?

kos-team commented 2 weeks ago

@burmanm Thanks for the explanation. I double-checked, and there are no seed-node labels on the unready Pods. Only one Pod came up, and that's the only Pod with the seed-node label. This may not be a scheduling issue.

I just set spec.stopped to true and then back to false. Four Pods are scheduled successfully by the Kubernetes cluster, due to the physical resource constraint. However, only the rack1-sts-0 Pod eventually becomes Ready; the other 3 scheduled Pods come up but never become Ready.

Running the nodetool status against the only ready Pod gives the following result:

Datacenter: test-cluster
========================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address      Load        Tokens  Owns (effective)  Host ID                               Rack 
UN  10.244.2.24  143.24 KiB  16      64.7%             18060b30-fba4-452c-a5c9-6e5429265a26  rack1
DN  10.244.1.26  ?           16      59.3%             92f2914a-179c-404d-9c15-aa81bfa3bdf9  rack2
DN  10.244.3.26  ?           16      76.0%             a5428cd9-c0c2-490b-a2d0-7c513f8db691  rack3

while the currently scheduled Pods have the following IPs:

NAME                                   READY   STATUS    RESTARTS   AGE   IP            NODE           NOMINATED NODE   READINESS GATES
cass-operator-5d6c68d84b-tq52g         1/1     Running   0          72m   10.244.2.2    kind-worker4   <none>           <none>
development-test-cluster-rack1-sts-0   2/2     Running   0          18m   10.244.2.24   kind-worker4   <none>           <none>
development-test-cluster-rack1-sts-1   1/2     Running   0          18m   10.244.1.28   kind-worker    <none>           <none>
development-test-cluster-rack2-sts-0   0/2     Pending   0          18m   <none>        <none>         <none>           <none>
development-test-cluster-rack2-sts-1   1/2     Running   0          18m   10.244.4.18   kind-worker3   <none>           <none>
development-test-cluster-rack3-sts-0   1/2     Running   0          18m   10.244.3.27   kind-worker2   <none>           <none>
development-test-cluster-rack3-sts-1   0/2     Pending   0          18m   <none>        <none>         <none>           <none>

It seems that old Pods' IPs still exist in the membership list?

burmanm commented 2 weeks ago

> Only one Pod came up and that's the only Pod with the seed-node label.

This is as it should be; we only label seeds when they come up. The definition of a seed isn't strict, so it can be any node: we use a service with a label selector to find the nodes we consider suitable as seeds at that point. They can change as pods go down and come up.

The situation you have here is probably not related to the seed nodes themselves, but to the fact that you happen to have Pending pods in a certain order. Our starting process keeps a running index and tries to keep each rack with an equal number of nodes up, starting pods in order 0->n. But because some of your racks randomly have pods stuck in Pending, the process waits for those to be scheduled (the operator can't know from the scheduler that they're not coming up).

Solving this correctly would require information from the scheduler, or a workaround to ignore Pending pods. I'll change the title of this issue to indicate that we need to ignore Pending pods in our starting scheduler.
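The proposed workaround can be sketched as follows: walk the candidate pods in index order, prefer the rack with the fewest ready nodes to keep racks balanced, and skip pods the scheduler has left Pending instead of blocking on them. This is illustrative pseudologic under assumed types (`Pod`, `nextToStart`), not cass-operator's actual implementation.

```go
package main

import "fmt"

// Pod is a simplified view of what the operator would see; illustrative only.
type Pod struct {
	Name  string
	Rack  string
	Phase string // e.g. "Pending", "Running"
	Ready bool
}

// nextToStart picks the next pod to start: racks with fewer ready pods
// win, pods are considered in index order (0->n), and Pending pods are
// skipped rather than waited on -- the fix this issue asks for.
func nextToStart(pods []Pod) (string, bool) {
	readyPerRack := map[string]int{}
	for _, p := range pods {
		if p.Ready {
			readyPerRack[p.Rack]++
		}
	}
	best := -1
	for i, p := range pods {
		if p.Ready || p.Phase == "Pending" {
			continue // already up, or unschedulable: don't block on it
		}
		if best == -1 || readyPerRack[p.Rack] < readyPerRack[pods[best].Rack] {
			best = i
		}
	}
	if best == -1 {
		return "", false
	}
	return pods[best].Name, true
}

func main() {
	// The pod layout from the issue: rack2-sts-0 and rack3-sts-1 are Pending.
	pods := []Pod{
		{"rack1-sts-0", "rack1", "Running", true},
		{"rack1-sts-1", "rack1", "Running", false},
		{"rack2-sts-0", "rack2", "Pending", false}, // blocks the current logic; skipped here
		{"rack2-sts-1", "rack2", "Running", false},
		{"rack3-sts-0", "rack3", "Running", false},
		{"rack3-sts-1", "rack3", "Pending", false},
	}
	name, _ := nextToStart(pods)
	fmt.Println(name) // → rack2-sts-1
}
```

With the current behavior, the Pending rack2-sts-0 would stall the whole sequence; skipping it lets the already-scheduled rack2-sts-1 start next.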

However, I think for now the other workaround is to use the cassandra.datastax.com/allow-parallel-starts annotation, as that would have started the remaining pods as well (if they have previously bootstrapped, which I assume they have). That process only needs one node to start normally before starting the rest.
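For reference, the annotation mentioned above would go on the CassandraDatacenter metadata. The annotation key comes from the comment above; the `"true"` value is an assumption, so check the cass-operator docs for the exact accepted values:

    apiVersion: cassandra.datastax.com/v1beta1
    kind: CassandraDatacenter
    metadata:
      name: test-cluster
      annotations:
        # value assumed; verify against the cass-operator documentation
        cassandra.datastax.com/allow-parallel-starts: "true"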

As for Stopped processing (I'm not sure if this was clear): the cluster membership state is not changed. Cassandra will remember all the nodes it had before stopped was set to true. IPs might change, but the host IDs will remain the same.