Open kos-team opened 2 months ago
There are no separate seed nodes in our system; the seed labels are assigned by an algorithm to available nodes, such as the first one to start. The system should always have a minimum of 3 seed nodes (if there are 3 or more pods available), and they are set once a pod has reached the Started state.
How did you resume the stopped system to get seed labels onto nodes that are in Pending state?
@burmanm Thanks for the explanation. I double-checked and there are no seed-node labels on the unready Pods. Only one Pod came up, and that's the only Pod with the seed-node label. This may not be a scheduling issue.
I just set `spec.stopped` to `true` and then to `false`. Four Pods are scheduled successfully by the Kubernetes cluster, given the physical resource constraints. However, only the rack1-sts-0 Pod eventually becomes Ready; the other 3 scheduled Pods come up but never become Ready.
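For context, the stop/resume cycle described above is done by toggling the `stopped` field on the CassandraDatacenter spec. A minimal sketch (the resource name and apiVersion here are illustrative, not taken from this thread):

```yaml
# Hypothetical CassandraDatacenter fragment; only the relevant field is shown.
apiVersion: cassandra.datastax.com/v1beta1
kind: CassandraDatacenter
metadata:
  name: test-cluster
spec:
  stopped: true   # set back to false to resume the datacenter
```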
Running `nodetool status` against the only Ready Pod gives the following result:
```
Datacenter: test-cluster
========================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address      Load        Tokens  Owns (effective)  Host ID                               Rack
UN  10.244.2.24  143.24 KiB  16      64.7%             18060b30-fba4-452c-a5c9-6e5429265a26  rack1
DN  10.244.1.26  ?           16      59.3%             92f2914a-179c-404d-9c15-aa81bfa3bdf9  rack2
DN  10.244.3.26  ?           16      76.0%             a5428cd9-c0c2-490b-a2d0-7c513f8db691  rack3
```
while the currently scheduled Pods have the following IPs:
```
NAME                                   READY  STATUS   RESTARTS  AGE  IP           NODE          NOMINATED NODE  READINESS GATES
cass-operator-5d6c68d84b-tq52g         1/1    Running  0         72m  10.244.2.2   kind-worker4  <none>          <none>
development-test-cluster-rack1-sts-0   2/2    Running  0         18m  10.244.2.24  kind-worker4  <none>          <none>
development-test-cluster-rack1-sts-1   1/2    Running  0         18m  10.244.1.28  kind-worker   <none>          <none>
development-test-cluster-rack2-sts-0   0/2    Pending  0         18m  <none>       <none>        <none>          <none>
development-test-cluster-rack2-sts-1   1/2    Running  0         18m  10.244.4.18  kind-worker3  <none>          <none>
development-test-cluster-rack3-sts-0   1/2    Running  0         18m  10.244.3.27  kind-worker2  <none>          <none>
development-test-cluster-rack3-sts-1   0/2    Pending  0         18m  <none>       <none>        <none>          <none>
```
It seems that the old Pods' IPs still exist in the membership list?
> Only one Pod came up and that's the only Pod with the seed-node label.
This is as it should be; we only label seeds when they come up. The definition of a seed isn't strict, so it can be any node; we use a Service with a label selector to find the nodes we consider suitable as seeds at that point. They can change as pods go down and come up.
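As a sketch of that mechanism (the Service name, label key, and headless configuration below are assumptions based on common cass-operator conventions, not confirmed in this thread):

```yaml
# Sketch of a seed-discovery Service: a headless Service whose selector
# matches the seed label the operator applies to pods after they start.
apiVersion: v1
kind: Service
metadata:
  name: development-seed-service        # name is an assumption
spec:
  clusterIP: None                       # headless: DNS resolves to seed pod IPs
  publishNotReadyAddresses: true        # seeds must be resolvable even before Ready
  selector:
    cassandra.datastax.com/seed-node: "true"   # assumed label key; applied only once a pod starts
```

Because the label is applied dynamically, the set of pods behind this selector (and hence the seed list) shifts as pods go down and come up.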
The situation you have here is probably not related to the seed nodes themselves, but to the fact that you happen to have Pending pods in a certain order. Our starting process keeps a running index and tries to keep each rack with an equal number of nodes up, starting in order 0->n. But because some of your racks' pods are randomly stuck in Pending, the process waits for those to be scheduled (the operator can't know from the scheduler that they're not coming up).
Solving this correctly would require information from the scheduler, or a workaround to ignore Pending pods. I'll change the topic of this issue to indicate that we need to ignore Pending pods in our starting scheduler.
However, I think for now the other workaround is to use the `cassandra.datastax.com/allow-parallel-starts` annotation, as that would have started the remaining pods as well (assuming they have previously bootstrapped, which I assume they have). That process only needs one node to start normally before starting the rest.
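Applying the annotation might look like the following sketch; the resource name and apiVersion are illustrative, and `"true"` as the value is an assumption:

```yaml
# Hedged sketch: enabling parallel starts on the CassandraDatacenter.
apiVersion: cassandra.datastax.com/v1beta1
kind: CassandraDatacenter
metadata:
  name: test-cluster
  annotations:
    cassandra.datastax.com/allow-parallel-starts: "true"   # assumed value format
```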
As for Stopped processing (I'm not sure if this was clear), the cluster membership state is not changed. Cassandra will remember all the nodes it had before `stopped` was set to `true`. IPs might change, but the node IDs remain the same.
What happened?
We are trying to resume a stopped Cassandra cluster. The cluster may not have enough resources to schedule all the Pods, and we found that in some cases the seed nodes in racks may be in Pending state while other nodes are scheduled but cannot start correctly because of the lack of a seed node.
What did you expect to happen?
The operator should prioritize seed nodes when creating the Cassandra pods. If the seed node is not scheduled, the entire rack is defunct.
How can we reproduce it (as minimally and precisely as possible)?
cass-operator version
1.22.0
Kubernetes version
1.29.1
Method of installation
Helm
Anything else we need to know?
No response
┆Issue is synchronized with this Jira Story by Unito ┆Issue Number: CASS-1