jetstack / navigator

Managed Database-as-a-Service (DBaaS) on Kubernetes
Apache License 2.0

Cassandra cluster can not recover if a C* pod and its data are deleted #350

Open wallrj opened 6 years ago

wallrj commented 6 years ago

If a C* Pod and its data are deleted, the StatefulSet controller will start a replacement Pod, but with an empty /var/lib/cassandra data directory.

This will cause a new Cassandra node UUID to be generated, and when the C* node attempts to join the cluster, it will be considered a new node rather than a replacement.

This is much more likely if the cluster is configured to use persistent local storage rather than re-attachable networked PVs.

You can work around this by storing the C* node UUID as a field on e.g. a Pilot resource or centrally on the CassandraCluster resource, so that when a pilot is restarted, it can discover the original C* node UUID and start the Cassandra process with -Dcassandra.replace_address=<original_uuid>.

See cassandra-kubernetes-hostid, written as part of Improve Cassandra Example, which suggests that you can supply the "hostid" to -Dcassandra.replace_address when starting the C* node.
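A minimal sketch of the first half of that workaround, assuming a pilot sidecar that can reach its co-located C* node over CQL and has somewhere (e.g. the Pilot status) to record the value; the gocql usage is illustrative and not Navigator's actual code:

```go
package main

import (
	"fmt"
	"log"

	"github.com/gocql/gocql"
)

// fetchLocalHostID asks the co-located Cassandra node for its host ID,
// i.e. the UUID the rest of the cluster knows it by.
func fetchLocalHostID(cqlAddr string) (string, error) {
	cluster := gocql.NewCluster(cqlAddr) // e.g. 127.0.0.1 or the pod IP
	session, err := cluster.CreateSession()
	if err != nil {
		return "", err
	}
	defer session.Close()

	var hostID gocql.UUID
	// system.local holds exactly one row describing this node.
	if err := session.Query(`SELECT host_id FROM system.local`).Scan(&hostID); err != nil {
		return "", err
	}
	return hostID.String(), nil
}

func main() {
	id, err := fetchLocalHostID("127.0.0.1")
	if err != nil {
		log.Fatal(err)
	}
	// A real pilot would persist this on its Pilot (or CassandraCluster)
	// resource rather than just printing it.
	fmt.Println("host ID:", id)
}
```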

/kind bug

munnerz commented 6 years ago

I'm going to reclassify this as a feature, as storing the UUID on the Pilot resource is something we did not previously support, and so when it is lost through some external means it is expected not to be recoverable 😄

/kind feature

yanniszark commented 6 years ago

Just for clarification :smile: When a Pod is deleted, the PV attached to that pod should not be deleted. That means there are two types of failure:

  1. The Pod is deleted/restarted but the node is healthy. If using local storage (I assume we are talking about Local Persistent Volumes (Beta), because hostPath is not supposed to be used for multi-node clusters), then from what I know this should be handled by C*, because you still have the PV with the data (source); see the sketch at the end of this comment.
  2. The node is unhealthy and all its resources are unavailable. This is the case where a node replacement should occur, and I assume this is what the issue is about. When using Network-Attached Storage (like on a Cloud Provider - NOT recommended by DataStax) the volume can be re-attached and this becomes a type 1 failure.

If I got something wrong, please correct me.
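To make the type 1 case concrete, here is a small client-go sketch that checks the claim survives a pod restart: the PVC created from the StatefulSet's volumeClaimTemplate stays bound to the same PV, so the replacement pod gets its old data back. The namespace, claim-template name and pod name are assumptions for illustration:

```go
package main

import (
	"context"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load kubeconfig; inside a pilot this would be rest.InClusterConfig().
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}

	// Assumed claim name: volumeClaimTemplate "cassandra-data" plus the
	// StatefulSet pod name "cassandra-ringnodes-0".
	pvc, err := client.CoreV1().PersistentVolumeClaims("default").Get(
		context.TODO(), "cassandra-data-cassandra-ringnodes-0", metav1.GetOptions{})
	if err != nil {
		log.Fatal(err)
	}

	// If the PVC is still Bound to the same PV after a pod deletion, the
	// replacement pod comes back with its old /var/lib/cassandra contents.
	fmt.Printf("PVC %s is %s, bound to PV %s\n", pvc.Name, pvc.Status.Phase, pvc.Spec.VolumeName)
}
```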

wallrj commented 6 years ago

Thanks @yanniszark. That's an accurate summary :+1:

yanniszark commented 6 years ago

I got too busy with the details and forgot to ask the actual question :smile: So in this issue, you are talking about a type 2 failure using local storage?

wallrj commented 6 years ago

So in this issue, you are talking about a type 2 failure using local storage?

Yep. :-) Are you using / testing Navigator? Or interested in contributing? We'd be very interested to get your feedback on the project.

yanniszark commented 6 years ago

I am interested in both :smile: For my thesis, I am looking at developing a cloud-native solution for C* to run on K8s. This project is very interesting and I have learned a lot by browsing the issues you have encountered. I am also looking at the Priam project by Netflix. Their model is very similar to what you have: a sidecar running alongside C* and centralized storage (SimpleDB in their case, etcd in K8s). Their system has been tested in production for many years, so I was thinking it could provide some good guidelines.

yanniszark commented 6 years ago

A little follow-up on this: from the source code, it seems that supplying a host ID to replace_address is not feasible. I will permalink the files I traced to arrive at this conclusion:

First, in StorageService.java where the join function for a new node is located: https://github.com/apache/cassandra/blob/5cc68a87359dd02412bdb70a52dfcd718d44a5ba/src/java/org/apache/cassandra/service/StorageService.java#L475-L490

As we can see, it calls DatabaseDescriptor.getReplaceAddress() to get the replace address. This is of InetAddressAndPort type, so incompatible with a UUID from the start, but we'll keep digging in case the UUID is resolved into an address before reaching this point.

The DatabaseDescriptor.getReplaceAddress() function:

https://github.com/apache/cassandra/blob/5cc68a87359dd02412bdb70a52dfcd718d44a5ba/src/java/org/apache/cassandra/config/DatabaseDescriptor.java#L1351-L1365

It calls InetAddressAndPort.getByName to retrieve the replace_address. The relevant function is this:

https://github.com/apache/cassandra/blob/5cc68a87359dd02412bdb70a52dfcd718d44a5ba/src/java/org/apache/cassandra/locator/InetAddressAndPort.java#L137-L157

That in turn calls HostAndPort.fromString to parse the address; HostAndPort is part of the Google Core Libraries (Guava). A description can be found here.

Consequently, it would seem that it is not possible to provide a host_id to the replace_address option. In older releases of Cassandra there was a replace_node option which accepted a UUID, but it was deprecated in favor of replace_address.

In this case, it seems the way to go is to store the IP address of the node.
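A minimal sketch of what that could look like on the pilot side, assuming the dead node's IP has already been recorded somewhere and arrives via an environment variable, and assuming a cassandra-env.sh that appends JVM_EXTRA_OPTS to the JVM options (the variable names are illustrative, not Navigator's actual wiring):

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

func main() {
	// The IP the dead node previously had, recorded e.g. on the
	// CassandraCluster resource before the node was lost (assumed env var).
	replaceIP := os.Getenv("CASSANDRA_REPLACE_ADDRESS")

	env := os.Environ()
	if replaceIP != "" {
		// replace_address expects an address (not a host ID), which is what
		// DatabaseDescriptor.getReplaceAddress() resolves above.
		env = append(env,
			fmt.Sprintf("JVM_EXTRA_OPTS=-Dcassandra.replace_address=%s", replaceIP))
	}

	cmd := exec.Command("cassandra", "-f") // run Cassandra in the foreground
	cmd.Env = env
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Run(); err != nil {
		fmt.Fprintln(os.Stderr, "cassandra exited:", err)
		os.Exit(1)
	}
}
```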