**Closed** — @lilic closed this issue 6 years ago.
I think that for the moment, we expect our services to be stateless, and that stateful apps will be outside of the cluster.
So maybe this can be re-visited at a later point?
Cons:

> If an application doesn't require any stable identifiers or ordered deployment, deletion, or scaling, you should deploy your application with a controller that provides a set of stateless replicas.
Interesting article detailing some pros and cons.
Another option (suggested by @blixtra) is that we support both Deployments and StatefulSets, depending on the specific use case. For example, Habitat services that require persistence and would benefit from the features of StatefulSets would be deployed as those.
Chef Server is a great example of an app we would love to deploy as a StatefulSet, since Elasticsearch and PostgreSQL are the backends in the stack, and they require stable, persistent storage.
@jeremymv2 The article you posted doesn't make a very compelling case for using StatefulSets IMO. The main point it makes is that StatefulSets have `terminationGracePeriodSeconds`, but I'm not sure how important that is for us.
@asymmetric I've been going back and forth myself trying to understand when persistent storage wouldn't make sense to implement under a Deployment in the operator.
My number one concern is to avoid the possibility of data corruption via concurrent pod access if Deployment replicas are > 1.
There is some good info here regarding storage guarantees for pods: https://github.com/kubernetes/community/blob/master/contributors/design-proposals/storage/pod-safety.md#guarantees-provided-by-replica-sets-and-replication-controllers
Reading the above gives me some pause for concern:

> ReplicaSets and ReplicationControllers both attempt to preserve availability of their constituent pods over ensuring at most one (of a pod) semantics. So a replica set to scale 1 will immediately create a new pod when it observes an old pod has begun graceful deletion, and as a result at many points in the lifetime of a replica set there will be 2 copies of a pod's processes running concurrently. Only access to exclusive resources like storage can prevent that simultaneous execution.
>
> Deployments, being based on replica sets, can offer no stronger guarantee.
One thing I'm still unclear on, though: could a `PersistentVolumeClaim`, which is usable in a Deployment, be the extra item that ensures there is no concurrent access violation?
It seems the issues outlined here make implementing persistence with `Deployment`s a bad idea: `Pod`s would end up sharing `Volume`s, which means DB processes would have to compete for access to the same filesystem.
Unless there are other considerations, I'd consider this closed.
/cc @blixtra.
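To make the sharing problem concrete, here's a rough sketch (all names and images are illustrative, not from the operator): a Deployment can only reference a single pre-existing claim, which every replica mounts, while a StatefulSet's `volumeClaimTemplates` stamp out one claim per pod.

```yaml
# Deployment: all replicas mount the SAME PersistentVolumeClaim,
# so two concurrently running pods can write to one filesystem.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: db
spec:
  replicas: 2
  selector:
    matchLabels: {app: db}
  template:
    metadata:
      labels: {app: db}
    spec:
      containers:
      - name: db
        image: postgres:9.6
        volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: db-data   # one shared claim for all replicas
---
# StatefulSet: volumeClaimTemplates create one PVC per pod
# (data-db-0, data-db-1, ...), so pods never share storage.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: db
spec:
  serviceName: db
  replicas: 2
  selector:
    matchLabels: {app: db}
  template:
    metadata:
      labels: {app: db}
    spec:
      containers:
      - name: db
        image: postgres:9.6
        volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: [ReadWriteOnce]
      resources:
        requests:
          storage: 10Gi
```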
@asymmetric Why close this? I thought we agreed to go with `StatefulSet`s? :)
Seems like you've all done the research. So if the outcome is to use StatefulSets then fine.
But the question I still see open is whether it's really so much extra work to support both StatefulSets and Deployments. What are the disadvantages for stateless applications if we go with an all-StatefulSet solution? How much does the ordering requirement affect stateless applications in practice: slower to deploy and remove, would there always be a mount even if not used, etc.?
@LiliC Sorry, I thought the title was "Should we...?", and since we answered that, we could close :)
@blixtra as @LiliC mentioned, the ordering can be relaxed.
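For reference, the ordering guarantee can be relaxed via `podManagementPolicy` (available since Kubernetes 1.7), so stateless apps in a StatefulSet wouldn't pay the sequential startup cost. A minimal sketch (names illustrative):

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: example
spec:
  serviceName: example
  replicas: 3
  # Launch and terminate all pods in parallel instead of one at a
  # time in ordinal order (OrderedReady is the default).
  podManagementPolicy: Parallel
  selector:
    matchLabels: {app: example}
  template:
    metadata:
      labels: {app: example}
    spec:
      containers:
      - name: app
        image: nginx
```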
To answer your questions from before:
Just realized we don't necessarily need the Headless Service.
Habitat services use the gossip protocol to find each other (as long as they can bootstrap with one peer), and they find each other by IP; which means that the main use of the Headless Service, i.e. returning a list of DNS names of members, does not apply to us.
Another thing (as mentioned in today's standup):
There's the possibility that we could use an alternative approach to ring joining than the current `peer-watch-file`+`ConfigMap` based one:

- start each supervisor with `--peer X`, where X is the DNS name of one of the nodes (meaning, one of the supervisors would have itself as the bootstrap)
- e.g. start each `Pod` with `--peer $hostname-of-pod-0` as argument, and the ring would be formed

This would have the advantage of being more Kubernetes-native, and the downside of forcing us to start a Headless Service.
Not saying we should do this, just writing thoughts down.
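A rough sketch of that idea (names hypothetical; assumes a headless Service called `hab-svc` and the standard StatefulSet DNS scheme `<pod>.<service>`):

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: hab
spec:
  serviceName: hab-svc        # headless Service providing stable DNS names
  replicas: 3
  selector:
    matchLabels: {app: hab}
  template:
    metadata:
      labels: {app: hab}
    spec:
      containers:
      - name: sup
        image: habitat/example-service   # hypothetical image
        # Every supervisor bootstraps off pod 0's stable DNS name;
        # pod 0 simply peers with itself, which is harmless.
        args: ["--peer", "hab-0.hab-svc"]
```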
@asymmetric That's an interesting strategy, and in truth I don't think you even need the Headless Service; just the use of `hostname` is enough for the pod. I've got a POC of that working here: https://github.com/jeremymv2/launch-chef-in-kubernetes/blob/master/chef-server-pod.yml#L46-L47

This allows the services in the POD to form the ring.
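The relevant trick looks roughly like this (simplified; container names and images are illustrative, details may differ from the linked manifest): the Pod sets an explicit `hostname`, which the kubelet makes resolvable inside the pod, and since all containers in a Pod share one network namespace, the supervisors can peer via that name without any Service.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: chef-server
spec:
  hostname: chef-server   # resolvable from within the pod itself
  containers:
  - name: postgresql
    image: example/hab-postgresql     # illustrative image
    args: ["--peer", "chef-server"]
  - name: elasticsearch
    image: example/hab-elasticsearch  # illustrative image
    args: ["--peer", "chef-server"]
```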
The main thing I'm trying to figure out now is whether we want to let users decide what kind of `PersistentVolume` they will provision, or if the operator should/can force the decision.
The downside with us doing it is that designing a solution that works (across environments) and is easy to use has eluded me so far.
- `hostPath` doesn't work on multi-node setups, as it has no notion of node affinity (so it's a no-go)
- `local` is alpha, and doesn't yet support dynamic provisioning
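For contrast, this is what the node-affinity difference looks like: a `local` PV carries an explicit node affinity, so the scheduler keeps the pod on the node that actually holds the data, which `hostPath` can't express. (At the time this was behind an alpha annotation; current Kubernetes uses the `nodeAffinity` field shown here. Names and paths are illustrative.)

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-pv-0
spec:
  capacity:
    storage: 10Gi
  accessModes: [ReadWriteOnce]
  storageClassName: local-storage
  local:
    path: /mnt/disks/ssd0   # directory that exists on the target node
  # Pins pods using this volume to the node that has the data;
  # hostPath has no equivalent, so pods can land on the wrong node.
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values: [node-1]
```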
So I guess that leaves us with either:

- `local` and static provisioning, or
- allowing users to define their own storage classes, with dynamic provisioning

The static case would look like this:

- we create a `StorageClass` with provisioner = `local` and a default name (`foo`)
- we create a `PersistentVolume` object for each Pod in the `StatefulSet`, with storageClass = `foo`
- the `StatefulSet` binds the PVC to the PV
- this means we need to know the number of `Pod`s in advance, and create `PV`s accordingly

The dynamic one would look like this:

- enable the `DefaultStorageClass` admission controller
- users create a `StorageClass`, with a `provisioner` field of their choosing (e.g. `kubernetes.io/glusterfs`)
- users reference the `StorageClass` in the CRD
- when a `Habitat` object is created, the `PersistentVolume` is automatically created and mounted on the `Pod`
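A sketch of the dynamic flow (the StorageClass name is illustrative, and the PVC shown is the kind the operator would generate, not an agreed-on schema): the user creates a StorageClass with their provisioner, and each pod's claim references it, so PVs are provisioned on demand.

```yaml
# Created by the user: any provisioner they like.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: habitat-storage           # illustrative name
provisioner: kubernetes.io/glusterfs
---
# Generated by the operator from the Habitat object: the claim
# references the user's StorageClass, so the PV is created on demand.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-habitat-0
spec:
  accessModes: [ReadWriteOnce]
  storageClassName: habitat-storage
  resources:
    requests:
      storage: 10Gi
```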
Some links:
Let me know if I missed something.
> allowing users to define their own storage classes, with dynamic provisioning
I would strongly say we should go for that option: there are many different types of volumes for a reason, and each use case needs its own. Choosing one for the user beforehand is impossible, since we can't predict their desired use case.
Agreed, Dynamic is the only way to go here if it is to be widely adopted.
> When a Habitat object is created, the PersistentVolume is automatically created and mounted on the Pod
I'm curious if it will also be possible to utilize a `PersistentVolume` that has been pre-provisioned by an admin?
@jeremymv2 Yes, that will be possible. It all depends on what `StorageClass` the user specifies in the CRD. If the `StorageClass` matches the one provided by an existing `PersistentVolume` object, that's the one that will be used.
From the docs:
> When none of the static PVs the administrator created matches a user's PersistentVolumeClaim, the cluster may try to dynamically provision a volume specially for the PVC
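Concretely (names illustrative): a pre-provisioned PV is matched before any dynamic provisioning is attempted, as long as its `storageClassName`, capacity, and access modes satisfy the claim.

```yaml
# Pre-provisioned by an admin.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: admin-pv
spec:
  capacity:
    storage: 10Gi
  accessModes: [ReadWriteOnce]
  storageClassName: habitat-storage   # must match the claim below
  nfs:                                # any volume source works here
    server: nfs.example.com
    path: /exports/habitat
---
# The claim the operator creates; it binds to admin-pv because the
# storageClassName, size, and access mode all match.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-habitat-0
spec:
  accessModes: [ReadWriteOnce]
  storageClassName: habitat-storage
  resources:
    requests:
      storage: 10Gi
```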
Currently we are using Deployments to deploy our Habitat services, but since we do not know what we are deploying or what type of service it is (it could be anything from a DB to a simple Rails application), we should not just assume our Habitat service is stateless.
A couple of advantages of StatefulSets:

These would be very useful, especially if our service is, for example, a DB.