Explore how to manage terraform state on a multi-cluster platform

digitalronin commented 4 years ago

If we have applications running across multiple clusters (even if that's only temporarily, while we replace a cluster), how would we manage Terraform state?

Currently, every namespace in the cluster has its own Terraform state file. Some of the items in the Terraform state are AWS resources, which should stay the same from one cluster to another. Others are Kubernetes objects, which may need to be created in every cluster where the namespace/service exists.
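To make the split described above concrete, here is a minimal sketch that groups the resources in one namespace's state into "shared" (AWS) and "per-cluster" (Kubernetes) by resource type prefix. The example state is hypothetical and heavily trimmed, and the function name and group labels are made up for illustration; real state files carry many more fields.

```python
import json

# Hypothetical, trimmed-down state document for one namespace.
# Only the fields this sketch reads are included.
EXAMPLE_STATE = json.loads("""
{
  "version": 4,
  "resources": [
    {"mode": "managed", "type": "aws_db_instance", "name": "rds"},
    {"mode": "managed", "type": "kubernetes_secret", "name": "rds_credentials"},
    {"mode": "managed", "type": "kubernetes_namespace", "name": "app"}
  ]
}
""")


def partition_resources(state):
    """Group resource addresses by whether they follow the cluster or not.

    Assumption: the resource type prefix ("aws_", "kubernetes_") is a good
    enough signal for which provider owns the resource.
    """
    groups = {"shared": [], "per_cluster": [], "other": []}
    for resource in state.get("resources", []):
        address = f'{resource["type"]}.{resource["name"]}'
        if resource["type"].startswith("aws_"):
            groups["shared"].append(address)
        elif resource["type"].startswith("kubernetes_"):
            groups["per_cluster"].append(address)
        else:
            groups["other"].append(address)
    return groups


if __name__ == "__main__":
    print(partition_resources(EXAMPLE_STATE))
```

Anything that lands in "per_cluster" is a candidate for being applied once per cluster, while "shared" resources should only ever exist once.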

Even for the AWS resources, such as RDS instances, how do we ensure that the Kubernetes secrets containing the access credentials are available on all required clusters in the Cloud Platform?
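One way to think about this: the credentials come from a single AWS resource, so the same Secret manifest could be rendered once per cluster. The sketch below only builds the manifests as plain dictionaries and does not talk to any cluster. The cluster names, secret name, namespace and helper functions are all hypothetical.

```python
import base64


def build_secret_manifest(name, namespace, credentials):
    """Render a Kubernetes Secret manifest (as a dict) from plain-text values.

    Values under "data" in a Secret must be base64 encoded.
    """
    encoded = {
        key: base64.b64encode(value.encode("utf-8")).decode("ascii")
        for key, value in credentials.items()
    }
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name, "namespace": namespace},
        "type": "Opaque",
        "data": encoded,
    }


def manifests_per_cluster(clusters, name, namespace, credentials):
    """Return one identical manifest per cluster, keyed by cluster name."""
    return {
        cluster: build_secret_manifest(name, namespace, credentials)
        for cluster in clusters
    }


if __name__ == "__main__":
    # Hypothetical inputs; in practice the values would come from the
    # outputs of the RDS resource rather than being written inline.
    manifests = manifests_per_cluster(
        ["cluster-a", "cluster-b"],
        "rds-credentials",
        "my-app",
        {"username": "app"},
    )
    for cluster, manifest in manifests.items():
        print(cluster, manifest["metadata"], manifest["data"])
```

How the manifests reach each cluster (a second provider configuration, federation, or a pipeline step) is the open question in this ticket; the sketch only shows that the content itself does not need to differ between clusters.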

How would this affect our Concourse pipelines?

ghost commented 4 years ago

MANAGING MULTIPLE CLUSTERS

OVERVIEW

The scope of this ticket was to discuss the plan and strategy for what is highlighted in the ticket description. In particular, when replicating an existing cluster, how do we handle and manage the 'interim' period when both clusters co-exist, before the replaced cluster is torn down? The following are the key points brought forward from discussion with the wider team:

The cluster being replaced needs to stay 'live' until the new cluster is tested as fully working.
The new cluster must replicate 'all' resources, such as RDS secrets, so that existing applications consuming those secrets do not break.
The scope of replication is only those resources provisioned using Terraform's 'kubernetes' provider. Existing AWS resources should not be affected.
The new cluster will be provisioned in the same VPC as the one being replaced.

OUTCOMES

To achieve and implement the above requirements, it was discussed that 'Kubernetes Federation' or 'Anthos' are viable solutions. However, both currently have impediments, which we would need to address. When possible, the next step would be to practically test the following:

Create a 'federated' cluster, i.e. the host.
Create a member cluster (a replica of the host).
Test the new resources in the new cluster, for example connectivity to an existing RDS instance using the newly created secret on the new cluster.
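Before testing connectivity in the third step, a cheaper first check could be to confirm that every secret present on the host also exists on the member cluster. The sketch below assumes the secret names per namespace have already been collected from each cluster by some other means; the function name and the input shape are hypothetical.

```python
def missing_on_member(host_secrets, member_secrets):
    """Return secrets that exist on the host but not on the member cluster.

    Both arguments map a namespace name to a collection of secret names.
    The result maps each affected namespace to a sorted list of the
    secret names that are absent on the member.
    """
    missing = {}
    for namespace, names in host_secrets.items():
        absent = set(names) - set(member_secrets.get(namespace, ()))
        if absent:
            missing[namespace] = sorted(absent)
    return missing


if __name__ == "__main__":
    # Hypothetical inventories for the two clusters.
    host = {"my-app": {"rds-credentials", "s3-credentials"}, "other-app": {"rds-credentials"}}
    member = {"my-app": {"rds-credentials"}}
    print(missing_on_member(host, member))
```

An empty result only shows that the names match. It says nothing about whether the values are correct, so the connectivity test is still needed.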

Multiple Clusters.jpg

ghost commented 4 years ago

PR approved / closed: https://github.com/ministryofjustice/cloud-platform/pull/2051

Explore how to manage terraform state on a multi-cluster platform #1817