metal-stack / gardener-extension-provider-metal

Implementation of the gardener-extension-controller for metal-stack
MIT License

Shoot migration not working anymore #306

Open mwennrich opened 1 year ago

mwennrich commented 1 year ago
The migration gets stuck at "Waiting until the namespace 'shoot--p4jxn2--mwentest' has been cleaned up and deleted in the Seed cluster...":

shoot--p4jxn2--mwentest                        Terminating   20m

NAME                                                                     NAMESPACE                AGE
firewalldeployment.firewall.metal-stack.io/shoot-firewall                shoot--p4jxn2--mwentest  17m
firewall.firewall.metal-stack.io/shoot--p4jxn2--mwentest-firewall-0f0a9  shoot--p4jxn2--mwentest  17m
firewallset.firewall.metal-stack.io/shoot-firewall-0eea7                 shoot--p4jxn2--mwentest  17m

The fw2, fwset, and fwdeployment objects still have a firewall.metal-stack.io/firewall-controller-manager finalizer, but the firewall-controller-manager (fcm) has already been deleted, so nothing removes the finalizer anymore.

After removing the finalizer, the migration continues. After the restore, however, a new firewall is created without the old one being deleted, which leaves the cluster with two firewalls.

$ k get fwmon -n firewall
NAME                                     MACHINE ID                             IMAGE                          SIZE            LAST EVENT    AGE
shoot--p4jxn2--mwentest-firewall-0f0a9   256b1c00-be6d-11e9-8000-3cecef22b288   firewall-ubuntu-3.0.20230404   n1-medium-x86   Phoned Home   35m
shoot--p4jxn2--mwentest-firewall-e3c19   48eb9200-be80-11e9-8000-3cecef22fc1a   firewall-ubuntu-3.0.20230404   n1-medium-x86   Phoned Home   8m30s
Gerrit91 commented 1 year ago

Very rough idea:

Gerrit91 commented 1 year ago

With #308 we can make the firewall survive the shoot migration.

However, since the firewall-controller now maintains a seed client for reconciliation, that client becomes invalid after a shoot migration. We use a static service account token, which Kubernetes signs with the cluster's CA, and that CA is of course different after the migration. The server endpoint changes with the migration as well.

Thus, the firewall-controller needs a way to migrate its client to the new seed. For this, I think we have two options:

If we go with the second variant, we should also consider moving away from static service account tokens and rotating client certificates instead. We could also use bootstrap tokens to establish a trusted connection between the firewall-controller and the api-server.
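
For the bootstrap token itself, Kubernetes expects a 6-character token ID and a 16-character token secret, both drawn from [a-z0-9], stored in a Secret named bootstrap-token-<token-id>. A minimal Go sketch of generating such a pair could look like this. The helper names randomString and newBootstrapToken are made up for illustration and are not existing firewall-controller-manager functions:

```go
package main

import (
	"crypto/rand"
	"fmt"
	"math/big"
	"regexp"
)

// tokenChars is the character set Kubernetes allows in bootstrap token IDs and secrets.
const tokenChars = "abcdefghijklmnopqrstuvwxyz0123456789"

// tokenPattern matches the bootstrap token format "<token-id>.<token-secret>".
var tokenPattern = regexp.MustCompile(`^[a-z0-9]{6}\.[a-z0-9]{16}$`)

// randomString returns n characters picked from tokenChars using crypto/rand.
func randomString(n int) (string, error) {
	out := make([]byte, n)
	for i := range out {
		idx, err := rand.Int(rand.Reader, big.NewInt(int64(len(tokenChars))))
		if err != nil {
			return "", err
		}
		out[i] = tokenChars[idx.Int64()]
	}
	return string(out), nil
}

// newBootstrapToken (hypothetical helper) returns a fresh token ID and token secret.
func newBootstrapToken() (id, secret string, err error) {
	if id, err = randomString(6); err != nil {
		return "", "", err
	}
	if secret, err = randomString(16); err != nil {
		return "", "", err
	}
	return id, secret, nil
}

func main() {
	id, secret, err := newBootstrapToken()
	if err != nil {
		panic(err)
	}
	// The Secret holding the token must be named after the token ID.
	fmt.Println("secret name:", "bootstrap-token-"+id)
	fmt.Println("valid format:", tokenPattern.MatchString(id+"."+secret))
}
```

The ID and secret would then go into the token-id and token-secret fields of the bootstrap token Secret, with a short expiration so that a leaked userdata token is only useful for a limited time.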

Here is a brief description of how the process could look:

  1. The firewall gets created with a bootstrap kubeconfig, passed through userdata at /etc/firewall-controller/.bootstrap.kubeconfig, along with the following roles in the shoot's seed namespace:

    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    metadata:
      name: firewall.metal-stack.io:system:firewall-bootstrapper
    rules:
    - apiGroups:
      - certificates.k8s.io
      resources:
      - certificatesigningrequests
      verbs:
      - create
      - get
    - apiGroups:
      - certificates.k8s.io
      resources:
      - certificatesigningrequests/firewallcontroller
      verbs:
      - create
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      name: firewall.metal-stack.io:system:firewall-bootstrapper
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: firewall.metal-stack.io:system:firewall-bootstrapper
    subjects:
    - kind: Group
      name: system:bootstrappers
      apiGroup: rbac.authorization.k8s.io
    ---
    apiVersion: v1
    kind: Secret
    metadata:
      name: bootstrap-token-07401b
      namespace: kube-system
    type: bootstrap.kubernetes.io/token
    stringData:
      description: "Token for bootstrapping the metal-stack firewall-controller."
      token-id: 07401b
      token-secret: f395accd246ae52d
      expiration: <now+60m>
      usage-bootstrap-authentication: "true"
      usage-bootstrap-signing: "true"
      auth-extra-groups: system:bootstrappers
  2. The firewall-controller starts up and uses the bootstrap kubeconfig to issue a certificate signing request (CSR)

  3. The firewall-controller-manager approves the CSR, which enables the firewall-controller to construct a seed client with the same minimal permissions that are implemented today.

    apiVersion: certificates.k8s.io/v1
    kind: CertificateSigningRequest
    metadata:
      name: firewall-controller-csr
    spec:
      groups:
      - system:authenticated
      request: <csr>
      signerName: kubernetes.io/kube-apiserver-client
      usages:
      - digital signature
      - key encipherment
      - client auth
      username: shoot--pcfgbt--cilium-firewall-653f3  # FCM creates a rolebinding and role for every firewall
      expirationSeconds: <1 year?>
    status:
      certificate: <cert>
      conditions:
      - lastTransitionTime: "2023-06-21T10:39:54Z"
        lastUpdateTime: "2023-06-21T10:39:54Z"
        message: Auto approving firewall-controller client certificate after SubjectAccessReview.
        reason: AutoApproved
        status: "True"
        type: Approved
  4. The firewall-controller writes the seed kubeconfig to /etc/firewall-controller/.seed.kubeconfig

  5. The firewall-controller uses the shoot access fields from the firewall object to create the shoot client

  6. The shoot client's kubeconfig is written to /etc/firewall-controller/.shoot.kubeconfig

  7. The firewall-controller begins normal operation

    • Asynchronously updates the tokens in the .shoot.kubeconfig and .seed.kubeconfig through the firewall monitor's shoot access fields
  8. The signed certificate for the firewall-controller is continuously checked by the firewall-controller-manager

    • When the certificate becomes invalid (e.g. due to a shoot migration or a requested CA roll), a new bootstrap kubeconfig is placed in the corresponding field of the seed access section in the firewall monitor
  9. If the firewall-controller's seed client fails with an invalid certificate error, it repeats the initial bootstrap process and creates a new seed client
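
The certificate part of this flow (steps 2, 3, 8 and 9) can be roughly sketched in Go with only the standard library. This is an illustration under assumptions, not the actual implementation. The two CAs are throwaway stand-ins for the old and new seed cluster CAs, signing happens locally instead of through the Kubernetes CSR API, and all function names (newCA, newCSR, signCSR, needsRebootstrap) as well as the firewall name are made up:

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

// newCA creates a throwaway self-signed CA that stands in for a seed cluster CA.
func newCA(name string) (*x509.Certificate, *ecdsa.PrivateKey, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(1),
		Subject:               pkix.Name{CommonName: name},
		NotBefore:             time.Now().Add(-time.Minute),
		NotAfter:              time.Now().Add(time.Hour),
		IsCA:                  true,
		BasicConstraintsValid: true,
		KeyUsage:              x509.KeyUsageCertSign,
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, nil, err
	}
	cert, err := x509.ParseCertificate(der)
	return cert, key, err
}

// newCSR builds the request the firewall-controller would submit in step 2.
func newCSR(firewallName string) (*x509.CertificateRequest, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}
	tmpl := &x509.CertificateRequest{Subject: pkix.Name{CommonName: firewallName}}
	der, err := x509.CreateCertificateRequest(rand.Reader, tmpl, key)
	if err != nil {
		return nil, err
	}
	return x509.ParseCertificateRequest(der)
}

// signCSR issues a client certificate, simulating approval and signing in step 3.
func signCSR(csr *x509.CertificateRequest, ca *x509.Certificate, caKey *ecdsa.PrivateKey) (*x509.Certificate, error) {
	if err := csr.CheckSignature(); err != nil {
		return nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(2),
		Subject:      csr.Subject,
		NotBefore:    time.Now().Add(-time.Minute),
		NotAfter:     time.Now().Add(time.Hour),
		ExtKeyUsage:  []x509.ExtKeyUsage{x509.ExtKeyUsageClientAuth},
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, ca, csr.PublicKey, caKey)
	if err != nil {
		return nil, err
	}
	return x509.ParseCertificate(der)
}

// needsRebootstrap reports whether cert fails to verify against ca (steps 8 and 9).
func needsRebootstrap(cert, ca *x509.Certificate) bool {
	roots := x509.NewCertPool()
	roots.AddCert(ca)
	_, err := cert.Verify(x509.VerifyOptions{
		Roots:     roots,
		KeyUsages: []x509.ExtKeyUsage{x509.ExtKeyUsageClientAuth},
	})
	return err != nil
}

func check(err error) {
	if err != nil {
		panic(err)
	}
}

func main() {
	oldCA, oldKey, err := newCA("old-seed-ca")
	check(err)
	newSeedCA, _, err := newCA("new-seed-ca")
	check(err)
	csr, err := newCSR("shoot--example--firewall-00000") // made-up firewall name
	check(err)
	cert, err := signCSR(csr, oldCA, oldKey)
	check(err)
	fmt.Println("old seed, re-bootstrap needed:", needsRebootstrap(cert, oldCA))
	fmt.Println("new seed, re-bootstrap needed:", needsRebootstrap(cert, newSeedCA))
}
```

The certificate issued under the old seed's CA verifies against that CA but not against the new one, which is the condition that would trigger a fresh bootstrap kubeconfig and a new CSR after a shoot migration or CA roll.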