senthilrch / kube-fledged

A kubernetes operator for creating and managing a cache of container images directly on the cluster worker nodes, so application pods start almost instantly
Apache License 2.0

TLS errors when updating kube-fledged helm chart #200

Open robwittman opened 1 year ago

robwittman commented 1 year ago

When the kube-fledged helm chart is redeployed, if the changes don't cause the webhook-server component to restart, any ImageCache operations start failing with

 Internal error occurred: failed calling webhook "validate-image-cache.kubefledged.io": failed to call webhook: Post "https://kube-fledged-webhook-server.kube-system.svc:3443/validate-image-cache?timeout=1s": x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "kubefledged.io")

It looks like this is because the webhook CA bundle is hardcoded in the helm chart, but when the webhook server starts, init-server generates a new CA bundle and patches the webhook configuration. On the next deployment, the original hardcoded CA bundle is reapplied and webhook requests begin to fail, until the webhook component is restarted again to re-patch the bundle.

Is there a best practice for keeping that CA bundle configured appropriately? Would support for an existing Certificate secret make sense?
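To confirm the mismatch described above, you can compare the live caBundle on the ValidatingWebhookConfiguration with the one rendered by the chart. A rough sketch (the webhook configuration resource name is an assumption; list the configurations first to find yours):

```shell
# Find the kube-fledged validating webhook configuration (name assumed)
kubectl get validatingwebhookconfigurations

# caBundle currently in effect (patched by init-server at webhook startup)
kubectl get validatingwebhookconfiguration kube-fledged \
  -o jsonpath='{.webhooks[0].clientConfig.caBundle}' | head -c 60; echo

# caBundle hardcoded in the chart's rendered manifests
helm get manifest kube-fledged -n kube-system | grep -A1 caBundle
```

If the two values differ, the next `helm upgrade` will clobber the live bundle and trigger the x509 error.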

Steps to reproduce

Install base helm chart

helm install kube-fledged kubefledged-charts/kube-fledged -n kube-system --wait

Deploy a simple image cache

echo 'apiVersion: kubefledged.io/v1alpha2
kind: ImageCache
metadata:
  name: vault
  namespace: kube-system
spec:
  cacheSpec:
  - images:
    - vault:1.9.5
' | kubectl apply -f - 

Update the helm chart, with a value that doesn't restart the webhook server

helm upgrade --install kube-fledged kubefledged-charts/kube-fledged -n kube-system --set args.controllerImagePullPolicy=IfNotPresent --wait

If you now update the ImageCache above, the webhook errors are returned. After restarting the webhook component, the updates succeed again.
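The restart mentioned in the last step can be done with a rollout restart, which makes init-server regenerate the certificate and re-patch the caBundle. A sketch, assuming the deployment carries the same name as the service seen in the error message:

```shell
# Restart the webhook server so init-server re-patches the caBundle
# (deployment name assumed from the service name in the error above)
kubectl -n kube-system rollout restart deployment kube-fledged-webhook-server
kubectl -n kube-system rollout status deployment kube-fledged-webhook-server
```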

anbrsap commented 1 year ago

@senthilrch In 9259975 the service account for the webhook server was removed. When I ran helm upgrade from v0.10.0 to (my fork of) v0.10.0 with 9259975 cherry-picked, the webhook server deployment still contained the fields serviceAccount and serviceAccountName with the former value (using Helm v3.11.2). This happens because Helm patches the existing deployment manifest (unless --force is used): the patch does not remove the field serviceAccount (deprecated, but kept in sync with serviceAccountName), and Kubernetes re-populates serviceAccountName from serviceAccount. Consequently, the new pod could not start because the service account no longer existed.

There are two options to fix this:

  1. Add the following to the Helm template:

          # ensure helm upgrade deletes the formerly used fields
          serviceAccount: ""
          serviceAccountName: ""
  2. Revert the removal of the service account. In general it's a good practice to use a dedicated service account instead of default.

I suggest option 2.
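Whichever option is taken, the stale field is easy to check for after an upgrade. A sketch (release and deployment names are assumptions; empty output means the field was cleared):

```shell
# Check whether the removed service account is still referenced
kubectl -n kube-system get deployment kube-fledged-webhook-server \
  -o jsonpath='{.spec.template.spec.serviceAccountName}'; echo

# As a one-off escape hatch, replace rather than patch the manifests,
# which drops fields absent from the new chart
helm upgrade kube-fledged kubefledged-charts/kube-fledged -n kube-system --force
```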

Chili-Man commented 1 year ago

@senthilrch any updates on this ?

aledeulo commented 1 year ago

Hi there. @senthilrch any updates on a fix for this issue? I'm using zarf to deploy/re-deploy kube-fledged packages. In principle, you can do a fresh installation of the helm charts by removing the previous installation with zarf and then creating a new installation package with zarf, which should deploy everything related to kube-fledged from zero, including the webhook server. I'm seeing the same issue. Thanks

aledeulo commented 1 year ago

I've found a simple workaround to this issue: Add the following configs to the values.yaml to disable the webhook server and the validation webhook

# Disable webhook server and validation webhook
webhookServer:
  enable: false
validatingWebhook:
  # Specifies whether a validating webhook configuration should be created
  create: false

This is probably not the best solution, but from what I've seen in the code and in the Makefile's deploy-using-yaml target, this is a well-known issue and the validation webhook is probably not strictly required.
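The same workaround can be applied with --set flags instead of editing values.yaml (value paths taken from the snippet above):

```shell
# Disable the webhook server and the validating webhook at install/upgrade time
helm upgrade --install kube-fledged kubefledged-charts/kube-fledged \
  -n kube-system \
  --set webhookServer.enable=false \
  --set validatingWebhook.create=false \
  --wait
```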

With the webhook disabled, ImageCache operations complete normally again:

status:
    completionTime: "2023-08-01T14:59:51Z"
    message: All requested images pulled succesfully to respective nodes
    reason: ImageCacheCreate
    startTime: "2023-08-01T14:59:42Z"
    status: Succeeded