wave-k8s / wave

Kubernetes configuration tracking controller
Apache License 2.0

Wave not triggering deployment rollout on secret change on Openshift 4.14 #175

Closed abudavis closed 1 month ago

abudavis commented 1 month ago

Wave version: latest as per helm chart
OpenShift: v4.14.31

Install commands used:

helm repo add wave-k8s https://wave-k8s.github.io/wave/
helm install wave wave-k8s/wave --namespace wave --set syncPeriod=5m --set webhooks.enabled=true --create-namespace

We are trying to get wave to restart a pod from a Deployment. I set the following annotations and changed the secret "mqsicredentials" in the "ace" namespace, but nothing happened. Wave is deployed in namespace "wave" and the Deployment is deployed in namespace "utilities". The Deployment does not mount the secret "mqsicredentials", as that's not needed and the secret is in a different namespace ("ace") anyway.

The "Updating instance hash" entry in the log below appeared when I set these annotations on the deployment for the first time (at 11:42 UTC in the logs), at which point it did restart the pod. But when I later updated the secret and waited much longer than 5 minutes, nothing happened. I have now made multiple attempts to update the secret and nothing! Please help.

kind: Deployment
apiVersion: apps/v1
metadata:
  annotations:
    wave.pusher.com/extra-secrets: ace/mqsicredentials
    wave.pusher.com/update-on-config-change: 'true'
  name: update-acevault
...

Deployment's Pod where the hash annotation has been inserted by wave:

kind: Pod
apiVersion: v1
metadata:
  generateName: update-acevault-858bf7578f-
  annotations:
    k8s.ovn.org/pod-networks: '{"default":{"ip_addresses":["10.130.2.8/23"],"mac_address":"0a:58:0a:82:02:08","gateway_ips":["10.130.2.1"],"routes":[{"dest":"10.128.0.0/14","nextHop":"10.130.2.1"},{"dest":"172.30.0.0/16","nextHop":"10.130.2.1"},{"dest":"100.64.0.0/16","nextHop":"10.130.2.1"}],"ip_address":"10.130.2.8/23","gateway_ip":"10.130.2.1"}}'
    k8s.v1.cni.cncf.io/network-status: |-
      [{
          "name": "ovn-kubernetes",
          "interface": "eth0",
          "ips": [
              "10.130.2.8"
          ],
          "mac": "0a:58:0a:82:02:08",
          "default": true,
          "dns": {}
      }]
    kubectl.kubernetes.io/restartedAt: '2024-06-24T14:56:12Z'
    openshift.io/scc: restricted-v2
    ops.corp.com/triggerrestart: '40020656'
    seccomp.security.alpha.kubernetes.io/pod: runtime/default
    wave.pusher.com/config-hash: 100444e91862dd77d7ebe29f050c1e9a7f357c771e1a7b7650aae27e6a3a031d
  resourceVersion: '194634535'
  name: update-acevault-858bf7578f-wkqff

The wave pod logs are as follows. There is also a mutatingwebhookconfiguration for wave, unsure how to check if that works.

2024-10-10T11:35:20Z    INFO    setup   setting up client for manager
2024-10-10T11:35:20Z    INFO    setup   setting up manager
2024-10-10T11:35:20Z    INFO    setup   Registering Components.
2024-10-10T11:35:20Z    INFO    setup   setting up scheme
2024-10-10T11:35:20Z    INFO    setup   Setting up controller
2024-10-10T11:35:20Z    INFO    controller-runtime.builder  Registering a mutating webhook  {"GVK": "apps/v1, Kind=Deployment", "path": "/mutate-apps-v1-deployment"}
2024-10-10T11:35:20Z    INFO    controller-runtime.webhook  Registering webhook {"path": "/mutate-apps-v1-deployment"}
2024-10-10T11:35:20Z    INFO    controller-runtime.builder  skip registering a validating webhook, object does not implement admission.Validator or WithValidator wasn't called {"GVK": "apps/v1, Kind=Deployment"}
2024-10-10T11:35:20Z    INFO    controller-runtime.builder  Registering a mutating webhook  {"GVK": "apps/v1, Kind=StatefulSet", "path": "/mutate-apps-v1-statefulset"}
2024-10-10T11:35:20Z    INFO    controller-runtime.webhook  Registering webhook {"path": "/mutate-apps-v1-statefulset"}
2024-10-10T11:35:20Z    INFO    controller-runtime.builder  skip registering a validating webhook, object does not implement admission.Validator or WithValidator wasn't called {"GVK": "apps/v1, Kind=StatefulSet"}
2024-10-10T11:35:20Z    INFO    controller-runtime.builder  Registering a mutating webhook  {"GVK": "apps/v1, Kind=DaemonSet", "path": "/mutate-apps-v1-daemonset"}
2024-10-10T11:35:20Z    INFO    controller-runtime.webhook  Registering webhook {"path": "/mutate-apps-v1-daemonset"}
2024-10-10T11:35:20Z    INFO    controller-runtime.builder  skip registering a validating webhook, object does not implement admission.Validator or WithValidator wasn't called {"GVK": "apps/v1, Kind=DaemonSet"}
2024-10-10T11:35:20Z    INFO    setup   Starting the Cmd.
2024-10-10T11:35:20Z    INFO    controller-runtime.metrics  Starting metrics server
2024-10-10T11:35:20Z    INFO    controller-runtime.webhook  Starting webhook server
2024-10-10T11:35:20Z    INFO    Starting EventSource    {"controller": "statefulset-controller", "source": "kind source: *v1.StatefulSet"}
2024-10-10T11:35:20Z    INFO    Starting EventSource    {"controller": "deployment-controller", "source": "kind source: *v1.Deployment"}
2024-10-10T11:35:20Z    INFO    controller-runtime.metrics  Serving metrics server  {"bindAddress": ":8080", "secure": false}
2024-10-10T11:35:20Z    INFO    Starting EventSource    {"controller": "deployment-controller", "source": "kind source: *v1.ConfigMap"}
2024-10-10T11:35:20Z    INFO    Starting EventSource    {"controller": "statefulset-controller", "source": "kind source: *v1.ConfigMap"}
2024-10-10T11:35:20Z    INFO    Starting EventSource    {"controller": "deployment-controller", "source": "kind source: *v1.Secret"}
2024-10-10T11:35:20Z    INFO    Starting Controller {"controller": "deployment-controller"}
2024-10-10T11:35:20Z    INFO    Starting EventSource    {"controller": "statefulset-controller", "source": "kind source: *v1.Secret"}
2024-10-10T11:35:20Z    INFO    Starting Controller {"controller": "statefulset-controller"}
2024-10-10T11:35:20Z    INFO    Starting EventSource    {"controller": "daemonset-controller", "source": "kind source: *v1.DaemonSet"}
2024-10-10T11:35:20Z    INFO    Starting EventSource    {"controller": "daemonset-controller", "source": "kind source: *v1.ConfigMap"}
2024-10-10T11:35:20Z    INFO    Starting EventSource    {"controller": "daemonset-controller", "source": "kind source: *v1.Secret"}
2024-10-10T11:35:20Z    INFO    Starting Controller {"controller": "daemonset-controller"}
2024-10-10T11:35:20Z    INFO    controller-runtime.certwatcher  Updated current TLS certificate
2024-10-10T11:35:20Z    INFO    controller-runtime.webhook  Serving webhook server  {"host": "", "port": 9443}
2024-10-10T11:35:20Z    INFO    controller-runtime.certwatcher  Starting certificate watcher
2024-10-10T11:35:20Z    INFO    Starting workers    {"controller": "statefulset-controller", "worker count": 1}
2024-10-10T11:35:20Z    INFO    Starting workers    {"controller": "deployment-controller", "worker count": 1}
2024-10-10T11:35:20Z    INFO    Starting workers    {"controller": "daemonset-controller", "worker count": 1}
2024/10/10 11:38:28 http: TLS handshake error from 10.128.4.2:46762: EOF
2024-10-10T11:42:21Z    INFO    wave    Updating instance hash  {"namespace": "utilities", "name": "update-acevault", "dryRun": false, "isCreate": false, "hash": "100444e91862dd77d7ebe29f050c1e9a7f357c771e1a7b7650aae27e6a3a031d"}
2024-10-10T11:42:21Z    DEBUG   events  Configuration hash updated to 100444e91862dd77d7ebe29f050c1e9a7f357c771e1a7b7650aae27e6a3a031d  {"type": "Normal", "object": {"kind":"Deployment","namespace":"utilities","name":"update-acevault","uid":"a488f0db-3af1-4282-8cae-a045e394611a","apiVersion":"apps/v1","resourceVersion":"188476560"}, "reason": "ConfigChanged"}
2024/10/10 11:53:28 http: TLS handshake error from 10.128.4.2:36522: EOF
toelke commented 1 month ago

A quick test using the e2e-test script shows this working.

I am not really used to OpenShift. Is wave allowed to watch the ConfigMaps of the namespace "ace"?

abudavis commented 1 month ago

@toelke Thanks for looking into this. I am not sure how to check that. Wave is in namespace "wave", the secret is in namespace "ace", whereas the deployment that does not mount/use that secret is in namespace "utilities". I haven't used the "--namespaces=" option to limit it, so I have kind of assumed the helm chart has the RBAC needed to make this work, comments?
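One way to check this from the command line is `oc auth can-i` with service-account impersonation. The sketch below is a minimal helper, assuming the chart's service account is `wave-wave` in namespace `wave` (as the ClusterRoleBinding shown later in this thread suggests) and that your own user is allowed to impersonate service accounts. The function names `sa_user` and `wave_can_i` are made up for illustration.

```shell
# Build the user name Kubernetes uses for a service account.
# Usage: sa_user NAMESPACE NAME
sa_user() {
  printf 'system:serviceaccount:%s:%s' "$1" "$2"
}

# Ask the API server whether wave's service account may perform VERB on
# RESOURCE in NAMESPACE. The service account name and namespace are
# assumptions taken from this thread.
# Usage: wave_can_i VERB RESOURCE NAMESPACE
wave_can_i() {
  oc auth can-i "$1" "$2" --namespace "$3" --as "$(sa_user wave wave-wave)"
}
```

For example, `wave_can_i watch secrets ace` and `wave_can_i patch deployments utilities` would be expected to answer `yes` if the ClusterRole is bound correctly.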

toelke commented 1 month ago

I have kind of assumed the helm chart has the RBAC needed to make this work

I assumed the same, but needed to lean on your OpenShift experience to confirm that that is supposed to work with OpenShift. I fear I need to break off my investigation soon; I will pick it up tomorrow.

My WIP in changing the tests is here: https://github.com/wave-k8s/wave/commit/748a7ac325ec8bab3ddfc9beddc5f91413e0b1a8

When you run it, it will stop at "Waiting for test to complete" until you change/create either the ConfigMap or Secret "test/test".

abudavis commented 1 month ago

@toelke I checked the cluster role and clusterrolebinding, then deleted it to see whether wave complains. I got a ton of RBAC errors in the pod, which was good, so the RBAC is likely fine.

$ oc get clusterrole wave-wave
NAME        CREATED AT
wave-wave   2024-10-10T11:35:13Z

$ oc get clusterrolebinding wave-wave -o yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  annotations:
    meta.helm.sh/release-name: wave
    meta.helm.sh/release-namespace: wave
  creationTimestamp: "2024-10-10T11:35:13Z"
  labels:
    app: wave
    app.kubernetes.io/managed-by: Helm
    heritage: Helm
    release: wave
  name: wave-wave
  resourceVersion: "194627029"
  uid: 4c730d61-8f45-4519-88c3-9e384cc85094
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: wave-wave
subjects:
- kind: ServiceAccount
  name: wave-wave
  namespace: wave

$ oc get sa wave-wave
NAME        SECRETS   AGE
wave-wave   1         4h44m

$ oc get sa wave-wave -o yaml
apiVersion: v1
imagePullSecrets:
- name: wave-wave-dockercfg-w74wx
kind: ServiceAccount
metadata:
  annotations:
    meta.helm.sh/release-name: wave
    meta.helm.sh/release-namespace: wave
  creationTimestamp: "2024-10-10T11:35:13Z"
  labels:
    app: wave
    app.kubernetes.io/managed-by: Helm
    heritage: Helm
    release: wave
  name: wave-wave
  namespace: wave
  resourceVersion: "194627052"
  uid: b10bf2a7-5dfa-4549-923b-6a3c9aa23c44
secrets:
- name: wave-wave-dockercfg-w74wx

Next, I deleted the annotations and hash and then added an annotation for a secret pointing to a non-existent secret in "ace". Wave then printed this in the log, but it didn't restart the pod, which is good as it probably couldn't find the secret, but I'd say wave should have printed that in the log, which it didn't.

2024-10-10T15:58:21Z INFO wave Updating instance hash {"namespace": "utilities", "name": "update-acevault", "dryRun": false, "isCreate": false, "hash": "100444e91862dd77d7ebe29f050c1e9a7f357c771e1a7b7650aae27e6a3a031d"}
2024-10-10T15:58:21Z DEBUG events Configuration hash updated to 100444e91862dd77d7ebe29f050c1e9a7f357c771e1a7b7650aae27e6a3a031d {"type": "Normal", "object": {"kind":"Deployment","namespace":"utilities","name":"update-acevault","uid":"a488f0db-3af1-4282-8cae-a045e394611a","apiVersion":"apps/v1","resourceVersion":"194879739"}, "reason": "ConfigChanged"}

Next, I used a secret that is in the same namespace "utilities" as the deployment, but that did not make any difference whatsoever. I tried both of these combinations. This leads me to believe that "wave.pusher.com/extra-secrets" is not working on OpenShift, maybe?

wave.pusher.com/extra-secrets: utilities/cpd-cli-apikey
wave.pusher.com/extra-secrets: cpd-cli-apikey
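As a sketch of how such a reference can be read, the helper below splits one `extra-secrets` entry into namespace and name, falling back to the Deployment's own namespace when no `/` is present. That fallback is an assumption about how the annotation is meant to work, not code taken from wave, and `split_ref` is a hypothetical name.

```shell
# Split "namespace/name" or "name" into "namespace name".
# A bare name falls back to DEFAULT_NAMESPACE (assumed behaviour).
# Usage: split_ref REF DEFAULT_NAMESPACE
split_ref() {
  case "$1" in
    */*) printf '%s %s\n' "${1%%/*}" "${1#*/}" ;;
    *)   printf '%s %s\n' "$2" "$1" ;;
  esac
}
```

Under that assumption, `split_ref utilities/cpd-cli-apikey utilities` and `split_ref cpd-cli-apikey utilities` resolve to the same secret, which matches the observation that both variants behaved identically.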

I can't set the deployment to mount the secret, as that would allow anyone with access to the pod to read the mounted secret. The pod is supposed to read the secret, do some stuff real quick and delete the secret.

toelke commented 1 month ago

$ oc get clusterrolebinding wave-wave -o yaml

oc get clusterrole would also be interesting.

but it didn't restart the pod, which is good as it probably couldn't find the secret, but I'd say wave should have printed that in the log, which it didn't.

If you create the secret, wave will pick it up and restart the Pod.

I can't set the deployment to mount the secret, as that would allow anyone with access to the pod to read the mounted secret.

That is precisely what this feature is for.

Can you show the clusterrole of update-acevault-? To compare why it can read the secret but wave can't?

How many secrets, deployments and configmaps are in your cluster overall? Are you sure that setting SyncPeriod to such a low value is sensible? The default is 10 hours. Note: This is not about how fast wave will normally react to changes in secrets and configmaps.

abudavis commented 1 month ago

@toelke "If you create the secret, wave will pick it up and restart the Pod."? >>> The secret is already created. Maybe I misunderstood this, but my understanding is that wave is capable of detecting an update or change of an existing secret and changing/updating/inserting a hash on the deployment at /spec/template/metadata/annotations?
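To illustrate the idea (this is not wave's actual algorithm, which may hash more than this), the sketch below derives a SHA-256 digest from a secret's contents. Any change to the data yields a different digest, and a changed annotation on the pod template is what makes Kubernetes roll out new pods. It assumes `sha256sum` from GNU coreutils is available; `config_hash` and the sample data are hypothetical.

```shell
# Print the SHA-256 hex digest of the given string.
# Usage: config_hash DATA
config_hash() {
  printf '%s' "$1" | sha256sum | cut -d ' ' -f 1
}

# Hypothetical secret contents before and after an update.
old=$(config_hash 'password=hunter2')
new=$(config_hash 'password=hunter3')

# Identical input gives an identical digest; changed input gives a new one.
[ "$old" = "$(config_hash 'password=hunter2')" ] && echo "unchanged data: same hash"
[ "$old" != "$new" ] && echo "changed data: new hash"
```

If the secret is updated but the hash in the pod template stays the same, as in the logs above, the Deployment has no reason to roll out.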

oc get clusterrole

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  annotations:
    meta.helm.sh/release-name: wave
    meta.helm.sh/release-namespace: wave
  creationTimestamp: "2024-10-10T11:35:13Z"
  labels:
    app: wave
    app.kubernetes.io/managed-by: Helm
    heritage: Helm
    release: wave
  name: wave-wave
  resourceVersion: "195903655"
  uid: ccf2db3c-1dc7-4f58-960b-c409471157d5
rules:
- apiGroups:
  - ""
  resources:
  - configmaps
  - secrets
  verbs:
  - list
  - get
  - update
  - patch
  - watch
- apiGroups:
  - ""
  resources:
  - events
  verbs:
  - create
  - update
  - patch
- apiGroups:
  - apps
  resources:
  - deployments
  - daemonsets
  - statefulsets
  verbs:
  - list
  - get
  - update
  - patch
  - watch
- apiGroups:
  - coordination.k8s.io
  resources:
  - leases
  verbs:
  - '*'

ACE vault's RBAC:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: acevault
  namespace: utilities
  annotations:
    argocd.argoproj.io/sync-wave: "0"
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: edit-acevault3
  namespace: ace
  annotations:
    argocd.argoproj.io/sync-wave: "0"
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: edit
subjects:
- kind: ServiceAccount
  name: acevault
  namespace: utilities
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  creationTimestamp: null
  name: edit-acevault4
  namespace: utilities
  annotations:
    argocd.argoproj.io/sync-wave: "0"
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: edit
subjects:
- kind: ServiceAccount
  name: acevault
  namespace: utilities

I am not sure how many objects we have, but we have over 2000 pods running on the OpenShift cluster. SyncPeriod: I kind of assumed that leaving it at the default of 10 hours would mean it would take 10 hours for wave to detect that the secret has changed and trigger the deployment? What does this setting mean really?

toelke commented 1 month ago

The secret is already created. Maybe I misunderstood this, but my understanding is that wave is capable of detecting an update or change of an existing secret

It is also capable of detecting the creation of a new secret. I was referencing this test you did:

Next, I deleted the annotations and hash and then added an annotation for a secret pointing to a non-existent secret in "ace". Wave then printed this in the log, but it didn't restart the pod, which is good as it probably couldn't find the secret,

[SyncPeriod]: What does this setting mean really?

https://github.com/kubernetes-sigs/controller-runtime/blob/v0.17.2/pkg/cache/cache.go#L146-L171 It is mainly to work around possible bugs where the watch-stream loses updates; in effect, this is a worst-case reaction time for wave.

abudavis commented 1 month ago

@toelke Is it possible to increase the log levels to see what's happening in the background for the wave pod? I am out of options to figure out why this doesn't work on our cluster.

toelke commented 1 month ago

A number of interesting code-paths are not logging; I might work on that.

To make sure: All Secrets and ConfigMaps exist? All mounted ones and all referenced by annotation? If any (non-optional) is missing, wave will not do anything until the full set exists.
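A quick way to verify that precondition for Secrets referenced by annotation could look like the sketch below. It only checks existence with `oc get secret`; the function name `missing_secrets` is hypothetical, and each argument is assumed to be written as NAMESPACE/NAME.

```shell
# Print every referenced secret that does not exist.
# Returns 1 if at least one is missing, 0 otherwise.
# Usage: missing_secrets NAMESPACE/NAME...
missing_secrets() {
  status=0
  for ref in "$@"; do
    ns=${ref%%/*}    # part before the first slash
    name=${ref#*/}   # part after the first slash
    if ! oc get secret "$name" --namespace "$ns" >/dev/null 2>&1; then
      echo "missing: $ref"
      status=1
    fi
  done
  return "$status"
}
```

For the Deployment in this issue that would be `missing_secrets ace/mqsicredentials`; no output means the full set exists. Note that a permission error also counts as "missing" here, since the helper only looks at the exit status.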

abudavis commented 1 month ago

@toelke In our implementation, it's just a secret update; no ConfigMap is involved. Oh wait, does the secret need to be annotated as well?

toelke commented 1 month ago

Can you try running the image quay.io/wave-k8s/wave:v0.9.0-extra-logging?

It would print something like

2024-10-11T09:11:26Z    INFO    wave    All children found      {"namespace": "default", "name": "test", "configMaps": "map[default/test:&ConfigMap{...} test/test:&ConfigMap{...}]", "secrets": "map[test/test:&Secret{...}]"}

Showing that wave found two configmaps (default/test and test/test) and one secret (test/test).

abudavis commented 1 month ago

@toelke That image works! Now the deployment is patched when the secret is changed, cool! So it was the image I guess?

This image does not work as intended: quay.io/wave-k8s/wave:v0.8.0

2024-10-11T10:28:38Z    INFO    wave    Updating instance hash  {"namespace": "utilities", "name": "update-acevault", "dryRun": false, "isCreate": false, "hash": "78949eeabaeb36d55ee38257b48ffc695c7fb925ed7cd2989efd272868e5e574"}
2024-10-11T10:28:38Z    DEBUG   events  Configuration hash updated to 78949eeabaeb36d55ee38257b48ffc695c7fb925ed7cd2989efd272868e5e574  {"type": "Normal", "object": {"kind":"Deployment","namespace":"utilities","name":"update-acevault","uid":"a488f0db-3af1-4282-8cae-a045e394611a","apiVersion":"apps/v1","resourceVersion":"196045962"}, "reason": "ConfigChanged"}
2024-10-11T10:28:38Z    INFO    wave    All children found  {"namespace": "utilities", "name": "update-acevault", "configMaps": "map[]", "secrets": "map[ace/mqsicredentials:&Secret{ObjectMeta:{mqsicredentials  ace  15a44b96-624c-456b-9ee9-62463740d509 194972281 0 2024-04-15 20:19:26 +0000 UTC <nil> <nil> map[] map[] [] [] [{Mozilla Update v1 2024-10-10 17:07:28 +0000 UTC FieldsV1 {\"f:data\":{\".\":{},\"f:key\":{}},\"f:type\":{}} }]},Data:map[string][]byte{key: [109 113 115 105 99 114 101 100 101 110 116 105 97 108 115 32 45 45 119 111 114 107 45 100 105 114 32 47 104 111 109 101 47 97 99 101 117 115 101 114 

I changed the image to "quay.io/wave-k8s/wave:v0.9.0" and it works perfectly without the extra logs! So I guess you might want to update your helm chart?

2024-10-11T10:40:10Z    INFO    wave    Updating instance hash  {"namespace": "utilities", "name": "update-acevault", "hash": "d693ab05fdfce3671763971a03921ea13164dca9bfe2f5d14f7fd41ba6f3b3e7"}
2024-10-11T10:40:10Z    DEBUG   events  Configuration hash updated to d693ab05fdfce3671763971a03921ea13164dca9bfe2f5d14f7fd41ba6f3b3e7  {"type": "Normal", "object": {"kind":"Deployment","namespace":"utilities","name":"update-acevault","uid":"a488f0db-3af1-4282-8cae-a045e394611a","apiVersion":"apps/v1","resourceVersion":"196049884"}, "reason": "ConfigChanged"}
toelke commented 1 month ago

Oh. I did not correctly release 0.9.0 to helm :facepalm:

Try chart version 4.3.0.
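A minimal sketch of that upgrade, assuming the release name, repo alias and namespace from the install commands at the top of this issue (`wave`, `wave-k8s/wave`, `wave`). The function names are hypothetical; `--reuse-values` keeps the `syncPeriod` and `webhooks.enabled` settings from the original install.

```shell
# Upgrade the existing release to a specific chart version, keeping the
# values that were set at install time.
# Usage: upgrade_wave CHART_VERSION
upgrade_wave() {
  helm repo update &&
    helm upgrade wave wave-k8s/wave --namespace wave --version "$1" --reuse-values
}

# Show which images the pods in the wave namespace are running.
wave_images() {
  oc get pods --namespace wave -o jsonpath='{.items[*].spec.containers[*].image}'
}
```

After `upgrade_wave 4.3.0`, running `wave_images` shows whether the pod has actually moved off the v0.8.0 image.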

abudavis commented 1 month ago

I am glad this is solved which also means you can now be certain it works on Openshift 4.x too :) Thank you for all the help!