open-policy-agent/kube-mgmt

Sidecar for managing OPA instances in Kubernetes.
Apache License 2.0

kube-mgmt doesn't reload configmaps if opa container restarts #189

Closed: alex0z1 closed this issue 1 year ago

alex0z1 commented 1 year ago

I have the following configuration:


---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: opa
  namespace: opa
  name: opa
spec:
  replicas: 1
  selector:
    matchLabels:
      app: opa
  template:
    metadata:
      labels:
        app: opa
      name: opa
    spec:
      containers:
        # WARNING: OPA is NOT running with an authorization policy configured. This
        # means that clients can read and write policies in OPA. If you are
        # deploying OPA in an insecure environment, be sure to configure
        # authentication and authorization on the daemon. See the Security page for
        # details: https://www.openpolicyagent.org/docs/security.html.
        - name: opa
          resources:
            limits:
              cpu: 100m
              memory: 128Mi
            requests:
              cpu: 100m
              memory: 128Mi
          image: openpolicyagent/opa:0.49.2-static
          args:
            - "run"
            - "--server"
            - --disable-telemetry
            - "--tls-cert-file=/certs/tls.crt"
            - "--tls-private-key-file=/certs/tls.key"
            - "--addr=0.0.0.0:8443"
            - "--addr=http://127.0.0.1:8181"
            - --authentication=token
            - --authorization=basic
            - /policies/authz.rego
            - --ignore=.*
          volumeMounts:
            - readOnly: true
              mountPath: /certs
              name: opa-server
            - mountPath: /policies
              name: policies
              readOnly: true
          livenessProbe:
              failureThreshold: 3
              httpGet:
                path: /health
                port: 8443
                scheme: HTTPS
              initialDelaySeconds: 3
              periodSeconds: 5
              successThreshold: 1
              timeoutSeconds: 1
          readinessProbe:
            failureThreshold: 3
            httpGet:
              path: /health
              port: 8443
              scheme: HTTPS
            initialDelaySeconds: 3
            periodSeconds: 5
            successThreshold: 1
            timeoutSeconds: 1
        - name: kube-mgmt
          volumeMounts:
          - mountPath: /policies
            name: policies
            readOnly: true        
          resources:
            limits:
              cpu: 100m
              memory: 128Mi
            requests:
              cpu: 100m
              memory: 128Mi
          image: openpolicyagent/kube-mgmt:8.0.1
          args:
            - --replicate-cluster=v1/namespaces
            - --replicate=networking.k8s.io/v1/ingresses
            - --replicate=v1/services
            - --replicate=policy/v1/poddisruptionbudgets
            - --opa-auth-token-file=/policies/token
            - --require-policy-label=true
            - --log-level=debug
      volumes:
        - name: opa-server
          secret:
            secretName: opa-server
        - name: policies
          secret:
            secretName: policies

kube-mgmt loads the ConfigMaps from the opa namespace during the first pod initialization, but if I kill the opa container (for instance by logging into the minikube node and running `pkill -f "opa run"`, or if the liveness probe fails for any reason), then kube-mgmt does not load the ConfigMaps into the opa container anymore. I have to restart the pod (or kill the kube-mgmt container), or make some dummy change to the ConfigMaps.

As a result, the OPA container returns 404:

{"client_addr":"10.244.0.1:48693","level":"info","msg":"Sent response.","req_id":172,"req_method":"POST","req_path":"/","resp_bytes":86,"resp_duration":0.365959,"resp_status":404,"time":"2023-03-03T22:00:57Z"}

and the client gets:

k apply -f ingress-bad.yaml -n qa
Error from server (InternalError): error when creating "ingress-bad.yaml": Internal error occurred: failed calling webhook "validating-webhook.openpolicyagent.org": failed to call webhook: the server could not find the requested resource

Is there any known workaround for this? Maybe some health check for kube-mgmt that verifies OPA has rules loaded? Or is there a way to make kube-mgmt periodically push the ConfigMaps into the OPA container's API?
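
One workaround along the lines of the "dummy change" mentioned above is to force kube-mgmt's update handler to fire by touching the policy ConfigMaps. Below is a rough client-go sketch (not a kube-mgmt feature); it assumes `--require-policy-label=true` means the policy ConfigMaps carry the `openpolicyagent.org/policy=rego` label in the opa namespace, and the annotation name is arbitrary:

```go
// touch-policies is a hypothetical one-shot helper, not part of kube-mgmt:
// it bumps an annotation on every policy ConfigMap so that kube-mgmt's
// update handler fires and re-pushes the policies into OPA.
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	ctx := context.Background()

	// Assumes --require-policy-label=true, i.e. policy ConfigMaps are
	// labelled openpolicyagent.org/policy=rego and live in the opa namespace.
	cms, err := client.CoreV1().ConfigMaps("opa").List(ctx, metav1.ListOptions{
		LabelSelector: "openpolicyagent.org/policy=rego",
	})
	if err != nil {
		log.Fatal(err)
	}

	// The annotation name is made up; changing its value bumps the
	// ConfigMap's resource version, which is what triggers the reload.
	patch := []byte(fmt.Sprintf(
		`{"metadata":{"annotations":{"example.com/touched-at":%q}}}`,
		time.Now().Format(time.RFC3339)))

	for _, cm := range cms.Items {
		if _, err := client.CoreV1().ConfigMaps("opa").Patch(
			ctx, cm.Name, types.MergePatchType, patch, metav1.PatchOptions{},
		); err != nil {
			log.Printf("patch %s: %v", cm.Name, err)
		}
	}
}
```

Running something like this (manually, or from a CronJob after an OPA restart) bumps each ConfigMap's resource version, so kube-mgmt reloads the policies just as it does for a real edit.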

alex0z1 commented 1 year ago

Maybe changing this to a non-zero value: https://github.com/open-policy-agent/kube-mgmt/blob/8.0.1/pkg/configmap/configmap.go#L151

and, in a new else branch here https://github.com/open-policy-agent/kube-mgmt/blob/8.0.1/pkg/configmap/configmap.go#L175-L182 (needed because the ConfigMap's resource version does not change when OnUpdate is called by NewInformer on a resync), implementing a check that retrieves the policies from OPA and, if the result is empty, calls https://github.com/open-policy-agent/kube-mgmt/blob/8.0.1/pkg/configmap/configmap.go#L202
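
For illustration, a minimal sketch (not kube-mgmt's actual code) of what a non-zero resync period on the client-go informer would look like; with a resync period set, client-go re-delivers OnUpdate for every cached ConfigMap on each interval even when its resource version is unchanged, which would give the handler a chance to re-push policies after an OPA restart:

```go
// Sketch only: roughly what a non-zero resync period would look like on the
// informer that watches policy ConfigMaps. None of these names come from
// kube-mgmt itself.
package sketch

import (
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/fields"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

func newConfigMapInformer(client kubernetes.Interface, namespace string) (cache.Store, cache.Controller) {
	lw := cache.NewListWatchFromClient(
		client.CoreV1().RESTClient(), "configmaps", namespace, fields.Everything())

	// With 60*time.Second instead of 0, client-go re-delivers OnUpdate for
	// every cached ConfigMap once per minute, even when the resource version
	// has not changed.
	return cache.NewInformer(lw, &corev1.ConfigMap{}, 60*time.Second,
		cache.ResourceEventHandlerFuncs{
			UpdateFunc: func(oldObj, newObj interface{}) {
				cm := newObj.(*corev1.ConfigMap)
				// On a resync (old and new resource versions are equal) the
				// handler could ask OPA whether the policy is still loaded
				// and re-PUT it if it is missing, instead of returning early.
				ensureLoaded(cm)
			},
		})
}

// ensureLoaded is a placeholder for "GET the policy from OPA and re-load it
// on a 404 / empty result".
func ensureLoaded(cm *corev1.ConfigMap) {}
```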

tehlers320 commented 1 year ago

We see a similar issue too, and I did notice that the previous version we ran had a 60 here: https://github.com/open-policy-agent/kube-mgmt/compare/v0.12.1...8.0.0#diff-6aa7780e80409d3ad0fb397be31e6f2d64ab520750d4317267f7138ebcee6606L146

mvaalexp commented 1 year ago

I think these 2 might be the same problem: https://github.com/open-policy-agent/kube-mgmt/issues/194

I think it's broken even with 1 replica, because when a rollout happens it brings up a new pod and the listener triggers on the existing pod.

The scenario:

- the current deployment's pod 1 is healthy
- a new release rolls out; pod 2 comes up, fails, and the annotation is updated
- the existing pod 1's listener triggers; it is already fine, so it marks it as ok
- pod 2's listener triggers again, thinks it is ok, and doesn't load the rule

eshepelyuk commented 1 year ago

Folks, if anyone is willing to work on this, I have some ideas on how to approach the issue.

alex0z1 commented 1 year ago

I realized that caches need to be reloaded in addition to policies, so it is more complicated than I thought.

Maybe adding a liveness-probe container to the pod could work: point kube-mgmt's liveness probe at the liveness container's health endpoint, and if the opa container has no policies, the liveness container reports failure and kube-mgmt restarts.

Similar to this https://github.com/kubernetes-csi/livenessprobe
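
A rough sketch of such a sidecar, assuming OPA's documented GET /v1/policies endpoint on the plain-HTTP listener from the deployment above and a made-up listen port; kube-mgmt's liveness probe would point at /healthz so the kubelet restarts it whenever OPA's policy set is empty:

```go
// Sketch of a liveness sidecar in the spirit of kubernetes-csi/livenessprobe:
// /healthz answers 200 only while OPA reports at least one loaded policy.
// The listen port and the unauthenticated call are assumptions for
// illustration; this is not an existing kube-mgmt feature.
package main

import (
	"encoding/json"
	"log"
	"net/http"
	"time"
)

// policiesResponse mirrors the shape of OPA's GET /v1/policies response,
// whose "result" field lists the loaded policy modules.
type policiesResponse struct {
	Result []json.RawMessage `json:"result"`
}

func opaHasPolicies(client *http.Client, opaURL string) bool {
	resp, err := client.Get(opaURL + "/v1/policies")
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return false
	}
	var body policiesResponse
	if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
		return false
	}
	return len(body.Result) > 0
}

func main() {
	client := &http.Client{Timeout: 2 * time.Second}
	// The plain-HTTP listener from the deployment above. With
	// --authorization=basic the request may also need the Bearer token that
	// authz.rego expects.
	const opaURL = "http://127.0.0.1:8181"

	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		if opaHasPolicies(client, opaURL) {
			w.WriteHeader(http.StatusOK)
			return
		}
		http.Error(w, "opa has no policies loaded", http.StatusServiceUnavailable)
	})
	log.Fatal(http.ListenAndServe(":9808", nil))
}
```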

eshepelyuk commented 1 year ago

#210 and #211 can be implemented to address the bug.