hashicorp / vault-secrets-operator

The Vault Secrets Operator (VSO) allows Pods to consume Vault secrets natively from Kubernetes Secrets.

Vault Secrets Operator Manager being OOM killed on busy OpenShift cluster #949

Open kennedn opened 1 month ago

kennedn commented 1 month ago

Describe the bug
We are currently using the Vault Secrets Operator in our clusters. One specific cluster gets more customer volume than the others, and we have recently noticed that its vault-secrets-operator-manager pod is being OOM killed after reaching the memory limits outlined in the operator's CSV (ClusterServiceVersion).

Snippet from the .status key of the OOM-killed pod's YAML:

status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2024-10-09T10:54:46Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2024-10-09T13:24:28Z"
    message: 'containers with unready status: [manager]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2024-10-09T13:24:28Z"
    message: 'containers with unready status: [manager]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2024-10-09T10:54:46Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: cri-o://6ce4d44fa30e22966dbb10cc3ae1dc0df05daf5ea76942d225300b2c9fc2b982
    image: registry.redhat.io/openshift4/ose-kube-rbac-proxy@sha256:8ae1e417a40fb2df575e170128267a4399f56b6bac6db8b30c5b5e2698d0e6f2
    imageID: registry.redhat.io/openshift4/ose-kube-rbac-proxy@sha256:34402817de5c30fb0a2ae0055abce343bd9f84d37ad6cd4dd62820a54aeabfef
    lastState: {}
    name: kube-rbac-proxy
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2024-10-09T10:55:38Z"
  - containerID: cri-o://d80dfb0ca666279c66e96062bace1353ec58ea4ebc4285ba9d7bd96b3ca2ef2f
    image: registry.connect.redhat.com/hashicorp/vault-secrets-operator@sha256:78761669829d1a70474b8e30981031138f2fcfcb0ef8f372f26f55e0955839fa
    imageID: registry.connect.redhat.com/hashicorp/vault-secrets-operator@sha256:78761669829d1a70474b8e30981031138f2fcfcb0ef8f372f26f55e0955839fa
    lastState:
      terminated:
        containerID: cri-o://d80dfb0ca666279c66e96062bace1353ec58ea4ebc4285ba9d7bd96b3ca2ef2f
        exitCode: 137
        finishedAt: "2024-10-09T13:24:27Z"
        reason: OOMKilled
        startedAt: "2024-10-09T13:24:10Z"

To Reproduce
Steps to reproduce the behavior:

  1. Deploy the Vault Secrets Operator in OpenShift.
  2. Make heavy use of the operator (we currently have 361 static secrets being synced via the operator in this cluster; a representative secret is sketched below).
  3. The vault-secrets-operator-manager pod begins to enter a crash loop; the pod's YAML indicates the reason is OOMKilled.
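
For illustration only, each of those synced secrets is a VaultStaticSecret custom resource roughly along these lines (the names, namespace, mount, path, and auth reference here are placeholders, not our actual configuration):

apiVersion: secrets.hashicorp.com/v1beta1
kind: VaultStaticSecret
metadata:
  name: example-app-config          # placeholder name
  namespace: example-app            # placeholder namespace
spec:
  vaultAuthRef: example-vault-auth  # placeholder reference to a VaultAuth resource
  mount: kvv2                       # placeholder KV v2 mount
  type: kv-v2
  path: example-app/config          # placeholder secret path in Vault
  refreshAfter: 60s
  destination:
    name: example-app-config        # Kubernetes Secret the operator creates/syncs
    create: true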

Application deployment: N/A

Expected behavior
The operator's CSV has enough headroom in its memory limits to avoid out-of-memory kills of the pod.

Environment

  - Kubernetes distribution: OpenShift (operator installed via OperatorHub)
  - Vault Secrets Operator version: v0.5.1 (per the discussion below)

Additional context
We have been able to temporarily work around this issue by manually doubling the limits.memory value for the manager container in the CSV (from 256Mi to 512Mi), at the key .spec.install.spec.deployments[].spec.template.spec.containers[]:

- args:
    - --health-probe-bind-address=:8081
    - --metrics-bind-address=127.0.0.1:8080
    - --leader-elect
  command:
    - /vault-secrets-operator
  env:
    - name: OPERATOR_POD_NAME
      valueFrom:
        fieldRef:
          fieldPath: metadata.name
    - name: OPERATOR_POD_UID
      valueFrom:
        fieldRef:
          fieldPath: metadata.uid
  image: registry.connect.redhat.com/hashicorp/vault-secrets-operator@sha256:78761669829d1a70474b8e30981031138f2fcfcb0ef8f372f26f55e0955839fa
  imagePullPolicy: IfNotPresent
  livenessProbe:
    httpGet:
      path: /healthz
      port: 8081
    initialDelaySeconds: 15
    periodSeconds: 20
  name: manager
  readinessProbe:
    httpGet:
      path: /readyz
      port: 8081
    initialDelaySeconds: 5
    periodSeconds: 10
  resources:
    limits:
      cpu: 500m
      memory: 512Mi
    requests:
      cpu: 10m
      memory: 128Mi
  securityContext:
    allowPrivilegeEscalation: false
  volumeMounts:
    - mountPath: /var/run/podinfo
      name: podinfo

This is not a permanent fix, though, since re-installing or upgrading the operator will reinstate the original memory value. We are installing via OperatorHub in OpenShift, so we do not have a way to permanently set this value.

tvoran commented 1 month ago

Hi @kennedn, if you set the resource limits for your install in a config stanza in the Subscription for your VSO install, they should survive upgrades. Something like this:

spec:
  config:
    resources:
      requests:
        memory: "128Mi"
        cpu: "10m"
      limits:
        memory: "512Mi"
        cpu: "500m"

(ref: https://github.com/operator-framework/operator-lifecycle-manager/blob/master/doc/design/subscription-config.md#resources)
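
For completeness, that config stanza sits under spec of the Subscription object that OperatorHub creates for the install. A minimal sketch of the full object is below; the namespace, channel, and source values are illustrative and will differ per cluster:

apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: vault-secrets-operator
  namespace: openshift-operators       # namespace of the install; illustrative
spec:
  channel: stable                      # illustrative channel
  name: vault-secrets-operator
  source: certified-operators          # illustrative catalog source
  sourceNamespace: openshift-marketplace
  config:
    resources:
      requests:
        memory: "128Mi"
        cpu: "10m"
      limits:
        memory: "512Mi"
        cpu: "500m"

Per the linked OLM doc, resources set under spec.config are applied to the containers of the operator's deployment and are reapplied by OLM on upgrades, which is why they survive reinstalls.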

kennedn commented 1 month ago

Hi,

We've solved our issue for the moment by utilizing the config section of the VSO Subscription object as described; we have increased the default memory limit from 256Mi to 512Mi. Thanks for the tip.

Is there any appetite to increase the default memory limit shipped with the operator / chart? I have noticed a few other instances of similar OOMKilled issues raised against this project to date, so it may warrant some thought.

Thanks,
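
(For non-OLM installs that use the Helm chart directly, the same override would presumably be set through the chart's values rather than a Subscription. A rough sketch is below; the controller.manager.resources key path is assumed from the chart's values.yaml and is worth verifying against the chart version in use.)

# Helm values override for chart-based installs; key path assumed, verify against
# the vault-secrets-operator chart's values.yaml for your version.
controller:
  manager:
    resources:
      requests:
        memory: "128Mi"
        cpu: "10m"
      limits:
        memory: "512Mi"
        cpu: "500m"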

tvoran commented 9 hours ago

Good to hear that helped the issue, at least for now. We may want to increase the default limit, though we're still investigating why memory usage spikes for some users. Are only VaultStaticSecrets in use in your case? What auth methods are being used with these secrets? Are there any errors in the VSO logs? Are there any other differences in workload between the cluster with high memory usage and the clusters without it?

Also v0.5.1 is fairly old at this point, so it would be interesting to see if there's any change in memory usage with a more recent version.