hashicorp / vault-secrets-operator

The Vault Secrets Operator (VSO) allows Pods to consume Vault secrets natively from Kubernetes Secrets.
https://hashicorp.com

VSO is getting OOMKilled on OpenShift cluster #973

Open erzhan46 opened 1 day ago

erzhan46 commented 1 day ago

Describe the bug

VSO recently started getting OOMKilled on one of our OpenShift clusters (v4.14.37). Increasing the memory limit to 2Gi and trying to put the VSO pod into the Guaranteed QoS class didn't help. There are several other OpenShift clusters where VSO runs just fine with the default resource specs.

To Reproduce

  1. Deployed VSO using the standard Helm chart, increasing the memory limit to 2Gi.
  2. Tried to set the VSO pod to Guaranteed QoS by setting resource specs for the manager and kube-rbac-proxy containers.
  3. In both cases, VSO gets OOMKilled on one cluster, while it runs just fine on several others even with the default resource specs.
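
For reference, Kubernetes only assigns the Guaranteed QoS class when every container in the pod has CPU and memory limits with requests equal to them, so both the manager and kube-rbac-proxy containers have to be set. Below is a minimal sketch of that rule. The `qos_class` helper and the sample resource values are hypothetical, quantities are compared as plain strings (so `1Gi` and `1024Mi` would not match), and only `cpu` and `memory` are considered.

```python
def qos_class(containers):
    """Classify a pod from a list of per-container 'resources' dicts (simplified)."""
    any_set = False
    guaranteed = True
    for res in containers:
        limits = res.get("limits", {})
        # Kubernetes defaults a missing request to the limit for that resource.
        requests = {**limits, **res.get("requests", {})}
        if limits or requests:
            any_set = True
        for name in ("cpu", "memory"):
            # Guaranteed needs a limit for both resources, with request == limit.
            if name not in limits or requests.get(name) != limits[name]:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


# Hypothetical values: manager pinned at 2Gi, proxy left with unequal requests.
manager = {"limits": {"cpu": "500m", "memory": "2Gi"},
           "requests": {"cpu": "500m", "memory": "2Gi"}}
proxy = {"limits": {"cpu": "500m", "memory": "128Mi"},
         "requests": {"cpu": "5m", "memory": "64Mi"}}

print(qos_class([manager, proxy]))    # Burstable
print(qos_class([manager, manager]))  # Guaranteed
```

The class actually assigned can be read from the pod's `status.qosClass` field.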

Expected behavior

VSO should run using the default resource specs.

Environment

Additional context

This seems to be the same issue experienced by others recently.

erzhan46 commented 7 hours ago

This seems to be related to AppRole authentication failures. VSO eventually came up, spiking to 2G on startup, and is now using 1.2G. It currently logs 'invalid role or secret' errors.

tvoran commented 3 hours ago

Hi @erzhan46, that level of memory usage is unexpected. Are the AppRole authentication failures expected, and unique to this cluster? How many and what kind of secrets are being synced? Are there other auth methods besides AppRole in use?

erzhan46 commented 2 hours ago

Hi @tvoran

We fixed the issue with AppRole authentication, but the memory problem still persists. VSO gets OOMKilled several times on startup before starting successfully. Memory metrics show VSO spiking to about 2G and then running consistently at 1G. One thing I noticed is the following VSO logs on that cluster. As you can see, '"Objects listed" error: 33246ms' is reported, probably related to 'SecretTransformation' processing. On the other clusters where VSO runs fine, this error is not present.

{"level":"info","ts":"2024-11-21T16:34:02Z","msg":"Starting EventSource","controller":"secrettransformation","controllerGroup":"secrets.hashicorp.com","controllerKind":"SecretTransformation","source":"kind source: *v1beta1.SecretTransformation"}
{"level":"info","ts":"2024-11-21T16:34:02Z","msg":"Starting Controller","controller":"secrettransformation","controllerGroup":"secrets.hashicorp.com","controllerKind":"SecretTransformation"}
I1121 16:34:35.727589 1 trace.go:236] Trace[1704102856]: "Reflector ListAndWatch" name:pkg/mod/k8s.io/client-go@v0.30.1/tools/cache/reflector.go:232 (21-Nov-2024 16:34:02.275) (total time: 33451ms):
Trace[1704102856]: ---"Objects listed" error: 33246ms (16:34:35.522)
Trace[1704102856]: [33.451708284s] [33.451708284s] END
{"level":"info","ts":"2024-11-21T16:34:37Z","msg":"Starting workers","controller":"secrettransformation","controllerGroup":"secrets.hashicorp.com","controllerKind":"SecretTransformation","worker count":1}
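
To compare clusters, it may help to pull the slow trace headers out of the mixed JSON and klog output. Below is a minimal sketch, assuming the trace header format matches the line above. The `slow_traces` helper and its 10-second threshold are hypothetical choices, not part of VSO or client-go.

```python
import re

# Matches trace header lines of the form:
#   Trace[<id>]: "<name>" ... (total time: <N>ms)
TRACE_RE = re.compile(r'Trace\[(\d+)\]: "([^"]+)".*?\(total time: (\d+)ms\)')


def slow_traces(log_text, threshold_ms=10000):
    """Return (trace_id, name, total_ms) for traces at or above threshold_ms."""
    found = []
    for match in TRACE_RE.finditer(log_text):
        total_ms = int(match.group(3))
        if total_ms >= threshold_ms:
            found.append((match.group(1), match.group(2), total_ms))
    return found


# Sample taken from the log excerpt above.
sample = (
    'I1121 16:34:35.727589 1 trace.go:236] Trace[1704102856]: '
    '"Reflector ListAndWatch" '
    'name:pkg/mod/k8s.io/client-go@v0.30.1/tools/cache/reflector.go:232 '
    '(21-Nov-2024 16:34:02.275) (total time: 33451ms):\n'
    'Trace[1704102856]: [33.451708284s] [33.451708284s] END\n'
)

print(slow_traces(sample))
```

Running the same filter over logs from a healthy cluster would show whether the 33-second list is specific to this one.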

erzhan46 commented 2 hours ago

There are just a few StaticSecrets synced, and a couple of SecretTransformations. We cannot use authentication methods other than AppRole because of an issue with private domain name resolution in Vault instances deployed in HCP.