jetstack / version-checker

Kubernetes utility for exposing image versions in use, compared to latest available upstream, as metrics.

version-checker seemingly leaks memory and gets oom-killed #76

Open roobre opened 3 years ago

roobre commented 3 years ago

I am running version-checker on a single node, quite small cluster with ~60 pods. So far it is working nicely, but I do not understand the memory behavior it has.

I'm basically running the sample deployment file, plus the --test-all-containers flag and some CPU and memory limits:

        resources:
          requests:
            cpu: 10m
            memory: 32M
          limits:
            cpu: 50m
            memory: 128M

(full kubectl get pod -o yaml output omitted)
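For illustration, the extra flag sits in the container args, something like this (the image tag here is a placeholder, not the exact one from my manifest):

    containers:
      - name: version-checker
        image: quay.io/jetstack/version-checker:v0.x.x  # placeholder tag
        args:
          - --test-all-containers  # check every container, not only annotated ones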

Over time, I see that version-checker approaches the memory limit and then stays near ~99% of it for a while. After some time, the kernel kills the container due to OOM and k8s restarts the pod.

(memory chart)

However, I do not see anything alarming in the logs, other than some failures and expected permission errors.

This doesn't seem to have any functional impact, but does fire some alerts and doesn't look good on my dashboards :)

Is this behavior intended, and/or is there any way to prevent it?

trastle commented 3 years ago

We are seeing similar behaviour while running Version Checker. We'd be interested to know whether there are recommended values for the limits.

Trede1983 commented 3 years ago

Also seeing something similar with Version Checker getting OOM killed fairly frequently.

davidcollom commented 3 months ago

Hey @Trede1983 @trastle @roobre,

Sorry it's taken so long to get back to you on this issue... There have been a number of changes to version-checker since these issues were raised, aimed at reducing the memory footprint.

Things like this are extremely challenging to debug and replicate, so it would be amazing to know how many nodes/pods you had in the cluster at the time of this issue, along with the memory/CPU limits/requests you had/have set.

I appreciate that this may have been some time ago, and that you may no longer be using version-checker; however, this information could be really helpful for us to further understand the memory footprint in larger installations.

In terms of tuning and changes, the main one that comes to mind is #160, along with the already-mentioned #69.

erwanval commented 2 months ago

Hello @davidcollom

I'm also encountering this issue. My test cluster is pretty small.

The --test-all-containers flag is set, and only two pods have the enable.version-checker.io/*my-container*: false annotation to disable verification (they come from a private registry I haven't configured yet). I also defined use-sha.version-checker.io, match-regex.version-checker.io and override-url.version-checker.io on a bunch of pods, as some images come from a registry proxy or have "fake" versions (like grafana).
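For context, a rough sketch of how those annotations sit on a pod (container names and values here are illustrative, not copied from my manifests):

    metadata:
      annotations:
        enable.version-checker.io/my-container: "false"
        use-sha.version-checker.io/other-container: "true"
        match-regex.version-checker.io/other-container: '^v\d+\.\d+\.\d+$'
        override-url.version-checker.io/other-container: registry.example.com/grafana/grafana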

Version checker is the latest (0.7.0), installed using Helm with the following values:

    replicaCount: 1
    versionChecker:
      imageCacheTimeout: 30m
      testAllContainers: true

    resources:
      # limits:
      #   memory: 128Mi
      requests:
        cpu: 10m
        memory: 128Mi

    # This is a temporary fix until the following PR is merged:
    # https://github.com/jetstack/version-checker/pull/227
    ghcr:
      token: xxxx

    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
          - ALL
      readOnlyRootFilesystem: true
      runAsNonRoot: true
      runAsUser: 65534
      seccompProfile:
        type: RuntimeDefault

    serviceMonitor:
      enabled: true
If I set resources.limits.memory, version-checker is OOMKilled every ~6h. I haven't tried running it for more than a day without the limit, but I assume it would keep growing. Here is a graph showing the memory usage over time:

(memory usage graph)

erwanval commented 1 month ago

Hello,

With version 0.8.2, the issue still persists. I tried adding the following to the values:

    env:
      - name: GOMEMLIMIT
        valueFrom:
          resourceFieldRef:
            divisor: "0"
            resource: limits.memory

It reduces the frequency of OOMKills to about one per day instead of every ~6h, but doesn't solve the issue.
(memory usage graph)
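For what it's worth, GOMEMLIMIT is a soft limit on the memory managed by the Go runtime, so memory outside the Go heap (goroutine stacks, cgo allocations, OS buffers) isn't covered by it; one common workaround is to pin it somewhat below the container limit rather than equal to it. A minimal sketch, with a hypothetical value chosen for a 128Mi limit:

    env:
      - name: GOMEMLIMIT
        value: "100MiB"  # ~80% of a 128Mi container limit; illustrative value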